[clang] [llvm] [CGData][ThinLTO] Global Outlining with Two-CodeGen Rounds (PR #90933)

Mon Sep 30 04:29:52 PDT 2024

kyulee-com wrote:

> * Looking at the NFC, this seems like it has very similar issues to Propeller, which wants to redo just the codegen with a new injected profile and BB ordering. It would be good to see if we can converge to similar approaches. I asked @rlavaee to take a look and he is reading through the background on this work. @rlavaee do you think Propeller could use a similar approach to this where it saves the pre-codegen bitcode and re-loads it instead of redoing opt? This isn't necessarily an action item for this PR, but I wanted Rahman to take a look since he is more familiar with codegen.

It's interesting to know that Propeller wants to redo the codegen. I'm happy to align with this work. We've already started discussing this and have shared some details from our side. Here's the link for more info: https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-2-thinlto-nolto/78753/11?u=kyulee-com.

> * I think this should be doable to implement with distributed ThinLTO if we want in the future, from what I can tell. But we'll need a way to save the bitcode before codegen from a non-in-process backend (e.g. the thinBackend invoked from clang). Not asking you to do anything for this PR on that, but just thinking through it. Seems doable...

I think technically it's doable, but initially, I believed we did not want to repeat the code generation with distributed ThinLTO unless we were willing to introduce additional synchronization and spawn distributed ThinLTO backends again with the merged codegen data. If we aim to leverage the saved optimized IR, I suppose we need to assign the same backend work to the same machine used in the first run.
As commented above in the link, I wanted to confine repeating the codegen is for a single machine (In-process backend). I mainly wanted to separate the writer/reader builds for the use of distributed ThinLTO builds  to avoid this complication while allowing some stale codege data. However, it's certainly worth experimenting with.

> * My biggest concern about this patch as written is whether it will break down under LTO's caching mechanism - have you tested it with caching? It seems like it might just skip the backend completely since you aren't adding anything to the cache key.

That's a good catch! Assuming this is not a common scenario (as mentioned in the link above RFC), my initial design intentionally disables LTO's caching for correctness and simplicity by setting an empty `FileCache()` in the constructor for both the first and second round backends. To enable proper caching, we need two additional caches/streams, in addition to one for the final object output: one for the scratch object files and another for the optimized IR files. I've managed to implement this with some refactoring, details of which I will follow.

https://github.com/llvm/llvm-project/pull/90933