[clang] [llvm] [CGData][ThinLTO] Global Outlining with Two-CodeGen Rounds (PR #90933)

Teresa Johnson via cfe-commits cfe-commits at lists.llvm.org
Mon Sep 30 09:23:05 PDT 2024


teresajohnson wrote:

> > * Looking at the NFC, this seems like it has very similar issues to Propeller, which wants to redo just the codegen with a new injected profile and BB ordering. It would be good to see if we can converge to similar approaches. I asked @rlavaee to take a look and he is reading through the background on this work. @rlavaee do you think Propeller could use a similar approach to this where it saves the pre-codegen bitcode and re-loads it instead of redoing opt? This isn't necessarily an action item for this PR, but I wanted Rahman to take a look since he is more familiar with codegen.
> 
> It's interesting to know that Propeller wants to redo the codegen. I'm happy to align with this work. We've already started discussing this and have shared some details from our side. Here's the link for more info: https://discourse.llvm.org/t/rfc-enhanced-machine-outliner-part-2-thinlto-nolto/78753/11?u=kyulee-com.

Great, I hadn't looked at the RFC again and didn't realize Rahman had already responded there.

> 
> > * I think this should be doable to implement with distributed ThinLTO if we want in the future, from what I can tell. But we'll need a way to save the bitcode before codegen from a non-in-process backend (e.g. the thinBackend invoked from clang). Not asking you to do anything for this PR on that, but just thinking through it. Seems doable...
> 
> I think technically it's doable, but initially I believed we did not want to repeat the code generation with distributed ThinLTO unless we were willing to introduce additional synchronization and spawn the distributed ThinLTO backends again with the merged codegen data. If we aim to leverage the saved optimized IR, I suppose we need to assign the same backend work to the same machine used in the first run. As commented in the link above, I wanted to confine repeating the codegen to a single machine (in-process backend). I mainly wanted to separate the writer/reader builds for distributed ThinLTO builds to avoid this complication while allowing some stale codegen data. However, it's certainly worth experimenting with.

Agree it makes sense to do the in-process mode support first as that is more straightforward. 

> If we aim to leverage the saved optimized IR, I suppose we need to assign the same backend work to the same machine used in the first run.

Not necessarily. The IR just needs to be saved in a path that the build system specifies and can therefore move around.
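To illustrate the point that the saved IR is tied to a path rather than a machine, here is a conceptual sketch in Python (not LLVM's actual implementation; the function names and the fake "bitcode" payload are hypothetical): the first round writes post-opt IR to a build-system-chosen location, and the second round only needs read access to that location to redo codegen.

```python
import os
import tempfile

def first_round(module_id: str, outdir: str) -> str:
    """Hypothetical first round: after running opt, save the optimized IR
    to a path the build system chose, so it can be reused later."""
    ir_path = os.path.join(outdir, f"{module_id}.opt.bc")
    with open(ir_path, "wb") as f:
        # Stand-in payload for real bitcode.
        f.write(b"optimized-ir-for-" + module_id.encode())
    return ir_path

def second_round(ir_path: str) -> bytes:
    """Hypothetical second round: reload the saved IR from the given path
    (on whichever machine can see it) and redo only the codegen step."""
    with open(ir_path, "rb") as f:
        return f.read()

outdir = tempfile.mkdtemp()
path = first_round("mod0", outdir)
ir = second_round(path)
assert ir == b"optimized-ir-for-mod0"
```

The design point is that nothing in the second round depends on where the first round ran, only on the saved artifact being reachable.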

> 
> > * My biggest concern about this patch as written is whether it will break down under LTO's caching mechanism - have you tested it with caching? It seems like it might just skip the backend completely since you aren't adding anything to the cache key.
> 
> That's a good catch! Assuming this is not a common scenario (as mentioned in the RFC linked above), my initial design intentionally disables LTO's caching for correctness and simplicity by setting an empty `FileCache()` in the constructor for both the first- and second-round backends. To enable proper caching, we need two additional caches/streams in addition to the one for the final object output: one for the scratch object files and another for the optimized IR files. I've managed to implement this with some refactoring; details will follow.

Yeah, there are a few caching aspects. Caching the scratch files and the optimized IR reused here is one aspect that will enable more caching and better build performance for this optimization. I was more concerned about whether the existing LTO caching would be broken, but missed that you were previously disabling it. We use distributed ThinLTO, but my understanding is that LTO caching is frequently used for in-process ThinLTO. I'll look at your refactoring and updates as soon as I can.
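To make the cache-key concern concrete, here is a conceptual sketch in Python (not LLVM's actual cache-key computation; the function and field names are hypothetical): if the second round's cache key did not incorporate the merged codegen data, a stale first-round entry could satisfy the lookup and the second codegen would be skipped entirely. Mixing a hash of the merged data into the key keeps the rounds' entries distinct.

```python
import hashlib
from typing import Optional

def cache_key(module_hash: str, round_no: int,
              merged_cgdata: Optional[bytes]) -> str:
    """Hypothetical cache key: mixes in the round number and the merged
    codegen data so a second-round compile can never hit a first-round
    cache entry produced without that data."""
    h = hashlib.sha256()
    h.update(module_hash.encode())
    h.update(round_no.to_bytes(4, "little"))
    if merged_cgdata is not None:
        h.update(hashlib.sha256(merged_cgdata).digest())
    return h.hexdigest()

key1 = cache_key("mod0", 1, None)
key2 = cache_key("mod0", 2, b"merged outlining data")
assert key1 != key2  # the second round is never served from round one's cache
```

The same principle applies whether the key feeds one shared cache or separate per-round caches; what matters is that every input affecting the output is reflected in the key.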



https://github.com/llvm/llvm-project/pull/90933

