[PATCH] D146776: [llvm] Preliminary fat-lto-objects support

Teresa Johnson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Apr 27 08:30:27 PDT 2023


tejohnson added a comment.

In D146776#4301107 <https://reviews.llvm.org/D146776#4301107>, @aeubanks wrote:

> For your use case, is it OK for performance to not be 100% if you go down the non-LTO route? Presumably, if you're not using LTO, you aren't trying to squeeze out 100% performance. For example, in Chrome-land we've talked about this sort of thing for tests, where we don't really care whether performance is optimal, so an approximation of -O2 (or whatever) is good enough. If we solidify this as a tradeoff and don't promise 100% compatibility with the default optimization pipeline, I'm much more comfortable with that, as long as people don't complain if FatLTO's non-LTO performance ever slightly degrades.

I agree that ensuring full performance for FatLTO's non-LTO native object isn't terribly important. My concern is whether this results in unexpected behavior where certain passes don't get run at all - e.g. ICP won't run at all if you are doing SamplePGO in this configuration.
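For context, the SamplePGO handling in PassBuilder::buildModuleSimplificationPipeline gates ICP on the LTO phase, roughly like this (paraphrased from memory, not the exact code):

  // Rough paraphrase of the SamplePGO path in buildModuleSimplificationPipeline.
  if (LoadSampleProfile) {
    MPM.addPass(SampleProfileLoaderPass(PGOOpt->ProfileFile,
                                        PGOOpt->ProfileRemappingFile, Phase));
    // ICP is skipped in the (Thin)LTO pre-link phases and deferred to the
    // post-link phase, so a FatLTO pipeline that only ever runs a pre-link
    // phase plus a phase-None pipeline would never run it.
    if (!isLTOPreLink(Phase))
      MPM.addPass(PGOIndirectCallPromotion(/*IsInLTO=*/true,
                                           /*SamplePGO=*/true));
  }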

In D146776#4301320 <https://reviews.llvm.org/D146776#4301320>, @nikic wrote:

> Right, I think we basically have two options on how to do this:
>
> 1. Make fat LTO exactly equivalent to LTO bitcode + default-optimized object code. This would require cloning the input module, running the LTO pipeline on one module, embedding the bitcode in the other and then running the default pipeline in that one. This will robustly match the LTO/default behavior exactly, at the cost of compile-time.
> 2. Make fat LTO approximately equivalent to LTO bitcode + default-optimized object code, with the usual cost-benefit tradeoffs. In this case, I don't think that rerunning module simplification after embedding bitcode makes much sense, because the differences between the pre-link ThinLTO pipeline and the default pipeline are too minor.
>
> I believe the differences are:
>
> - Various bits related to SamplePGO.
> - The (currently disabled) MemProfiler pass. This looks like a plain bug to me: What is an instrumentation pass doing in the simplification pipeline? I'm pretty sure this should be in the optimization pipeline instead. Even if not, we could easily explicitly run this one, because it's at the very end of the simplification pipeline.

I added this. As with the other profile instrumentation passes, it is off by default. I think I added it here because this is also where we add the PGO instrumentation passes, although I suppose there are specific reasons for doing those here (i.e. inlining is also invoked here). I can move it out so this becomes a non-issue.
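Concretely, moving it out would mean hoisting this (off-by-default) block out of buildModuleSimplificationPipeline and into buildModuleOptimizationPipeline, roughly (paraphrasing the current code from memory):

  // Currently near the end of buildModuleSimplificationPipeline; the
  // suggestion is to schedule it in buildModuleOptimizationPipeline instead.
  if (EnableMemProfiler && Phase != ThinOrFullLTOPhase::ThinLTOPreLink) {
    MPM.addPass(createModuleToFunctionPassAdaptor(MemProfilerPass()));
    MPM.addPass(ModuleMemProfilerPass());
  }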

> - A single GlobalDCE pass at the end of the pipeline. I'm pretty sure this is another bug, because the comment says "Remove dead code, except in the ThinLTO pre-link pipeline where we may want to keep available_externally functions", even though available_externally functions only become relevant post-link, not pre-link.

ThinLTO importing is not the only producer of available_externally functions. There can be available_externally functions coming out of clang, and we want those to survive until after importing and post-LTO-link inlining are done.
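(If I remember correctly, one way clang produces these is for explicit template instantiation declarations when optimizing; illustrative example:)

  // widget.h
  template <typename T> struct Widget {
    int get() const { return 42; }  // implicitly inline
  };
  // An explicit instantiation declaration tells clang the instantiation is
  // provided in another TU; at -O2 the inline member bodies can still be
  // emitted as available_externally definitions so the inliner can see them,
  // without emitting code for them in this object.
  extern template struct Widget<int>;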

> - Loop rotation will not duplicate calls in headers pre-link.

I agree this one seems minor.

> Ignoring the bugs, the only substantial differences seem to be related to SamplePGO, which I'm not familiar with. I'm not clear on how SamplePGO relates to fat LTO usage scenarios.

As mentioned earlier, I think the question is whether it will be a surprise, when using SamplePGO, that certain passes simply don't get run for the non-LTO native objects. I don't know enough about when those objects will get used to know whether this is a configuration anyone will care about. If it is, I suppose one approach would be to invoke the longer pipelines only under SamplePGO. Either way, this probably needs some clear documentation.
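For reference, option 1 above would amount to roughly the following in the FatLTO pipeline (untested sketch; exact plumbing of the PassBuilder and analysis managers omitted):

  // Untested sketch of option 1: clone the module, run the (Thin)LTO pre-link
  // pipeline on the clone to produce the embedded bitcode, then run the
  // regular default pipeline on the original module for the native object.
  std::unique_ptr<Module> LTOModule = CloneModule(M);

  ModulePassManager PreLinkMPM = PB.buildThinLTOPreLinkDefaultPipeline(Level);
  PreLinkMPM.run(*LTOModule, MAM);

  // Serialize the pre-link-optimized clone and embed it in the original
  // module (e.g. via llvm::embedBitcodeInModule), then optimize the original
  // with the default (non-LTO) pipeline, unchanged.
  ModulePassManager DefaultMPM = PB.buildPerModuleDefaultPipeline(Level);
  DefaultMPM.run(M, MAM);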



================
Comment at: llvm/lib/Passes/PassBuilderPipelines.cpp:1489
+      buildModuleSimplificationPipeline(Level, ThinOrFullLTOPhase::None));
+  MPM.addPass(buildModuleOptimizationPipeline(Level, ThinOrFullLTOPhase::None,
+                                              /*IsFatLto*/ true));
----------------
Why pass phase None into these calls to the module simplification and optimization pipelines, instead of the corresponding post-LTO-link phase? I would think we want the complement of the phase provided earlier. I'm pretty sure some passes are configured to run in exactly one of these phases, and we may end up duplicating them if we run a pre-LTO-link pipeline followed by a default (non-LTO) phase pipeline.
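i.e. something along these lines (sketch, assuming the earlier FatLTO step ran the corresponding pre-link phase):

  MPM.addPass(buildModuleSimplificationPipeline(
      Level, ThinOrFullLTOPhase::ThinLTOPostLink));
  MPM.addPass(buildModuleOptimizationPipeline(
      Level, ThinOrFullLTOPhase::ThinLTOPostLink, /*IsFatLto*/ true));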


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D146776/new/

https://reviews.llvm.org/D146776


