[PATCH] D69732: [WIP][LTO] Apply SamplePGO pipeline tunes for ThinLTO pre-link to full LTO

Tue Nov 5 06:28:40 PST 2019

tejohnson marked an inline comment as done.
tejohnson added a comment.

In D69732#1733447 <https://reviews.llvm.org/D69732#1733447>, @wristow wrote:

> > This probably needs to be taken over by someone who cares about full LTO performance
>
> We at PlayStation are definitely interested in full LTO performance, so we're looking into this.  We certainly agree with the rationale that if suppressing some optimizations is useful to allow better SamplePGO matching, then we'd expect that would apply equally to both ThinLTO and full LTO.
>
> I guess much of this comes down to a balancing act between:
>
> 1. The amount of the runtime benefit with Sample PGO if these loop optimizations are deferred to the full LTO back-end (like they are for ThinLTO).
> 2. The cost in compile-time resources in the full LTO back-end to do these loop optimizations at that later stage.
>
>   From the discussion here, the Sample PGO runtime win (point 1) seems more or less to be a given.  If we find the compile-time cost in the full LTO back-end (point 2) is not significant, then the decision should be easy.  So after seeing this patch, we're doing some experiments to at least try to get a handle on this.  (I'm a bit concerned we won't be able to draw any hard conclusions from the results of our experiments, but at least we'll be able to make a better informed assessment.)
>
>   FTR, for PlayStation, we're using the old PM.  But we'll do some experiments for both the old and new PM, to get a sense of the answers to the (old PM) `LoopUnrollAndJam` point, and the (new PM) FIXME comment.

This is a good summary. I look forward to your results.

================
Comment at: llvm/lib/Transforms/IPO/PassManagerBuilder.cpp:614
   // during ThinLTO and perform the rest of the optimizations afterward.
   if (PrepareForThinLTO) {
     // Ensure we perform any last passes, but do so before renaming anonymous
----------------
wristow wrote:
> wenlei wrote:
> > this also need to be `PrepareForThinLTO || PrepareForLTO` for oldPM?
> I agree this is another instance where a balancing act question applies.  In this case, assuming the comment about the concern of code bloat is accurate, it's not so much about compile-time resources in the full LTO back-end, but rather about minimizing the ThinLTO bitcode write/read time.  So if as this WIP evolves, it ultimately is a win for SamplePGO to suppress some loop optimizations (unrolling/vectorization) here, then that will probably also be a //small // win in full LTO compile time.  That said, in addition to these loop-related optimizations, there are other transformations here that are done in the full LTO pipeline (but not in the ThinLTO pipeline).  So I suspect if some change to check for `PrepareForThinLTO || PrepareForLTO` (rather than only `PrepareForThinLTO`) makes sense here from a Sample PGO perspective, then the change will be more complicated than simply adding the small set of passes here followed by the early return (that is, I think there are probably things after the `return` on line 621 that still ought to be enabled for full LTO -- essentially continuing to do them in the pre-link stage for full LTO, to try to avoid needing to do too much work in the full LTO backend stage, since it's more of a problem for the full backend to absorb that compile time cost).
This early return was not for Sample PGO btw. It was added much earlier with the thought that a) these types of optimizations might affect function importing heuristics because they could bloat the code; b) we can push more optimizations to the post-link in ThinLTO because it is parallel; and c) there isn't otherwise a benefit to doing these optimizations in the pre vs post link, i.e. they aren't cleanup/simplification passes.  The equation is of course different for full LTO which has a monolithic serial post link backend. But I believe this early return is the one @ormris is looking to remove on the ThinLTO pass to "merge" the two pipelines, which needs a good amount of evaluation on the ThinLTO performance side.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D69732/new/

https://reviews.llvm.org/D69732