[PATCH] D13443: Rework the LTO Pipeline, aligning closer to the O2/O3 pipeline.

Mehdi Amini via llvm-commits llvm-commits at lists.llvm.org
Tue Oct 6 09:21:13 PDT 2015


> On Oct 6, 2015, at 7:05 AM, Teresa Johnson <tejohnson at google.com> wrote:
> 
> Hi Mehdi,
> 
> Thanks for sharing the results. As you note there are swings in both
> directions, but the improvements outweigh the regressions.

Yes, the results swing in both directions, but I'm tracking each regression as a "bug" to be fixed.
Two weeks ago there were even more regressions; I recovered some of them with http://reviews.llvm.org/D13390 (to be committed soon). That was a case where the LTO pipeline was “better” just “by chance”.


> 
> On Mon, Oct 5, 2015 at 5:50 PM, Mehdi AMINI <mehdi.amini at apple.com> wrote:
>> joker.eph added a comment.
>> 
>> Right now my view is that if I get a performance improvement by running the inliner and the "peephole" passes twice, then it is a bug. If it is not a bug, it means the O3 pipeline is affected as well and we might run them twice there too. Does that make sense?
> 
> I wonder if there are aspects of the inliner that work differently
> when run twice vs once. E.g. only 1 level of recursive inlining is
> allowed currently, but running it twice would allow 2 levels of
> recursive inlining. That may not be a big factor, but just an example
> where there is going to be a difference running it twice vs once.

Interesting. I don't know enough about the inliner to have a definitive opinion, but it would seem weird to me if the “optimal” solution were different for LTO than for O3.

> 
> Another factor might be that doing the intermediate peephole
> optimizations (which are currently run after the compile step
> inlining), could be cleaning up the code and reducing some of the
> inlining costs for the LTO round of inlining.

It depends on what you mean by “intermediate peephole”.

From a very high-level point of view, I see the O2/O3 pipeline organized this way:

- minor cleanup (populateFunctionPassManager)
- cleanup + globalopt 
- inlining + peephole (in the same CGSCC PM)
- optimizations

And for LTO what I did is:

- minor cleanup (populateFunctionPassManager)
- cleanup + global opt
# end of compile phase
# start of LTO phase on the linked module
- cleanup + global opt + constantmerge
- inlining + peephole (in the same CGSCC PM)
- globalopt + globaldce + peephole again
- optimizations




> 
> For LTO specifically, I wonder how the peak memory usage is affected
> (e.g. like we were discussing with the bitcode size, it will see some
> larger functions due to the earlier inlining, but also potentially
> fewer or smaller functions if the code has been inlined and cleaned up
> prior).

I’ll try to check that on our compile-time test-suite.

> 
>> 
>> I ran the LLVM benchmark suite + some internals with a return before and after the inliner+peephole phase. Stopping before the inliner during the compile phase ends up with 13 regressions and 20 improvements, compared to running the inliner during the compile phase. I sent you some more details by email.
> 
> Just to clarify on those results - for the "Previous(1)" which is
> stopping after the inlining, are you just removing that early return
> from populateModulePassManager? If so, did you put the call to
> createEliminateAvailableExternallyPass back under the
> if(!PrepareForLTO) guard? There's probably some other stuff like
> unrolling and vectorization that as you note would be
> counterproductive to run prior to LTO.

No, I didn't just remove the return; I moved it after the inliner+peephole CGSCC PM, so that the pipeline becomes:

- minor cleanup (populateFunctionPassManager)
- cleanup + global opt
- inlining + peephole (in the same CGSCC PM)
# end of compile phase
# start of LTO phase on the linked module
- cleanup + global opt + constantmerge
- inlining + peephole (in the same CGSCC PM)
- globalopt + globaldce + peephole again
- optimizations




> 
> Rather than exhaustively find the right combination, a couple of data
> points seem particularly interesting: 1) performance effects of just
> adding the inlining (and none of the other later opts after your early
> return) and exiting right after in the PrepareForLTO case;

I think 1) is what I did just above, right?

> 2)
> performance effects of doing the peephole passes before exiting early
> in the PrepareForLTO case (so you get the code cleanup before the LTO
> inlining that might be affecting its cost analysis).

Can you explain a bit more? I'm not sure I understand what you mean here. The peephole passes run as part of the same CGSCC pass manager as the inliner, precisely so that callees are cleaned up before the inliner processes a caller.
(It is probably suboptimal inside a single SCC by the way, but I don’t see how the current PM or the new one can solve this).

— 
Mehdi

