[PATCH] D13443: Rework the LTO Pipeline, aligning closer to the O2/O3 pipeline.

Mehdi Amini via llvm-commits llvm-commits at lists.llvm.org
Tue Oct 6 11:26:18 PDT 2015


> On Oct 6, 2015, at 10:24 AM, Teresa Johnson <tejohnson at google.com> wrote:
> 
> On Tue, Oct 6, 2015 at 9:21 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>> 
>>> On Oct 6, 2015, at 7:05 AM, Teresa Johnson <tejohnson at google.com> wrote:
>>> 
>>> Hi Mehdi,
>>> 
>>> Thanks for sharing the results. As you note there are swings in both
>>> directions, but the improvements outweigh the regressions.
>> 
>> Yes, the results swing in both directions, but I'm tracking these as "bugs" to be fixed.
>> Two weeks ago there were even more regressions, and I recovered some of them with http://reviews.llvm.org/D13390 (to be committed soon). That was an example where the LTO pipeline was “better” just “by chance”.
>> 
>> 
>>> 
>>> On Mon, Oct 5, 2015 at 5:50 PM, Mehdi AMINI <mehdi.amini at apple.com> wrote:
>>>> joker.eph added a comment.
>>>> 
>>>> Right now my view is that if I get a performance improvement from running the inliner and the "peephole" passes twice, then it is a bug. If it is not a bug, it means the O3 pipeline is affected as well and we might want to run them twice there too. Does that make sense?
>>> 
>>> I wonder if there are aspects of the inliner that work differently
>>> when run twice vs. once. E.g., only one level of recursive inlining is
>>> allowed currently, but running the inliner twice would allow two
>>> levels of recursive inlining. That may not be a big factor, but it is
>>> one example of where running it twice vs. once will make a difference.
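>>> 
>>> As a contrived illustration (mine, not from the review) of the
>>> recursion-depth point:
>>> 
>>>   // With one level of recursive inlining permitted per run, a single
>>>   // inliner run can inline fact(n - 1) into fact once; a second run
>>>   // can inline the call inside that freshly inlined copy as well,
>>>   // giving an effective depth of two.
>>>   int fact(int n) {
>>>     if (n <= 0)
>>>       return 1;
>>>     return n * fact(n - 1);
>>>   }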
>> 
>> Interesting. I don’t know enough about the inliner to have a definitive opinion, but it would seem weird to me that the “optimal” solution would differ between LTO and O3.
> 
> It may not be; it's just that they were different in the past, and we
> are seeing that this had both positive and negative effects, which I
> guess isn't a big surprise to me.
> 
>> 
>>> 
>>> Another factor might be that doing the intermediate peephole
>>> optimizations (which are currently run after the compile-step
>>> inlining) could be cleaning up the code and reducing some of the
>>> inlining costs for the LTO round of inlining.
>> 
>> It depends on what you mean by “intermediate peephole”.
>> 
>> From a very high-level point of view, I see the O2/O3 pipeline organized this way:
>> 
>> - minor cleanup (populateFunctionPassManager)
>> - cleanup + globalopt
>> - inlining + peephole (in the same CGSCC PM)
>> - optimizations
>> 
>> And for LTO what I did is:
>> 
>> - minor cleanup (populateFunctionPassManager)
>> - cleanup + globalopt
>> # end of compile phase
>> # start of LTO phase on the linked module
>> - cleanup + globalopt + constantmerge
>> - inlining + peephole (in the same CGSCC PM)
>> - globalopt + globaldce + peephole again
>> - optimizations
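>> 
>> As a rough sketch (mine, hand-assembled against the 2015-era legacy
>> pass manager; the actual patch goes through PassManagerBuilder), the
>> LTO phase above corresponds to something like:
>> 
>>   #include "llvm/IR/LegacyPassManager.h"
>>   #include "llvm/Transforms/IPO.h"
>>   #include "llvm/Transforms/Scalar.h"
>>   using namespace llvm;
>> 
>>   void addLTOPhase(legacy::PassManager &PM) {
>>     PM.add(createGlobalOptimizerPass());      // cleanup + globalopt
>>     PM.add(createConstantMergePass());        // constantmerge
>>     PM.add(createFunctionInliningPass());     // inlining (a CGSCC pass)
>>     PM.add(createInstructionCombiningPass()); // peephole, same CGSCC PM
>>     PM.add(createGlobalOptimizerPass());      // globalopt
>>     PM.add(createGlobalDCEPass());            // globaldce
>>     PM.add(createInstructionCombiningPass()); // peephole again
>>     // ...followed by the main "optimizations" part of the pipeline.
>>   }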
>> 
>>> 
>>> For LTO specifically, I wonder how the peak memory usage is affected
>>> (e.g., as we were discussing with the bitcode size, it will see some
>>> larger functions due to the earlier inlining, but also potentially
>>> fewer or smaller functions if the code has been inlined and cleaned up
>>> beforehand).
>> 
>> I’ll try to check that on our compile-time test-suite.
>> 
>>> 
>>>> 
>>>> I ran the LLVM benchmark suite + some internal benchmarks with an early return placed before and after the inliner+peephole phase. Stopping before the inliner during the compile phase ends up with 13 regressions and 20 improvements, compared to running the inliner during the compile phase. I sent you some more details by email.
>>> 
>>> Just to clarify on those results: for the "Previous(1)" configuration,
>>> which stops after the inlining, are you just removing that early
>>> return from populateModulePassManager? If so, did you put the call to
>>> createEliminateAvailableExternallyPass back under the
>>> if(!PrepareForLTO) guard? There's probably some other stuff, like
>>> unrolling and vectorization, that as you note would be
>>> counterproductive to run prior to LTO.
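>>> 
>>> For reference, the guard I mean looks roughly like this (paraphrased
>>> from memory, not quoted from populateModulePassManager):
>>> 
>>>   if (!PrepareForLTO)
>>>     // Dropping available_externally function bodies is only safe when
>>>     // no LTO link follows that may still want to inline them.
>>>     MPM.add(createEliminateAvailableExternallyPass());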
>> 
>> No, I didn’t just remove the return; I moved it after the inliner+peephole CGSCC PM, so that the pipeline becomes:
>> 
>> - minor cleanup (populateFunctionPassManager)
>> - cleanup + globalopt
>> - inlining + peephole (in the same CGSCC PM)
>> # end of compile phase
>> # start of LTO phase on the linked module
>> - cleanup + globalopt + constantmerge
>> - inlining + peephole (in the same CGSCC PM)
>> - globalopt + globaldce + peephole again
>> - optimizations
> 
> Ok, this is what I was thinking about, since that is what is currently
> happening at HEAD and is apparently causing the performance gyrations.

Currently at HEAD it is a bit different; we have:

# “Full” O2/O3 during compile phase
- minor cleanup (populateFunctionPassManager)
- cleanup + globalopt
- inlining + peephole (in the same CGSCC PM)
- optimizations (<— not present in my “current” and “previous” configurations)
# end of compile phase
# start of LTO phase on the linked module
- cleanup + globalopt + constantmerge
- inlining (but NO peephole)
- a few peephole passes (but far fewer than in the regular pipeline)
- a few optimizations (but far fewer than in the regular pipeline)
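
As a rough sketch of how the two phases are driven (simplified and
hand-written here, assuming the current PassManagerBuilder interface,
not the exact clang/libLTO code):

  #include "llvm/IR/LegacyPassManager.h"
  #include "llvm/Transforms/IPO/PassManagerBuilder.h"
  using namespace llvm;

  void buildPipelines(legacy::FunctionPassManager &EarlyFPM,
                      legacy::PassManager &CompilePM,
                      legacy::PassManager &LTOPM) {
    PassManagerBuilder PMB;
    PMB.OptLevel = 2;
    PMB.PrepareForLTO = true;                  // compile phase of an LTO build
    PMB.populateFunctionPassManager(EarlyFPM); // "minor cleanup"
    PMB.populateModulePassManager(CompilePM);  // the O2/O3 part above
    PMB.populateLTOPassManager(LTOPM);         // LTO phase, linked module
  }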


> 
>>
>>> 
>>> Rather than exhaustively finding the right combination, a couple of
>>> data points seem particularly interesting: 1) the performance effects
>>> of just adding the inlining (and none of the other later opts after
>>> your early return) and exiting right after it in the PrepareForLTO
>>> case;
>> 
>> I think 1) is what I did just above, right?
> 
> Above is a combination of my 1) and 2).
> 
>> 
>>> 2) the
>>> performance effects of doing the peephole passes before exiting early
>>> in the PrepareForLTO case (so you get the code cleanup before the LTO
>>> inlining that might be affecting its cost analysis).
>> 
>> Can you explain a bit more? I’m not sure I understand what you mean here.
> 
> I was thinking of measuring the effects of that intermediate peephole
> in isolation.

Just to be sure, you want to see:

- minor cleanup (populateFunctionPassManager)
- cleanup + globalopt
- inlining
# end of compile phase
# start of LTO phase on the linked module
- cleanup + globalopt + constantmerge
- inlining + peephole (in the same CGSCC PM)
- globalopt + globaldce + peephole again
- optimizations

(i.e., the peephole passes are removed from the inliner CGSCC PM during the compile phase)



> 
>> The peephole passes will run as part of the same CGSCC PM the inliner is in, especially to clean up callees before the inliner processes a caller.
>> (This is probably suboptimal inside a single SCC, by the way, but I don’t see how the current PM or the new one could solve that.)
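>> 
>> Sketch of what I mean, with the legacy PM: function passes added right
>> after the inliner are pulled into the same CGSCC pass manager, so on
>> each bottom-up SCC visit the callees have already been cleaned up
>> before their callers are considered for inlining:
>> 
>>   PM.add(createFunctionInliningPass());     // CGSCC pass, bottom-up
>>   PM.add(createInstructionCombiningPass()); // run per SCC, right after
>>   PM.add(createCFGSimplificationPass());    // inlining into that SCC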
>> 
>>>> Mehdi
>> 


Mehdi


