[PATCH] D13443: Rework the LTO Pipeline, aligning closer to the O2/O3 pipeline.

Teresa Johnson via llvm-commits <llvm-commits at lists.llvm.org>
Tue Oct 6 10:24:46 PDT 2015


On Tue, Oct 6, 2015 at 9:21 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>> On Oct 6, 2015, at 7:05 AM, Teresa Johnson <tejohnson at google.com> wrote:
>>
>> Hi Mehdi,
>>
>> Thanks for sharing the results. As you note there are swings in both
>> directions, but the improvements outweigh the regressions.
>
> Yes, the results are swinging, but I'm tracking these as "bugs" to be fixed.
> Two weeks ago there were even more regressions, and I recovered some of them with http://reviews.llvm.org/D13390 (to be committed soon). That was an example where the LTO pipeline was “better” just “by chance”.
>
>
>>
>> On Mon, Oct 5, 2015 at 5:50 PM, Mehdi AMINI <mehdi.amini at apple.com> wrote:
>>> joker.eph added a comment.
>>>
>>> Right now my view is that if I get a performance improvement by running the inliner and the "peephole" passes twice, then it is a bug. If it is not a bug, it means the O3 pipeline is affected as well and we might want to run them twice there too. Does that make sense?
>>
>> I wonder if there are aspects of the inliner that work differently
>> when run twice vs. once. E.g., only one level of recursive inlining
>> is allowed currently, but running the inliner twice would allow two
>> levels. That may not be a big factor, but it is an example of where
>> running it twice vs. once will produce different results (sketched
>> below).
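For illustration, a minimal hypothetical example (not from the patch) of
how a second inliner run exposes one more level of recursive inlining:

    // Hypothetical C++ example: 'fact' is self-recursive, so the
    // bottom-up inliner inlines it into 'caller' at most one more
    // level per run.
    int fact(int n) { return n <= 1 ? 1 : n * fact(n - 1); }

    int caller(int n) {
      return fact(n);
      // After one inliner run, roughly:
      //   return n <= 1 ? 1 : n * fact(n - 1);
      // After a second run, one more level is peeled:
      //   return n <= 1 ? 1 : n * (n - 1 <= 1 ? 1 : (n - 1) * fact(n - 2));
    }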
>
> Interesting. I don’t know enough about the inliner to have a definitive opinion, but it would seem weird to me that the “optimal” solution would be different for LTO than for O3.

It may not be; it's just that the two pipelines were different in the
past, and we are seeing that the difference had both positive and
negative effects, which I guess isn't a big surprise to me.

>
>>
>> Another factor might be that the intermediate peephole optimizations
>> (which are currently run after the compile-step inlining) clean up
>> the code and reduce some of the inlining costs for the LTO round of
>> inlining (a hypothetical illustration follows).
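A hypothetical illustration of that effect: peephole cleanup after the
compile-step inlining can delete code that would otherwise count against
the callee in the LTO inliner's cost model ('callee' and 'expensive_path'
are made-up names):

    int expensive_path(int);  // hypothetical helper, declared elsewhere

    // After the compile-step inlining, suppose 'flag' has become a known
    // constant; instcombine/simplifycfg can then delete the whole branch...
    int callee(int x) {
      const bool flag = false;
      if (flag)
        x = expensive_path(x);
      return x + 1;  // ...leaving a tiny body that is cheap to inline at LTO time
    }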
>
> It depends on what you mean by “intermediate peephole”.
>
> From a very high-level point of view, I see the O2/O3 pipeline organized this way:
>
> - minor cleanup (populateFunctionPassManager)
> - cleanup + globalopt
> - inlining + peephole (in the same CGSCC PM)
> - optimizations
>
> And for LTO what I did is:
>
> - minor cleanup (populateFunctionPassManager)
> - cleanup + global opt
> # end of compile phase
> # start of LTO phase on the linked module
> - cleanup + global opt + constantmerge
> - inlining + peephole (in the same CGSCC PM)
> - globalopt + globaldce + peephole again
> - optimizations
>
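A minimal sketch of how the two phases above are typically driven,
assuming the legacy PassManagerBuilder interfaces of the time (this
mirrors the phase split described above, not the exact contents of
D13443):

    #include "llvm/IR/LegacyPassManager.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Transforms/IPO/PassManagerBuilder.h"
    using namespace llvm;

    void buildPipelines(Module &M) {
      PassManagerBuilder PMB;
      PMB.OptLevel = 2;
      PMB.PrepareForLTO = true;  // the compile phase feeds an LTO link

      // Compile phase: "minor cleanup" plus module-level cleanup/globalopt.
      legacy::FunctionPassManager FPM(&M);
      PMB.populateFunctionPassManager(FPM);
      legacy::PassManager CompileMPM;
      PMB.populateModulePassManager(CompileMPM);

      // LTO phase, run later on the linked module: cleanup + globalopt +
      // constantmerge, inlining + peephole, then the optimizations.
      legacy::PassManager LTOMPM;
      PMB.populateLTOPassManager(LTOMPM);
      // (the pass managers are then run over the module(s) elsewhere)
    }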
>>
>> For LTO specifically, I wonder how the peak memory usage is affected
>> (e.g., as we were discussing with the bitcode size, it will see some
>> larger functions due to the earlier inlining, but also potentially
>> fewer or smaller functions if the code has been inlined and cleaned
>> up beforehand).
>
> I’ll try to check that on our compile-time test-suite.
>
>>
>>>
>>> I ran the LLVM benchmark suite plus some internal benchmarks with an early return placed before vs. after the inliner+peephole phase. Stopping before the inliner during the compile phase ends up with 13 regressions and 20 improvements, compared to running the inliner during the compile phase. I sent you some more details by email.
>>
>> Just to clarify those results: for "Previous(1)", which stops after
>> the inlining, are you just removing that early return from
>> populateModulePassManager? If so, did you put the call to
>> createEliminateAvailableExternallyPass back under the
>> if (!PrepareForLTO) guard (sketched below)? There is probably other
>> stuff, like unrolling and vectorization, that, as you note, would be
>> counterproductive to run prior to LTO.
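To make the question concrete, a sketch of the guard being referred to
in populateModulePassManager (a paraphrase of HEAD at the time, not the
patch itself):

    if (!PrepareForLTO) {
      // Dropping available_externally bodies is only safe when the
      // output will not go through an LTO link, where the LTO-time
      // inliner still wants those bodies around.
      MPM.add(createEliminateAvailableExternallyPass());
    }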
>
> No, I didn’t just remove the return; I moved it after the inliner+peephole CGSCC PM, so that the pipeline becomes:
>
> - minor cleanup (populateFunctionPassManager)
> - cleanup + global opt
> - inlining + peephole (in the same CGSCC PM)
> # end of compile phase
> # start of LTO phase on the linked module
> - cleanup + global opt + constantmerge
> - inlining + peephole (in the same CGSCC PM)
> - globalopt + globaldce + peephole again
> - optimizations
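In code terms, a rough sketch of the change described above (the assumed
shape of the edit to populateModulePassManager, not the literal diff):

    MPM.add(createFunctionInliningPass());      // inliner (CGSCC pass)
    MPM.add(createInstructionCombiningPass());  // peephole, same CGSCC PM
    if (PrepareForLTO)
      return;  // end of the compile phase; the rest runs at LTO time
    // ... unrolling, vectorization, and the remaining optimizations ...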

OK, this is what I was thinking about - since that is what currently
happens at HEAD, and it is apparently causing the performance
gyrations.

>
>>
>> Rather than exhaustively searching for the right combination, a
>> couple of data points seem particularly interesting: 1) the
>> performance effects of just adding the inlining (and none of the
>> other later opts after your early return) and exiting right after it
>> in the PrepareForLTO case;
>
> I think 1) is what I did just above, right?

Above is a combination of my 1) and 2).

>
>> 2) the
>> performance effects of doing the peephole passes before exiting early
>> in the PrepareForLTO case (so you get the code cleanup before the LTO
>> inlining that might be affecting its cost analysis).
>
> Can you explain a bit more? I’m not sure I understand what you mean here.

I was thinking of measuring the effects of that intermediate peephole
in isolation.

> The peephole will run as part of the same CGSCC pass manager the inliner is part of, specifically to clean up callees before the inliner processes a caller.
> (It is probably suboptimal inside a single SCC, by the way, but I don’t see how either the current PM or the new one can solve this.)
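For context, that interleaving falls out of legacy pass-manager
scheduling: a function pass added right after a CGSCC pass is pulled
into the same CGSCC pass manager and run per SCC, bottom-up. A minimal
sketch:

    legacy::PassManager MPM;
    MPM.add(createFunctionInliningPass());      // CGSCC pass
    MPM.add(createInstructionCombiningPass());  // scheduled into the same
                                                // CGSCC PM: cleans up each
                                                // SCC's functions before the
                                                // inliner visits their callers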
>
>> Mehdi
>



-- 
Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413

