[llvm-dev] exploring possibilities for unifying ThinLTO and FullLTO frontend + initial optimization pipeline

Wed Apr 11 12:18:56 PDT 2018

Attached are some quick slides (backup from the dev meeting talk) about the
pass pipeline.

-- 
Mehdi

On Wed, Apr 11, 2018 at 12:18, Mehdi AMINI <joker.eph at gmail.com> wrote:

>
>
> On Wed, Apr 11, 2018 at 11:20, <katya.romanova at sony.com> wrote:
>
>>
>>
>>
>>
>> *From:* Mehdi AMINI <joker.eph at gmail.com>
>> *Sent:* Tuesday, April 10, 2018 11:53 PM
>> *To:* Romanova, Katya <katya.romanova at sony.com>
>> *Cc:* David Blaikie <dblaikie at gmail.com>; Teresa Johnson <
>> tejohnson at google.com>; llvm-dev <llvm-dev at lists.llvm.org>
>> *Subject:* Re: [llvm-dev] exploring possibilities for unifying ThinLTO
>> and FullLTO frontend + initial optimization pipeline
>>
>>
>>
>>
>>
>> On Tue, Apr 10, 2018 at 23:18, <katya.romanova at sony.com> wrote:
>>
>> Hi Mehdi,
>>
>>
>>
>> Awesome! It’s a very clear design. The only question left is which
>> pipeline to choose for the unified compile-phase optimization pipeline.
>>
>> -        ThinLTO compile-phase pipeline? It might very negatively affect
>> compile-time and the memory footprint for FullLTO link-phase. That was the
>> reason why so many optimizations were moved from the link-phase to the
>> parallel compile-phase for FullLTO in the first place.
>>
>>
>>
>> Just to clarify: "optimizations" were not "moved from the link-phase to
>> the parallel compile-phase for FullLTO", they have never been in the link
>> phase for FullLTO. It has always been this way.
>>
>>
>>
>> I see. What I meant was the following comment from the phabricator review
>> about defining the ThinLTO pipeline, but I didn’t remember its exact
>> wording.
>>
>> https://reviews.llvm.org/D17115
>>
>> “On the contrary to Full LTO, ThinLTO can afford to shift compile time
>> from the frontend to the linker: both phases are parallel”.
>>
>>
>>
>> I think that the ThinLTO compile-phase pipeline will only affect FullLTO
>> in the sense that we need to add more passes during the link phase, is this
>> what you meant?
>>
>>
>>
>> Yes, that’s exactly what I meant.
>>
>>
>>
>>
>> -        FullLTO compile-phase pipeline?  More optimization passes at
>> compile-phase will obviously increase compile time for ThinLTO, though I
>> suspect it will be tolerable. It is not very clear how this choice will
>> affect the overall runtime performance for ThinLTO. Assuming we keep the
>> well-tuned link-phase/backend optimization pipeline “as is” for ThinLTO and
>> FullLTO, we will repeat some optimization passes for ThinLTO at
>> compile-phase and later at link-phase which potentially could improve the
>> performance… or it could make it worse, because we might perform an
>> optimization early at compile-time, potentially preventing more aggressive
>> optimization at link-phase when we see a larger scope. Any prediction on
>> what would happen to the ThinLTO runtime performance?
>>
>>
>>
>> Note: repeating optimizations is not supposed to improve performance, at
>> least this isn't the goal of the pipeline.
>>
>> The pipeline for ThinLTO has been modeled on O3, good or bad we felt
>> there was no reason to really deviate and any improvement to one could
>> (should!) reflect on the other.
>>
>>
>>
>> The rationale behind the ThinLTO pipeline is not only compile time: it
>> splits the O3 pipeline at the point where we stop the "function
>> simplification" / inliner loop and before we get into
>> unrolling/vectorization.
>>
>> I remember even trying to stop the compile-phase without inlining but the
>> generated IR was too big: the inliner CGSCC visit actually reduces the size
>> of the IR considerably in some cases.
>>
>>
>>
>> Thank you for sharing! It’s very helpful.
>>
>>
>>
>> Mehdi, it seems that you have spent significant time experimenting with
>> the ThinLTO pipeline and determining where exactly the compile-phase should
>> end and the link-phase should start.  How do you envision the unified
>> ThinLTO/FullLTO compile-phase pipeline? We might tune/improve this pipeline
>> in the future, but having a good starting point is very important too.
>>
>
> I don't know: it is all about tradeoffs :)
> I was in favor of using a single pipeline based on ~O3, the reason being
> mainly that it is easier to maintain/validate/evolve: when folks improve
> the O3 pipeline you get the benefit immediately in the ThinLTO optimization
> phase, in contrast with FullLTO. The tradeoff is about compile-time: it can
> become really long for FullLTO in some extreme cases. I suggested in the
> past that such cases could be handled by running the FullLTO linker
> optimization phase with O1 to reduce the amount of optimization.
>
>
>
>
>>
>>
>> -        New “unified” compile-phase pipeline?
>>
>>
>>
>> I guess, there is not a definitive answer and we have to experiment,
>> measure compile-time/run-time performance and potentially make some
>> adjustments to the pipeline and to the thresholds. We have a few
>> proprietary tests at Sony that we could use for the performance
>> measurements, but it will be nicer if there are some open source benchmarks
>> that we could use. What did you use at Google/Apple for ThinLTO/FullLTO
>> measurements? Have you used some proprietary benchmarks also? It’s
>> important to make sure we won’t have run-time/compile-time performance
>> degradation, but it will be nicer if anyone can run previously used
>> ThinLTO/FullLTO benchmarks oneself, while making changes to the
>> optimization pipeline and heuristics.
>>
>>
>>
>> We benchmarked multiple variants of the pipeline two years ago. There
>> were some regressions when adopting the ThinLTO pipeline in FullLTO (and
>> some improvements), but when investigated we didn't find any real
>> regressions that couldn't be solved by fixing the optimizer.
>>
>>
>>
>> When referring to ThinLTO and FullLTO pipelines here do you mean
>> compile-phase pipeline, link-phase pipeline or full pipeline (i.e.,
>> compile-phase + link-phase)? The terminology is slightly confusing here.
>>
>
>
> Here I meant everything: trying to use the exact same pipeline in both
> phases.
>
>
>
>>
>>
>> I.e. these are cases where FullLTO gets it right "by luck" and not by
>> principle, and fixing such cases helps the non-LTO O3 (for example this
>> test case https://bugs.llvm.org/show_bug.cgi?id=27395 )
>>
>>
>>
>>
>>
>> >> # No flag: use the compile-phase preference, perform ThinLTO on a.o
>> >> and FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and
>> >> the ThinLTO objects
>>
>> >> $ clang a.o b.o c.o
>>
>>
>>
>> If I understood you correctly, while doing ThinLTO on a.o, we could
>> import from b.o and c.o (this is possible since the summaries are
>> available), while we won’t see a.o when doing FullLTO for b.o/c.o. (i.e.,
>> the previous non-permeable barrier between ThinLTO and FullLTO groups will
>> become permeable in one direction).
>>
>>
>>
>> It could be permeable in both directions: b.o+c.o become "like a single
>> ThinLTO object" after they get merged.
>>
>>
>>
>> I see…
>>
>> However, do you think by doing this, we will achieve a better performance
>> than doing ThinLTO backend for all of the files (a.o, b.o, c.o)?
>>
>>
>>
>> Performance is always very much use-case dependent.
>>
>> One may know that a group of files performs better when they get merged
>> together with FullLTO while the rest of the app does not?
>>
>>
>>
>> I don't know but this all needs to be carefully looked at from a
>> user-interface point of view I think (will it be intuitive for the users?
>> Will it fit in every (most) scenarios? etc.).
>>
>>
>>
>>
>>
>> >> # No flag: use the compile-phase preference, perform ThinLTO on a.o
>> >> and FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and
>> >> the ThinLTO objects
>>
>> >> $ clang a.o b.o c.o
>>
>> I wonder if we have a use-case for the “mix and match compile-phase
>> preference” situation that you described above? Maybe the linker should
>> simply report an error in this case? Or do we have to accept this because
>> of backwards compatibility?
>>
>
> I don't know :)
> We need to consider the cases of "old" bitcode that wouldn't have
> summaries (maybe they could get merged in the LTO partition but not
> participate in cross-module optimizations?)
> We should hear from Apple folks as well.
>
> --
> Mehdi
>
>
>>
>>
>>
>>
>>
>>
>>
>> Thank you!
>>
>> Katya.
>>
>>
>>
>>
>> *From:* Mehdi AMINI <joker.eph at gmail.com>
>> *Sent:* Tuesday, April 10, 2018 5:25 PM
>> *To:* Romanova, Katya <katya.romanova at sony.com>
>> *Cc:* David Blaikie <dblaikie at gmail.com>; Teresa Johnson <
>> tejohnson at google.com>; llvm-dev <llvm-dev at lists.llvm.org>
>> *Subject:* Re: [llvm-dev] exploring possibilities for unifying ThinLTO
>> and FullLTO frontend + initial optimization pipeline
>>
>>
>>
>> Hi,
>>
>>
>>
>> It is non trivial to recompute summaries (which is why we have summaries
>> in the bitcode in the first place by the way), because bitcode is expensive
>> to load.
>>
>>
>>
>> I think shipping two different variants of the bitcode, one with and one
>> without summaries isn't providing much benefit while complicating the flow.
>> We could achieve what you're looking for by revisiting the flow a little.
>>
>>
>>
>> I would try to consider if we can:
>>
>>
>>
>> 1) always generate summaries.
>>
>> 2) Use the same compile-phase optimization pipeline for ThinLTO and LTO.
>>
>> 3) Decide at link time if you want to do FullLTO or ThinLTO.
>>
>>
>>
>> We didn't go this route 2 years ago because during the bringup we
>> didn't want to affect FullLTO in any way, but it may make sense now to have
>> `clang -flto=thin` and `clang -flto=full` be identical and change the
>> linker plugins to operate either in full-LTO mode or in ThinLTO mode but
>> not differentiate based on the availability of the summaries.
>>
>>
>>
>> A possible behavior could be:
>>
>>
>>
>> # The -flto flag in the compile phase does not change the produced
>> bitcode except for a flag that records the preference in the bitcode
>> (FullLTO vs ThinLTO)
>>
>> $ clang -c -flto=thin a.cpp
>>
>> $ clang -c -flto=full b.cpp
>>
>> $ clang -c -flto=full c.cpp
>>
>>
>>
>> # At link time the behavior depends on the -flto flag passed in.
>>
>>
>>
>> # No flag: use the compile-phase preference, perform ThinLTO on a.o and
>> FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and the
>> ThinLTO objects
>>
>> $ clang a.o b.o c.o
>>
>>
>>
>> # Forces full LTO, merges all the objects, no cross module importing will
>> happen.
>>
>> $ clang a.o b.o c.o -flto=full
>>
>>
>>
>> # Forces ThinLTO for all objects, FullLTO won't happen, no objects will
>> be merged.
>>
>> $ clang a.o b.o c.o -flto=thin
>>
>>
>>
>> Cheers,
>>
>>
>>
>> --
>>
>> Mehdi
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Apr 10, 2018 at 15:51, via llvm-dev <llvm-dev at lists.llvm.org>
>> wrote:
>>
>> Hi David,
>>
>> Thank you so much for your reply!
>>
>>
>>
>> >> You're dealing with a situation where you are shipped BC files offline
>> and then do one, or multiple builds with these BC files?
>> Yes, that’s exactly the case.
>>
>>
>>
>> >> If the scenario was more like a naive build: Multiple BC files
>> generated on a single (multi-core/threaded) machine (but some Thin, some
>>
>> >> Full) & then fed to the linker, I would wonder if it'd be relatively
>> cheap for the LTO step to support this by computing summaries for
>>
>> >> FullLTO files on the fly (without a separate tool/writing the summary
>> to disk, etc).
>>
>>
>>
>> I think so. My understanding is that for FullLTO files, it’s possible to
>> run the name-anonymous-globals pass and compute summaries on the fly, which
>> should allow performing ThinLTO at the link phase.
>>
>>
>>
>> Katya.
>>
>>
>>
>> *From:* David Blaikie <dblaikie at gmail.com>
>> *Sent:* Tuesday, April 10, 2018 7:38 AM
>> *To:* Romanova, Katya <katya.romanova at sony.com>; Teresa Johnson <
>> tejohnson at google.com>
>> *Cc:* llvm-dev at lists.llvm.org
>> *Subject:* Re: [llvm-dev] exploring possibilities for unifying ThinLTO
>> and FullLTO frontend + initial optimization pipeline
>>
>>
>>
>> Hi Katya,
>>
>> [+Teresa since this is about ThinLTO & she's the owner there]
>>
>> I'm not sure how other folks feel, but terminologically I'm not sure I
>> think of these as different formats (for example you mention the idea of
>> stripping the summaries from ThinLTO BC files to then feed them in as
>> FullLTO files - I would imagine it'd be reasonable to modify/fix/improve
>> the linker integration to have it (perhaps optionally) /ignore/ the
>> summaries, or use the summaries but in a non-siloed way (so that there's
>> not that optimization boundary between ThinLTO and FullLTO))
>>
>> You're dealing with a situation where you are shipped BC files offline
>> and then do one, or multiple builds with these BC files?
>>
>> If the scenario was more like a naive build: Multiple BC files generated
>> on a single (multi-core/threaded) machine (but some Thin, some Full) & then
>> fed to the linker, I would wonder if it'd be relatively cheap for the LTO
>> step to support this by computing summaries for FullLTO files on the fly
>> (without a separate tool/writing the summary to disk, etc). Though I
>> suppose that'd produce a pretty wildly different behavior in the link when
>> just a single ThinLTO BC file was added to an otherwise FullLTO build.
>>
>> Anyway - just some (admittedly fairly uninformed) thoughts. I'm sure
>> Teresa has more informed ideas about how this might all look.
>>
>> On Mon, Apr 9, 2018 at 12:20 PM via llvm-dev <llvm-dev at lists.llvm.org>
>> wrote:
>>
>> Hello,
>>
>> I am exploring the possibility of unifying the BC file generation phase
>> for ThinLTO and FullLTO. Our third party library providers prefer to give
>> us only one version of the BC archives, rather than test and ship both Thin
>> and Full LTO BC archives. We want to find a way to allow our users to pick
>> either Thin or Full LTO, while having only one “unified” version of the BC
>> archive.
>>
>> Note, I am not necessarily proposing to do this work in the upstream
>> compiler. If there is no interest from other companies, we might have to
>> keep this as a private patch for Sony.
>>
>> One of the ideas (not my preference) is to mix and match files in the
>> Thin and Full BC formats.  I'm not sure how well the "mix and match"
>> scenario works in general. I was wondering if Apple or Google are doing
>> this for production?
>>
>> I wrote a toy example, compiled one group of files with ThinLTO and the
>> rest with FullLTO, linked them with gold. I saw that irrespective of
>> whether the Thin or Full LTO option was used at the link step, files are
>> optimized within the Thin group and within the Full group separately, but
>> they don't know about the files in the other group (which makes sense).
>> Basically, the border between Thin and Full LTO bitcode files created an
>> artificial "barrier" which prevented cross-border optimization.
>>
>> Obviously, I am not too fond of this idea. Even if mixing and matching
>> ThinLTO and FullLTO bitcode files will work “as is”, I suspect we will see
>> a non-trivial runtime performance degradation because of the
>> "ThinLTO"/"FullLTO" border. Are you aware of any potential problems with
>> this solution, other than performance?
>>
>>
>>
>> Another, hopefully better, idea is to introduce a "unified" BC format,
>> which could either be FullLTO, ThinLTO, or neither (e.g., something in
>> between).
>>
>> If the user chooses FullLTO at the link step, but some of the files are
>> in the Thin BC format – the linker will call a special LTO API to convert
>> these files to the Full LTO BC format (i.e., stripping the module summary
>> section + potentially do some additional optimizations from the FullLTO
>> pass manager pipeline).
>>
>> If the user chooses ThinLTO at the link step, but some of the files are
>> in the Full BC format – the linker will call an LTO API to convert these
>> files to the Thin LTO bitcode format (by regenerating the module summary
>> section dynamically for the Full LTO bitcode files).
>>
>> I think the most reasonable idea for the unification of the Thin and Full
>> LTO compilation pipelines is to use Full LTO as the “unified” BC format. If
>> the user requests FullLTO – no additional work is needed, the linker will
>> perform FullLTO as usual. If the user requests ThinLTO, the linker will call
>> an API to regenerate the module summary section for all the files in the
>> FullLTO format and perform ThinLTO as usual.
>>
>> In reality I suspect things will be much more complicated. The pipelines
>> for the Thin and Full LTO compilation phases are quite different. ThinLTO
>> can afford to do much more optimization in the linking phase (since it has
>> parallel backends & smaller IR compared to FullLTO), while for FullLTO we
>> are forced to move some optimizations from linking to the compilation phase.
>>
>> So, if we pick FullLTO as our unified format, we would increase the build
>> time for ThinLTO (we will be doing the FullLTO initial optimization
>> pipeline in the compile phase, which is more than what ThinLTO is currently
>> doing, but the pipeline of the optimizations in the backend will stay the
>> same). It’s not clear what will happen with the runtime performance: we
>> might improve it (because we repeat some of the optimizations several
>> times), or we might make it worse (because we might do an optimization in
>> the early compilation phase, potentially preventing more aggressive
>> optimization later). What are your expectations? Will this approach work in
>> general? If so, what do you think will happen with the runtime performance?
>>
>> I also noticed that the pass manager pipeline is different for
>> ThinLTO+Sample PGO (use profile case). This might create some additional
>> complications for unifying the Thin and FullLTO BC generation phases too,
>> but it’s too small a detail to worry about right now. I’m more interested in
>> choosing the right general direction for solving this problem now.
>>
>> Please share your thoughts!
>>
>> Thank you!
>>
>> Katya.
>>
>>
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>>
>
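
The link-time behavior proposed in the quoted message (a compile-phase
preference recorded in the bitcode, optionally overridden by -flto at link
time) can be sketched as a small Python model. This is only an illustration
of the three cases, not LLVM code: Mode, LinkPlan and plan_link are
hypothetical names, and treating importing as enabled under -flto=thin is an
assumption based on how ThinLTO normally works.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Mode(Enum):
    THIN = "thin"
    FULL = "full"


@dataclass
class LinkPlan:
    merged: List[str]    # objects merged into one module (FullLTO group)
    thin: List[str]      # objects kept separate for ThinLTO backends
    cross_import: bool   # whether ThinLTO importing is enabled


def plan_link(prefs: Dict[str, Mode],
              link_flag: Optional[Mode] = None) -> LinkPlan:
    """Hypothetical model of the proposed linker behavior.

    prefs maps each object file to the preference recorded at compile time;
    link_flag is the -flto value given at link time (None if absent).
    """
    objects = sorted(prefs)
    if link_flag is Mode.FULL:
        # Forces full LTO: merge everything, no cross-module importing.
        return LinkPlan(merged=objects, thin=[], cross_import=False)
    if link_flag is Mode.THIN:
        # Forces ThinLTO: nothing is merged (importing assumed enabled).
        return LinkPlan(merged=[], thin=objects, cross_import=True)
    # No flag: honor the compile-phase preference. The merged group behaves
    # "like a single ThinLTO object", so importing may cross the boundary.
    merged = [o for o in objects if prefs[o] is Mode.FULL]
    thin = [o for o in objects if prefs[o] is Mode.THIN]
    return LinkPlan(merged=merged, thin=thin, cross_import=True)


if __name__ == "__main__":
    prefs = {"a.o": Mode.THIN, "b.o": Mode.FULL, "c.o": Mode.FULL}
    for flag in (None, Mode.FULL, Mode.THIN):
        print(flag, plan_link(prefs, flag))
```

The model leaves out "old" bitcode without summaries, which the thread
flags as an open question.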
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ThinLTO Pipeline.pdf
Type: application/pdf
Size: 383195 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180411/ccd0c689/attachment-0001.pdf>