[llvm-dev] exploring possibilities for unifying ThinLTO and FullLTO frontend + initial optimization pipeline

Mehdi AMINI via llvm-dev llvm-dev at lists.llvm.org
Wed Apr 11 12:18:02 PDT 2018


On Wed, Apr 11, 2018 at 11:20, <katya.romanova at sony.com> wrote:

>
>
>
>
> *From:* Mehdi AMINI <joker.eph at gmail.com>
> *Sent:* Tuesday, April 10, 2018 11:53 PM
> *To:* Romanova, Katya <katya.romanova at sony.com>
> *Cc:* David Blaikie <dblaikie at gmail.com>; Teresa Johnson <
> tejohnson at google.com>; llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* Re: [llvm-dev] exploring possibilities for unifying ThinLTO
> and FullLTO frontend + initial optimization pipeline
>
>
>
>
>
> On Tue, Apr 10, 2018 at 23:18, <katya.romanova at sony.com> wrote:
>
> Hi Mehdi,
>
>
>
> Awesome! It’s a very clear design. The only question left is which
> pipeline to choose as the unified compile-phase optimization pipeline.
>
> -        The ThinLTO compile-phase pipeline? It might very negatively
> affect compile time and the memory footprint of the FullLTO link-phase.
> That was the reason why so many optimizations were moved from the
> link-phase to the parallel compile-phase for FullLTO in the first place.
>
>
>
> Just to clarify: optimizations were not "moved from the link-phase to
> the parallel compile-phase for FullLTO"; they have never been in the link
> phase for FullLTO. It has always been this way.
>
>
>
> I see. I was thinking of the following comment from the Phabricator review
> that defined the ThinLTO pipeline, but I didn’t remember its exact
> wording.
>
> https://reviews.llvm.org/D17115
>
> “On the contrary to Full LTO, ThinLTO can afford to shift compile time
> from the frontend to the linker: both phases are parallel”.
>
>
>
> I think that the ThinLTO compile-phase pipeline will only affect FullLTO
> in the sense that we need to add more passes during the link phase. Is this
> what you meant?
>
>
>
> Yes, that’s exactly what I meant.
>
>
>
>
> -        The FullLTO compile-phase pipeline? More optimization passes in
> the compile phase will obviously increase compile time for ThinLTO, though
> I suspect it will be tolerable. It is not very clear how this choice will
> affect the overall runtime performance of ThinLTO. Assuming we keep the
> well-tuned link-phase/backend optimization pipeline “as is” for ThinLTO and
> FullLTO, we will repeat some optimization passes for ThinLTO in the compile
> phase and again in the link phase, which could potentially improve
> performance… or it could make it worse, because we might perform an
> optimization early at compile time, potentially preventing a more
> aggressive optimization at the link phase, when we see a larger scope. Any
> prediction on what would happen to ThinLTO runtime performance?
>
>
>
> Note: repeating optimizations is not supposed to improve performance; at
> least, that isn't the goal of the pipeline.
>
> The pipeline for ThinLTO has been modeled on O3. For good or bad, we felt
> there was no reason to really deviate, and any improvement to one could
> (should!) reflect on the other.
>
>
>
> The rationale behind the ThinLTO pipeline is not only compile time: it
> splits the O3 pipeline at the point where we stop the "function
> simplification" / inliner loop, before we get into
> unrolling/vectorization.
>
> I remember even trying to stop the compile phase before inlining, but the
> generated IR was too big: the inliner CGSCC visit actually reduces the size
> of the IR considerably in some cases.
>
>
>
> Thank you for sharing! It’s very helpful.
>
>
>
> Mehdi, it seems that you have spent significant time experimenting with
> the ThinLTO pipeline and determining exactly where the compile phase should
> end and the link phase should start. How do you envision the unified
> ThinLTO/FullLTO compile-phase pipeline? We might tune/improve this pipeline
> in the future, but having a good starting point is very important too.
>

I don't know: it is all about tradeoffs :)
I was in favor of using a single pipeline based on ~O3, mainly because it is
easier to maintain/validate/evolve: when folks improve the O3 pipeline you
get the benefit immediately in the ThinLTO optimization phase, unlike with
FullLTO. The tradeoff is compile time: it can become really long for FullLTO
in some extreme cases. I suggested in the past that such cases could be
handled by running the FullLTO linker optimization phase at O1 to reduce the
amount of optimization.
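
For anyone who wants to see where the split actually lands, the pipelines can
be dumped and diffed. This is just a sketch: the -print-pipeline-passes flag
and the thinlto-pre-link/lto-pre-link pipeline aliases belong to the new pass
manager in more recent LLVM releases than the one discussed here.

# Any valid IR file will do as input; an empty module is enough.
$ touch empty.ll
# The regular O3 pipeline:
$ opt -passes='default<O3>' -print-pipeline-passes -disable-output empty.ll
# The ThinLTO compile-phase (pre-link) pipeline, which stops after the
# function-simplification/inliner loop:
$ opt -passes='thinlto-pre-link<O3>' -print-pipeline-passes -disable-output empty.ll
# The FullLTO compile-phase (pre-link) pipeline, for comparison:
$ opt -passes='lto-pre-link<O3>' -print-pipeline-passes -disable-output empty.ll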




>
>
> -        New “unified” compile-phase pipeline?
>
>
>
> I guess there is no definitive answer and we have to experiment, measure
> compile-time/run-time performance, and potentially make some adjustments to
> the pipeline and the thresholds. We have a few proprietary tests at Sony
> that we could use for the performance measurements, but it would be nicer
> if there were some open source benchmarks that we could use. What did you
> use at Google/Apple for ThinLTO/FullLTO measurements? Have you also used
> proprietary benchmarks? It’s important to make sure we won’t have
> run-time/compile-time performance regressions, and it would be nicer if
> anyone could run the previously used ThinLTO/FullLTO benchmarks themselves
> while making changes to the optimization pipeline and heuristics.
>
>
>
> We benchmarked multiple variants of the pipeline two years ago. There were
> some regressions when adopting the ThinLTO pipeline in FullLTO (and some
> improvements), but when we investigated we didn't find any real regressions
> that couldn't be solved by fixing the optimizer.
>
>
>
> When referring to the ThinLTO and FullLTO pipelines here, do you mean the
> compile-phase pipeline, the link-phase pipeline, or the full pipeline
> (i.e., compile phase + link phase)? The terminology is slightly confusing
> here.
>


Here I meant everything: trying to use the exact same pipeline in both
phases.



>
>
> I.e., these are cases where FullLTO gets it right "by luck" and not by
> principle, and fixing such cases helps non-LTO O3 as well (for example,
> this test case: https://bugs.llvm.org/show_bug.cgi?id=27395 )
>
>
>
>
>
> >> # No flag: use the compile-phase preference, perform ThinLTO on a.o and
> FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and the
> ThinLTO >> objects
>
> >> $ clang a.o b.o c.o
>
>
>
> If I understood you correctly, while doing ThinLTO on a.o, we could import
> from b.o and c.o (this is possible since the summaries are available),
> while we won’t see a.o when doing FullLTO for b.o/c.o (i.e., the previously
> impermeable barrier between the ThinLTO and FullLTO groups would become
> permeable in one direction).
>
>
>
> It could be permeable in both directions: b.o+c.o become "like a single
> ThinLTO object" after they get merged.
>
>
>
> I see…
>
> However, do you think that by doing this we will achieve better
> performance than running the ThinLTO backend for all of the files (a.o,
> b.o, c.o)?
>
>
>
> Performance is always very much use-case dependent.
>
> One may know that a particular group of files performs better when merged
> together with FullLTO, while the rest of the app does not?
>
>
>
> I don't know, but I think this all needs to be looked at carefully from a
> user-interface point of view (will it be intuitive for users? Will it fit
> most (or all) scenarios? etc.).
>
>
>
>
>
> >> # No flag: use the compile-phase preference, perform ThinLTO on a.o and
> FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and the
> ThinLTO >> objects
>
> >> $ clang a.o b.o c.o
>
> I wonder whether we have a use case for the “mix and match compile-phase
> preference” situation that you described above. Maybe the linker should
> simply report an error in this case? Or do we have to accept it for
> backwards compatibility?
>

I don't know :)
We need to consider the case of "old" bitcode that wouldn't have summaries
(maybe it could get merged into the LTO partition but not participate in
cross-module optimizations?).
We should hear from the Apple folks as well.
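
For what it's worth, checking whether a given bitcode file already carries a
summary is cheap; something along these lines should work (the exact
llvm-bcanalyzer output format varies between releases, so treat the grep as a
sketch):

$ llvm-bcanalyzer a.o | grep GLOBALVAL_SUMMARY_BLOCK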

-- 
Mehdi


>
> Thank you!
>
> Katya.
>
>
>
>
> *From:* Mehdi AMINI <joker.eph at gmail.com>
> *Sent:* Tuesday, April 10, 2018 5:25 PM
> *To:* Romanova, Katya <katya.romanova at sony.com>
> *Cc:* David Blaikie <dblaikie at gmail.com>; Teresa Johnson <
> tejohnson at google.com>; llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* Re: [llvm-dev] exploring possibilities for unifying ThinLTO
> and FullLTO frontend + initial optimization pipeline
>
>
>
> Hi,
>
>
>
> It is non-trivial to recompute summaries (which, by the way, is why we have
> summaries in the bitcode in the first place), because bitcode is expensive
> to load.
>
>
>
> I think shipping two different variants of the bitcode, one with and one
> without summaries, doesn't provide much benefit while complicating the
> flow. We could achieve what you're looking for by revisiting the flow a
> little.
>
>
>
> I would try to consider whether we can:
>
>
>
> 1) Always generate summaries.
>
> 2) Use the same compile-phase optimization pipeline for ThinLTO and FullLTO.
>
> 3) Decide at link time whether you want to do FullLTO or ThinLTO.
>
>
>
> We didn't go this route two years ago because during the bringup we didn't
> want to affect FullLTO in any way, but it may make sense now to have `clang
> -flto=thin` and `clang -flto=full` produce identical output and change the
> linker plugins to operate either in FullLTO mode or in ThinLTO mode rather
> than differentiating based on the availability of the summaries.
>
>
>
> A possible behavior could be:
>
>
>
> # The -flto flag in the compile phase does not change the produced bitcode,
> except for a flag that records the preference (FullLTO vs. ThinLTO) in the
> bitcode.
>
> $ clang -c -flto=thin a.cpp
>
> $ clang -c -flto=full b.cpp
>
> $ clang -c -flto=full c.cpp
>
>
>
> # At link time the behavior depends on the -flto flag passed in.
>
>
>
> # No flag: use the compile-phase preference, perform ThinLTO on a.o and
> FullLTO on b.o/c.o, but allow ThinLTO import between the LTO group and the
> ThinLTO objects
>
> $ clang a.o b.o c.o
>
>
>
> # Forces full LTO, merges all the objects, no cross module importing will
> happen.
>
> $ clang a.o b.o c.o -flto=full
>
>
>
> # Forces ThinLTO for all objects, FullLTO won't happen, no objects will be
> merged.
>
> $ clang a.o b.o c.o -flto=thin
>
>
>
> Cheers,
>
>
>
> --
>
> Mehdi
>
>
> On Tue, Apr 10, 2018 at 15:51, via llvm-dev <llvm-dev at lists.llvm.org>
> wrote:
>
> Hi David,
>
> Thank you so much for your reply!
>
>
>
> >> You're dealing with a situation where you are shipped BC files offline
> and then do one, or multiple builds with these BC files?
> Yes, that’s exactly the case.
>
>
>
> >> If the scenario was more like a naive build: Multiple BC files
> generated on a single (multi-core/threaded) machine (but some Thin, some
>
> >> Full) & then fed to the linker, I would wonder if it'd be relatively
> cheap for the LTO step to support this by computing summaries for
>
> >> FullLTO files on the fly (without a separate tool/writing the summary
> to disk, etc).
>
>
>
> I think so. My understanding is that for FullLTO files, it’s possible to
> run the name-anonymous-globals pass and compute summaries on the fly, which
> should allow us to perform ThinLTO at the link phase.
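
As a rough offline illustration of that idea, the same transformation can be
done today with opt (flag spellings may differ between LLVM versions, so
treat this as a sketch, not a recipe):

# Rename anonymous globals and re-emit the bitcode with a module summary,
# turning a "full" bitcode file into one that ThinLTO can import from.
$ opt -passes=name-anon-globals -module-summary b.o -o b.thin.o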
>
>
>
> Katya.
>
>
>
> *From:* David Blaikie <dblaikie at gmail.com>
> *Sent:* Tuesday, April 10, 2018 7:38 AM
> *To:* Romanova, Katya <katya.romanova at sony.com>; Teresa Johnson <
> tejohnson at google.com>
> *Cc:* llvm-dev at lists.llvm.org
> *Subject:* Re: [llvm-dev] exploring possibilities for unifying ThinLTO
> and FullLTO frontend + initial optimization pipeline
>
>
>
> Hi Katya,
>
> [+Teresa since this is about ThinLTO & she's the owner there]
>
> I'm not sure how other folks feel, but terminologically I'm not sure I
> think of these as different formats. For example, you mention the idea of
> stripping the summaries from ThinLTO BC files to then feed them in as
> FullLTO files - I would imagine it'd be reasonable to modify/fix/improve
> the linker integration to have it (perhaps optionally) /ignore/ the
> summaries, or use the summaries but in a non-siloed way (so that there's
> no optimization boundary between ThinLTO and FullLTO).
>
> You're dealing with a situation where you are shipped BC files offline and
> then do one, or multiple builds with these BC files?
>
> If the scenario was more like a naive build: Multiple BC files generated
> on a single (multi-core/threaded) machine (but some Thin, some Full) & then
> fed to the linker, I would wonder if it'd be relatively cheap for the LTO
> step to support this by computing summaries for FullLTO files on the fly
> (without a separate tool/writing the summary to disk, etc). Though I
> suppose that'd produce a pretty wildly different behavior in the link when
> just a single ThinLTO BC file was added to an otherwise FullLTO build.
>
> Anyway - just some (admittedly fairly uninformed) thoughts. I'm sure
> Teresa has more informed ideas about how this might all look.
>
> On Mon, Apr 9, 2018 at 12:20 PM via llvm-dev <llvm-dev at lists.llvm.org>
> wrote:
>
> Hello,
>
> I am exploring the possibility of unifying the BC file generation phase
> for ThinLTO and FullLTO. Our third party library providers prefer to give
> us only one version of the BC archives, rather than test and ship both Thin
> and Full LTO BC archives. We want to find a way to allow our users to pick
> either Thin or Full LTO, while having only one “unified” version of the BC
> archive.
>
> Note, I am not necessarily proposing to do this work in the upstream
> compiler. If there is no interest from other companies, we might have to
> keep this as a private patch for Sony.
>
> One of the ideas (not my preference) is to mix and match files in the Thin
> and Full BC formats. I'm not sure how well the "mix and match" scenario
> works in general. I was wondering whether Apple or Google do this in
> production.
>
> I wrote a toy example: I compiled one group of files with ThinLTO and the
> rest with FullLTO, and linked them with gold. I saw that, irrespective of
> whether the Thin or Full LTO option was used at the link step, files are
> optimized within the Thin group and within the Full group separately, but
> they don't know about the files in the other group (which makes sense).
> Basically, the border between Thin and Full LTO bitcode files created an
> artificial "barrier" that prevented cross-border optimization.
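
For reference, a minimal way to set up that kind of mixed experiment (the
object names here are made up, and the linker setup assumes gold with the
LLVM plugin is reachable via -fuse-ld=gold):

# ThinLTO bitcode (carries summaries):
$ clang -c -flto=thin a.cpp
# Regular (full) LTO bitcode:
$ clang -c -flto=full b.cpp c.cpp
# Link the mix with gold; the Thin and Full groups are optimized separately.
$ clang -flto=thin -fuse-ld=gold a.o b.o c.o -o mixed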
>
> Obviously, I am not too fond of this idea. Even if mixing and matching
> ThinLTO and FullLTO bitcode files works “as is”, I suspect we will see
> a non-trivial runtime performance degradation because of the
> "ThinLTO"/"FullLTO" border. Are you aware of any potential problems with
> this solution, other than performance?
>
>
>
> Another, hopefully better, idea is to introduce a "unified" BC format,
> which could be FullLTO, ThinLTO, or neither (e.g., something in
> between).
>
> If the user chooses FullLTO at the link step but some of the files are in
> the Thin BC format, the linker will call a special LTO API to convert
> these files to the Full LTO BC format (i.e., strip the module summary
> section and potentially run some additional optimizations from the FullLTO
> pass-manager pipeline).
>
> If the user chooses ThinLTO at the link step but some of the files are in
> the Full BC format, the linker will call an LTO API to convert these files
> to the ThinLTO bitcode format (by regenerating the module summary section
> dynamically for the Full LTO bitcode files).
>
> I think the most reasonable idea for the unification of the Thin and Full
> LTO compilation pipelines is to use Full LTO as the “unified” BC format. If
> the user requests FullLTO, no additional work is needed; the linker will
> perform FullLTO as usual. If the user requests ThinLTO, the linker will
> call an API to regenerate the module summary section for all the files in
> the FullLTO format and perform ThinLTO as usual.
>
> In reality I suspect things will be much more complicated. The pipelines
> for the Thin and Full LTO compilation phases are quite different. ThinLTO
> can afford to do much more optimization in the linking phase (since it has
> parallel backends & smaller IR compared to FullLTO), while for FullLTO we
> are forced to move some optimizations from linking to the compilation phase.
>
> So, if we pick FullLTO as our unified format, we would increase the build
> time for ThinLTO (we will be doing the FullLTO initial optimization
> pipeline in the compile phase, which is more than what ThinLTO is currently
> doing, but the pipeline of the optimizations in the backend will stay the
> same). It’s not clear what will happen with the runtime performance: we
> might improve it (because we repeat some of the optimizations several
> times), or we might make it worse (because we might do an optimization in
> the early compilation phase, potentially preventing more aggressive
> optimization later). What are your expectations? Will this approach work in
> general? If so, what do you think will happen with the runtime performance?
>
> I also noticed that the pass manager pipeline is different for
> ThinLTO+Sample PGO (the profile-use case). This might create some
> additional complications for the unification of the Thin and FullLTO BC
> generation phases too, but it’s too small a detail to worry about right
> now. I’m more interested in choosing the right general direction for
> solving this problem.
>
> Please share your thoughts!
>
> Thank you!
>
> Katya.
>
>
>
>
>