[llvm-dev] PGO is ineffective for Rust - but why?

Tue Dec 3 09:11:24 PST 2019

Interesting. Does PGO mean PGO + ThinLTO here?

On Mon, Dec 2, 2019 at 1:14 AM Michael Woerister via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> For anyone interested, I have a final update on this topic: I've come
> to the conclusion that, with the previously mentioned Cargo issue [1]
> fixed, profile-guided optimization now works as expected with Rust. I
> have a number of reasons to think so:
>
> - I did some semi-automated investigation of benchmarks that did not
> show much of a speedup and was not able to find any missing branch
> weights or function call counts. The concrete branch weights that are
> easy to predict (error paths in code that does not error during
> instrumentation runs) also looked correct to me. I subsequently added
> regression tests to the Rust compiler which make sure that branch
> weights are correct in a number of basic cases.
>
> - I also investigated indirect call promotion and it seems that
> idiomatic Rust code just contains very few indirect calls. I added
> regression tests that make sure that indirect call promotion is
> correctly performed for the two most common cases, calling through a
> function pointer and doing a dynamically dispatched method call.
>
> - Someone brought forth the hypothesis that Rust's much coarser
> compilation unit granularity might (partly) explain the difference of
> PGO effectiveness compared to C/C++ [2] -- and indeed my experiments
> seem to back this hypothesis up. When compiling Rust code for maximum
> performance, one usually lets the compiler generate a single object
> file per crate, which is equivalent to having a single object file per
> static library in C/C++. With this setup, PGO was only able to achieve
> an average 0.3% performance improvement in my benchmarks. However,
> increasing the number of object files to (roughly) one per source file
> led to an average performance improvement of 1.2%, that is, PGO made 4
> times as much of a difference. Reducing ThinLTO's import-instr-limit
> to 10 magnified the effect even more, making the PGO version about 4%
> faster than the non-PGO version, which is well within the range of
> improvement that one can expect from PGO. Interestingly, this last
> configuration with the stricter import limit was the most performant
> one, being also ~3% faster than the single compilation unit setup both
> with and without PGO.
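
For readers unfamiliar with the two indirect-call cases named in the second point above (a call through a function pointer and a dynamically dispatched method call), here is a minimal Rust sketch. All type and function names are hypothetical and are not taken from the regression tests mentioned; whether the compiler actually emits an indirect call for them depends on optimization level and inlining.

```rust
// Hypothetical types and functions, for illustration only.
trait Shape {
    fn area(&self) -> u64;
}

struct Square(u64);
struct Rect(u64, u64);

impl Shape for Square {
    fn area(&self) -> u64 {
        self.0 * self.0
    }
}

impl Shape for Rect {
    fn area(&self) -> u64 {
        self.0 * self.1
    }
}

fn double(x: u64) -> u64 {
    x * 2
}

fn square(x: u64) -> u64 {
    x * x
}

// Case 1: call through a function pointer. The callee is a runtime value.
fn apply(f: fn(u64) -> u64, x: u64) -> u64 {
    f(x)
}

// Case 2: dynamically dispatched method call through a trait object.
fn total_area(shapes: &[Box<dyn Shape>]) -> u64 {
    shapes.iter().map(|s| s.area()).sum()
}

fn main() {
    let shapes: Vec<Box<dyn Shape>> = vec![Box::new(Square(3)), Box::new(Rect(2, 5))];
    println!("apply(double, 21) = {}", apply(double, 21));
    println!("apply(square, 7) = {}", apply(square, 7));
    println!("total area = {}", total_area(&shapes));
}
```

Indirect call promotion can only help at call sites like `f(x)` and `s.area()`; statically dispatched generic code, which is more common in idiomatic Rust, has no such sites to promote.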
>
> In conclusion, (1) there is no evidence that the implementation is
> broken and (2) there are a number of cases and configurations that
> demonstrate that PGO *can* make as much of a difference as can be
> expected from it.
>
> [1] https://github.com/rust-lang/cargo/issues/7416
> [2]
> https://internals.rust-lang.org/t/profile-guided-optimization-how-well-does-it-work-for-you/11108/11
>
> On Tue, Sep 24, 2019 at 5:15 PM Michael Woerister
> <mwoerister at mozilla.com> wrote:
> >
> > To give a little update here:
> >
> > - I've been further investigating and found an issue [1] with the
> > Cargo build tool that most Rust projects use. This issue prevents all
> > projects using Cargo from properly using PGO because it causes symbol
> > names to be different between the generate and the use phase. With
> > this issue fixed the number of "No profile data available for
> > function" warnings goes down from 92369 to 1167 for the Firefox
> > codebase.
> >
> > - I also found that the potential GNU ld bug mentioned above
> > apparently does not affect Firefox. The number of "No profile data
> > available for function" warnings is exactly the same for GNU ld and
> > LLD. I don't know yet where the remaining 1167 warnings come from
> > though.
> >
> > - Unfortunately, even with all of the above fixes applied, my medium
> > sized benchmark still performs worse with PGO than without it. For my
> > tiny example [2] PGO reduces the number of branch misses by more than
> > 50%. For the medium sized benchmark, however, the PGO version has
> > slightly *more* branch misses. This seems to indicate that there is
> > still something wrong.
> >
> > I will further investigate.
> >
> > [1] https://github.com/rust-lang/cargo/issues/7416
> > [2] https://github.com/michaelwoerister/rust-pgo-test-programs/tree/master/branch_weights/
> >
> >
> > On Tue, Sep 17, 2019 at 6:16 PM Xinliang David Li <xinliangli at gmail.com> wrote:
> > >
> > > You can compare the input arguments and object files passed to the
> > > linker.
> > >
> > > Regarding GNU ld, it is possible that it triggers another bug relating
> > > to start section and garbage collection. A previous bug is here:
> > > https://bugs.llvm.org/show_bug.cgi?id=25286
> > >
> > > On Tue, Sep 17, 2019 at 8:39 AM Michael Woerister <mwoerister at mozilla.com> wrote:
> > >>
> > >> Interestingly, a C version of the same test program [1] compiled with
> > >> Clang 8 does not have any problems with GNU ld: The `__llvm_prf_data`
> > >> section is the same size for all three linkers. It must be something
> > >> specific to the Rust compiler that's going wrong here.
> > >>
> > >> [1] https://github.com/michaelwoerister/rust-pgo-test-programs/tree/master/cpp_branch_weights
> > >>
> > >> On Tue, Sep 17, 2019 at 3:26 PM Michael Woerister
> > >> <mwoerister at mozilla.com> wrote:
> > >> >
> > >> > > Can you clarify if performance difference is caused by using
> > >> > > different linkers at instrumentation build?
> > >> >
> > >> > Yes, good observation! Whether the bug occurs depends only on the
> > >> > linker used for creating the instrumented binary. The linker used
> > >> > during the "use" phase makes no difference.
> > >> >
> > >> > > If that is the case, try dump the sections of the resulting
> > >> > > binary and compare __llvm_prf_** sections.
> > >> >
> > >> > For the final instrumented executable, it looks like the
> > >> > `__llvm_prf_data` section is 480 bytes large when using GNU ld,
> > >> > while it is 528 bytes for gold and lld. The size difference (48
> > >> > bytes) incidentally is exactly the size of the `__llvm_prf_data`
> > >> > section in the object file containing the code that is later
> > >> > missing branch weights. It looks like the GNU linker loses the
> > >> > `__llvm_prf_data` section from that object file?
> > >> >
> > >> > > Also check the arguments passed to the linker. It should have
> > >> > > -u__llvm_profile_runtime to force the profile runtime to be
> > >> > > linked in.
> > >> >
> > >> > `-u__llvm_profile_runtime` is properly passed to the linker,
> > >> > regardless of which linker it is.
> > >> >
> > >> > On Mon, Sep 16, 2019 at 7:40 PM Xinliang David Li <xinliangli at gmail.com> wrote:
> > >> > >
> > >> > > Can you clarify if performance difference is caused by using
> > >> > > different linkers at instrumentation build? If that is the case,
> > >> > > try dump the sections of the resulting binary and compare
> > >> > > __llvm_prf_** sections. Also check the arguments passed to the
> > >> > > linker. It should have -u__llvm_profile_runtime to force the
> > >> > > profile runtime to be linked in.
> > >> > >
> > >> > > David
> > >> > >
> > >> > > On Mon, Sep 16, 2019 at 8:42 AM Michael Woerister via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> > >> > >>
> > >> > >> So one interesting observation has already come out of this: I
> > >> > >> confirmed that `rustc` indeed uses `-ffunction-sections` and
> > >> > >> `-fdata-sections` on all platforms except for macOS. When trying
> > >> > >> out different linkers for a small test case [1], however, I found
> > >> > >> that there were rather large differences in execution time:
> > >> > >>
> > >> > >> ld (no PGO) = 172 ms
> > >> > >> ld (PGO) = 196 ms
> > >> > >>
> > >> > >> gold (no PGO) = 182 ms
> > >> > >> gold (PGO) = 141 ms
> > >> > >>
> > >> > >> lld (no PGO) = 193 ms
> > >> > >> lld (PGO) = 171 ms
> > >> > >>
> > >> > >> So `gold` and `lld` both profit from PGO quite a bit, while
> > >> > >> `ld`-linked programs are slower with PGO. I then noticed that
> > >> > >> branch weights for `ld` were missing from most branches, while
> > >> > >> the counts for the other linkers are correct. All of this
> > >> > >> suggests to me that something goes wrong when `ld` tries to link
> > >> > >> in the profiling runtime.
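
To put the six timings quoted above on a common scale, the relative change from the non-PGO to the PGO build can be computed per linker. The following is a small Rust sketch of that arithmetic only; the helper name `percent_change` is made up for illustration. With the quoted numbers it works out to roughly 14% slower for `ld`, about 22.5% faster for `gold`, and about 11% faster for `lld`.

```rust
/// Relative change in percent when going from `baseline_ms` to `pgo_ms`.
/// A negative result means the PGO build ran faster than the baseline.
fn percent_change(baseline_ms: f64, pgo_ms: f64) -> f64 {
    (pgo_ms - baseline_ms) / baseline_ms * 100.0
}

fn main() {
    // (linker, no-PGO time in ms, PGO time in ms), as quoted in the thread.
    let timings = [
        ("ld", 172.0, 196.0),
        ("gold", 182.0, 141.0),
        ("lld", 193.0, 171.0),
    ];

    for &(linker, no_pgo, pgo) in timings.iter() {
        println!("{}: {:+.1}%", linker, percent_change(no_pgo, pgo));
    }
}
```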
> > >> > >>
> > >> > >> I'll be investigating further.
> > >> > >>
> > >> > >> [1] https://github.com/michaelwoerister/rust-pgo-test-programs/tree/master/branch_weights
> > >> > >>
> > >> > >>
> > >> > >> On Thu, Sep 12, 2019 at 6:31 PM Teresa Johnson <tejohnson at google.com> wrote:
> > >> > >> >
> > >> > >> >
> > >> > >> >
> > >> > >> > On Thu, Sep 12, 2019 at 8:18 AM Teresa Johnson <tejohnson at google.com> wrote:
> > >> > >> >>
> > >> > >> >> I just have a couple suggestions off the top of my head:
> > >> > >> >> - have you tried using the new pass manager
> > >> > >> >>   (-fexperimental-new-pass-manager)? That has access to
> > >> > >> >>   additional analysis info during inlining and is able to make
> > >> > >> >>   more precise PGO based inline decisions.
> > >> > >> >
> > >> > >> >
> > >> > >> > (although note the above shouldn't make the difference between
> > >> > >> > no performance gain and a typical PGO performance boost)
> > >> > >> >
> > >> > >> > Another thing I just thought of - are you using
> > >> > >> > -ffunction-sections and -fdata-sections? These will allow for
> > >> > >> > PGO based function layout in the linker (assuming you are using
> > >> > >> > lld or gold).
> > >> > >> >
> > >> > >> >> - have you tried collecting profile data with and without PGO
> > >> > >> >>   to see if you can compare where cycles are being spent? That's
> > >> > >> >>   my usual way of debugging performance differences related to
> > >> > >> >>   inlining or profile changes.
> > >> > >> >> - just a comment that it is odd you are getting better
> > >> > >> >>   performance without the pre-inlining - which typically helps
> > >> > >> >>   because you get better context-sensitive profile info. Maybe
> > >> > >> >>   sanity check that the pre-inlining is kicking in for both the
> > >> > >> >>   profile gen and use passes?
> > >> > >> >>
> > >> > >> >> Teresa
> > >> > >> >>
> > >> > >> >> On Thu, Sep 12, 2019 at 2:18 AM Michael Woerister via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> > >> > >> >>>
> > >> > >> >>> Hi everyone,
> > >> > >> >>>
> > >> > >> >>> As part of my work for Mozilla's Low Level Tools team I've
> > >> > >> >>> implemented PGO in the Rust compiler. The feature is
> > >> > >> >>> available since Rust 1.37 [1]. However, so far we have not
> > >> > >> >>> seen any actual performance gains from enabling PGO for
> > >> > >> >>> Rust code. Performance even seems to drop 1-3% with PGO
> > >> > >> >>> enabled. I wonder why that is and I'm hoping that someone
> > >> > >> >>> here might have experience debugging PGO effectiveness.
> > >> > >> >>>
> > >> > >> >>>
> > >> > >> >>> PGO in the Rust compiler
> > >> > >> >>> ------------------------
> > >> > >> >>>
> > >> > >> >>> The Rust compiler uses IR-level instrumentation (the
> > >> > >> >>> equivalent of Clang's `-fprofile-generate`/`-fprofile-use`).
> > >> > >> >>> This has worked pretty well and even enables doing PGO for
> > >> > >> >>> mixed Rust/C++ codebases when also using Clang.
> > >> > >> >>>
> > >> > >> >>> The Rust compiler has regression tests that make sure that:
> > >> > >> >>>
> > >> > >> >>> - instrumentation shows up in LLVM IR for the `generate`
> > >> > >> >>>   phase, and that
> > >> > >> >>>
> > >> > >> >>> - profiling data is actually used during the `use` phase,
> > >> > >> >>>   i.e. that cold functions get marked with `cold` and hot
> > >> > >> >>>   functions get marked with `inline`.
> > >> > >> >>>
> > >> > >> >>> I also verified manually that `branch_weights` are being set
> > >> > >> >>> in IR. So, from my perspective, the PGO implementation does
> > >> > >> >>> what it is supposed to do.
> > >> > >> >>>
> > >> > >> >>> However, as already mentioned, in all benchmarks I've seen so
> > >> > >> >>> far performance seems to stay the same at best and often even
> > >> > >> >>> suffers slightly. This is surprising because for C++ code,
> > >> > >> >>> Clang's version of IR-level instrumentation and PGO brings
> > >> > >> >>> significant gains (up to 5-10% from what I've seen in
> > >> > >> >>> benchmarks for Firefox).
> > >> > >> >>>
> > >> > >> >>> One thing we noticed early on is that disabling the
> > >> > >> >>> pre-inlining pass (`-disable-preinline`) seems to
> > >> > >> >>> consistently improve the situation for Rust code. Doing that
> > >> > >> >>> we sometimes see performance wins of almost 1% over not using
> > >> > >> >>> PGO. This again is very different from C++, where disabling
> > >> > >> >>> this pass causes dramatic performance losses for the Firefox
> > >> > >> >>> benchmarks. And a 1% performance improvement is still well
> > >> > >> >>> below expectations, I think.
> > >> > >> >>>
> > >> > >> >>> So my questions to you are:
> > >> > >> >>>
> > >> > >> >>> - Has anybody here observed something similar while
> > >> > >> >>>   working on or with PGO?
> > >> > >> >>>
> > >> > >> >>> - Are there certain known characteristics of LLVM IR code
> > >> > >> >>>   that inhibit PGO's effectiveness and that IR produced by
> > >> > >> >>>   `rustc` might exhibit?
> > >> > >> >>>
> > >> > >> >>> - Does anybody know of a good source that describes how to
> > >> > >> >>>   effectively debug a problem like this?
> > >> > >> >>>
> > >> > >> >>> - Does anybody know of a small example program in C/C++
> > >> > >> >>>   that is known to profit from PGO and that could be
> > >> > >> >>>   re-implemented in Rust for comparison?
> > >> > >> >>>
> > >> > >> >>> Thanks a lot for reading! Any help is appreciated.
> > >> > >> >>>
> > >> > >> >>> -Michael
> > >> > >> >>>
> > >> > >> >>> [1] https://blog.rust-lang.org/2019/08/15/Rust-1.37.0.html#profile-guided-optimization
> > >> > >> >>> _______________________________________________
> > >> > >> >>> LLVM Developers mailing list
> > >> > >> >>> llvm-dev at lists.llvm.org
> > >> > >> >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > >> > >> >>
> > >> > >> >>
> > >> > >> >>
> > >> > >> >> --
> > >> > >> >> Teresa Johnson | Software Engineer | tejohnson at google.com |