[llvm-dev] PGO is ineffective for Rust - but why?

Wed Sep 18 05:32:17 PDT 2019

Thank you, Teresa!

Down the road we definitely will want to combine PGO with ThinLTO.

On Tue, Sep 17, 2019 at 5:45 PM Teresa Johnson <tejohnson at google.com> wrote:

>
>
> On Tue, Sep 17, 2019 at 6:25 AM Michael Woerister <mwoerister at mozilla.com>
> wrote:
>
>>
>> > By ld do you mean GNU ld?
>>
>> Yes, GNU ld version 2.31.1 on Fedora 30.
>>
>> > I know GNU ld does "work" with LLVM's gold plugin, but it's an untested
>> > combination and not recommended.
>>
>> That's good to know! However, in this case no linker plugin is involved.
>> All of LLVM is executed within the Rust compiler and the linker only ever
>> gets to see regular object files.
>>
>
> Ugh, I was confusing your PGO issue with an LTO issue - there is no plugin
> involved in non-LTO! And GNU ld should be fine with regular obj files
> produced by LLVM. Sorry for the confusion!
>
> It sounds like David had the right intuition on what might be going on,
> I'll let him follow up with you on that as he has a better understanding of
> the instrumentation side.
>
> Teresa
>
>
>> On Mon, Sep 16, 2019 at 7:07 PM Teresa Johnson <tejohnson at google.com>
>> wrote:
>>
>>> Interesting. By ld do you mean GNU ld? I know GNU ld does "work" with
>>> LLVM's gold plugin, but it's an untested combination and not recommended. I
>>> wouldn't be surprised if there were some issues around it not passing
>>> necessary info to the gold plugin.
>>>
>>> Teresa
>>>
>>> On Mon, Sep 16, 2019 at 8:41 AM Michael Woerister <
>>> mwoerister at mozilla.com> wrote:
>>>
>>>> So one interesting observation has already come out of this: I
>>>> confirmed that `rustc` indeed uses `-ffunction-sections` and
>>>> `-fdata-sections` on all platforms except for macOS. When trying out
>>>> different linkers for a small test case [1], however, I found that
>>>> there were rather large differences in execution time:
>>>>
>>>> ld (no PGO) = 172 ms
>>>> ld (PGO) = 196 ms
>>>>
>>>> gold (no PGO) = 182 ms
>>>> gold (PGO) = 141 ms
>>>>
>>>> lld (no PGO) = 193 ms
>>>> lld (PGO) = 171 ms
>>>>
>>>> So `gold` and `lld` both profit from PGO quite a bit, while `ld`
>>>> linked programs are slower with PGO. I then noticed that branch
>>>> weights for `ld` were missing from most branches, while the counts for
>>>> the other linkers are correct. All of this suggests to me that
>>>> something goes wrong when `ld` tries to link in the profiling runtime.
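
To put the timings above on a common scale, the relative change from the no-PGO build to the PGO build can be worked out per linker. The following is a minimal sketch in Rust; `percent_change` is a hypothetical helper name introduced here for illustration, and the numbers are copied from the list above.

```rust
/// Relative change of `pgo_ms` versus `baseline_ms`, in percent.
/// A negative result means the PGO build ran faster than the baseline.
/// (Hypothetical helper, not part of rustc or LLVM.)
fn percent_change(baseline_ms: f64, pgo_ms: f64) -> f64 {
    (pgo_ms - baseline_ms) / baseline_ms * 100.0
}

fn main() {
    // (linker, no-PGO time in ms, PGO time in ms), taken from the measurements above.
    let timings = [
        ("ld", 172.0, 196.0),
        ("gold", 182.0, 141.0),
        ("lld", 193.0, 171.0),
    ];

    for (linker, baseline, pgo) in timings.iter() {
        // Print the signed relative change with one decimal place.
        println!("{:<5} {:+.1}%", linker, percent_change(*baseline, *pgo));
    }
}
```

With these inputs, a positive result (as for `ld`) means the PGO build ran slower than the baseline, and a negative result (as for `gold` and `lld`) means it ran faster.
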
>>>>
>>>> I'll be investigating further.
>>>>
>>>> [1]
>>>> https://github.com/michaelwoerister/rust-pgo-test-programs/tree/master/branch_weights
>>>>
>>>>
>>>> On Thu, Sep 12, 2019 at 6:31 PM Teresa Johnson <tejohnson at google.com>
>>>> wrote:
>>>> >
>>>> >
>>>> >
>>>> > On Thu, Sep 12, 2019 at 8:18 AM Teresa Johnson <tejohnson at google.com>
>>>> > wrote:
>>>> >>
>>>> >> I just have a couple suggestions off the top of my head:
>>>> >> - have you tried using the new pass manager
>>>> >>   (-fexperimental-new-pass-manager)? That has access to additional
>>>> >>   analysis info during inlining and is able to make more precise
>>>> >>   PGO based inline decisions.
>>>> >
>>>> >
>>>> > (although note the above shouldn't make the difference between no
>>>> > performance gain and a typical PGO performance boost)
>>>> >
>>>> > Another thing I just thought of - are you using -ffunction-sections
>>>> > and -fdata-sections? These will allow for PGO based function layout in
>>>> > the linker (assuming you are using lld or gold).
>>>> >
>>>> >> - have you tried collecting profile data with and without PGO to see
>>>> >>   if you can compare where cycles are being spent? That's my usual way
>>>> >>   of debugging performance differences related to inlining or profile
>>>> >>   changes.
>>>> >> - just a comment that it is odd you are getting better performance
>>>> >>   without the pre-inlining - which typically helps because you get
>>>> >>   better context-sensitive profile info. Maybe sanity check that the
>>>> >>   pre inlining is kicking in for both the profile gen and use passes?
>>>> >>
>>>> >> Teresa
>>>> >>
>>>> >> On Thu, Sep 12, 2019 at 2:18 AM Michael Woerister via llvm-dev <
>>>> >> llvm-dev at lists.llvm.org> wrote:
>>>> >>>
>>>> >>> Hi everyone,
>>>> >>>
>>>> >>> As part of my work for Mozilla's Low Level Tools team I've
>>>> >>> implemented PGO in the Rust compiler. The feature is
>>>> >>> available since Rust 1.37 [1]. However, so far we have not
>>>> >>> seen any actual performance gains from enabling PGO for
>>>> >>> Rust code. Performance even seems to drop 1-3% with PGO
>>>> >>> enabled. I wonder why that is and I'm hoping that someone
>>>> >>> here might have experience debugging PGO effectiveness.
>>>> >>>
>>>> >>>
>>>> >>> PGO in the Rust compiler
>>>> >>> ------------------------
>>>> >>>
>>>> >>> The Rust compiler uses IR-level instrumentation (the
>>>> >>> equivalent of Clang's `-fprofile-generate`/`-fprofile-use`).
>>>> >>> This has worked pretty well and even enables doing PGO for
>>>> >>> mixed Rust/C++ codebases when also using Clang.
>>>> >>>
>>>> >>> The Rust compiler has regression tests that make sure that:
>>>> >>>
>>>> >>> - instrumentation shows up in LLVM IR for the `generate` phase,
>>>> >>>   and that
>>>> >>>
>>>> >>> - profiling data is actually used during the `use` phase, i.e.
>>>> >>>   that cold functions get marked with `cold` and hot functions
>>>> >>>   get marked with `inline`.
>>>> >>>
>>>> >>> I also verified manually that `branch_weights` are being set
>>>> >>> in IR. So, from my perspective, the PGO implementation does
>>>> >>> what it is supposed to do.
>>>> >>>
>>>> >>> However, as already mentioned, in all benchmarks I've seen so
>>>> >>> far performance seems to stay the same at best and often even
>>>> >>> suffers slightly. This is surprising because for C++ code,
>>>> >>> using Clang's version of IR-level instrumentation & PGO brings
>>>> >>> significant gains (up to 5-10% from what I've seen in
>>>> >>> benchmarks for Firefox).
>>>> >>>
>>>> >>> One thing we noticed early on is that disabling the
>>>> >>> pre-inlining pass (`-disable-preinline`) seems to consistently
>>>> >>> improve the situation for Rust code. Doing that we sometimes
>>>> >>> see performance wins of almost 1% over not using PGO. This
>>>> >>> again is very different to C++ where disabling this pass
>>>> >>> causes dramatic performance losses for the Firefox benchmarks.
>>>> >>> And 1% performance improvement is still well below
>>>> >>> expectations, I think.
>>>> >>>
>>>> >>> So my questions to you are:
>>>> >>>
>>>> >>> - Has anybody here observed something similar while
>>>> >>>   working on or with PGO?
>>>> >>>
>>>> >>> - Are there certain known characteristics of LLVM IR code
>>>> >>>   that inhibit PGO's effectiveness and that IR produced by
>>>> >>>   `rustc` might exhibit?
>>>> >>>
>>>> >>> - Does anybody know of a good source that describes how to
>>>> >>>   effectively debug a problem like this?
>>>> >>>
>>>> >>> - Does anybody know of a small example program in C/C++
>>>> >>>   that is known to profit from PGO and that could be
>>>> >>>   re-implemented in Rust for comparison?
>>>> >>>
>>>> >>> Thanks a lot for reading! Any help is appreciated.
>>>> >>>
>>>> >>> -Michael
>>>> >>>
>>>> >>> [1]
>>>> >>> https://blog.rust-lang.org/2019/08/15/Rust-1.37.0.html#profile-guided-optimization
>>>> >>> _______________________________________________
>>>> >>> LLVM Developers mailing list
>>>> >>> llvm-dev at lists.llvm.org
>>>> >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Teresa Johnson | Software Engineer | tejohnson at google.com |