[llvm-dev] PGO is ineffective for Rust - but why?

Thu Sep 12 02:18:25 PDT 2019

Hi everyone,

As part of my work for Mozilla's Low Level Tools team I've
implemented PGO in the Rust compiler. The feature has been
available since Rust 1.37 [1]. However, so far we have not
seen any actual performance gains from enabling PGO for
Rust code. Performance even seems to drop by 1-3% with PGO
enabled. I wonder why that is, and I'm hoping that someone
here might have experience debugging PGO effectiveness.

PGO in the Rust compiler
------------------------

The Rust compiler uses IR-level instrumentation (the
equivalent of Clang's `-fprofile-generate`/`-fprofile-use`).
This has worked pretty well and even enables doing PGO for
mixed Rust/C++ codebases when also using Clang.

The Rust compiler has regression tests that make sure that:

- instrumentation shows up in LLVM IR for the `generate` phase,
  and that

- profiling data is actually used during the `use` phase, i.e.
  that cold functions get marked with `cold` and hot functions
  get marked with `inlinehint`.

I also verified manually that `branch_weights` are being set
in IR. So, from my perspective, the PGO implementation does
what it is supposed to do.
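
For illustration, here is a minimal sketch of the kind of
program such a manual check could be run against. The function
names, the workload, and the paths in the comments are
assumptions made up for this example; they are not taken from
the Rust compiler's test suite. The branch in `run` is heavily
skewed, so it is the sort of place where one would look for
`branch_weights` in the emitted IR.

```rust
// Hypothetical PGO test program (names and workload are illustrative).
//
// One possible way to build it; all paths are placeholders:
//   rustc -O -Cprofile-generate=/tmp/pgo-data main.rs
//   ./main
//   llvm-profdata merge -o /tmp/merged.profdata /tmp/pgo-data
//   rustc -O -Cprofile-use=/tmp/merged.profdata --emit=llvm-ir main.rs
// The resulting .ll file can then be inspected by hand.

use std::hint::black_box;

// Taken on almost every iteration of the loop in `run`.
fn hot_function(x: u64) -> u64 {
    x.wrapping_mul(31).wrapping_add(7)
}

// Taken only once every 100_000 iterations.
fn cold_function(x: u64) -> u64 {
    x ^ 0xdead_beef
}

// Drives both functions through a heavily skewed branch.
fn run(iterations: u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..iterations {
        // black_box keeps the optimizer from resolving the branch statically.
        if black_box(i) % 100_000 == 99_999 {
            acc = cold_function(acc);
        } else {
            acc = hot_function(acc);
        }
    }
    acc
}

fn main() {
    println!("{}", run(1_000_000));
}
```

Note that at `-O` such small functions may simply be inlined
into `run`, so whether function attributes remain visible in
the IR depends on how the program is structured.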

However, as already mentioned, in all benchmarks I've seen so
far performance seems to stay the same at best and often even
suffers slightly. This is surprising because, for C++ code,
Clang's version of IR-level instrumentation & PGO brings
significant gains (up to 5-10% from what I've seen in
benchmarks for Firefox).

One thing we noticed early on is that disabling the
pre-inlining pass (`-disable-preinline`) seems to consistently
improve the situation for Rust code. With that pass disabled,
we sometimes see performance wins of almost 1% over not using
PGO. This again is very different from C++, where disabling
this pass causes dramatic performance losses for the Firefox
benchmarks. And a 1% performance improvement is still well
below expectations, I think.

So my questions to you are:

- Has anybody here observed something similar while
  working on or with PGO?

- Are there certain known characteristics of LLVM IR code
  that inhibit PGO's effectiveness and that IR produced by
  `rustc` might exhibit?

- Does anybody know of a good source that describes how to
  effectively debug a problem like this?

- Does anybody know of a small example program in C/C++
  that is known to profit from PGO and that could be
  re-implemented in Rust for comparison?

Thanks a lot for reading! Any help is appreciated.

-Michael

[1] https://blog.rust-lang.org/2019/08/15/Rust-1.37.0.html#profile-guided-optimization