[llvm-dev] Making Clang/LLVM faster using code layout optimizations

Sriraman Tallam via llvm-dev llvm-dev at lists.llvm.org
Thu Oct 18 14:20:09 PDT 2018


Hi,


  We, the compiler optimization team at Google, have been working for
the past year on further improving the performance of PGO binaries.
We have also evaluated Facebook’s BOLT in this context and we would
like to share our findings and thoughts.

TLDR;  Facebook’s BOLT, a Post Link Optimizer (PLO), is able to
improve the performance of our large well optimized (PGO + ThinLTO)
benchmark binaries but suffers from scalability issues. We believe
that a PLO tool assisted by the linker would solve most of these
issues and we are working on some proposals and prototypes towards
this.


Summary of our findings:

* While PGO gives huge performance speedups, there are lots of
instances where profile information is either inaccurate or absent due
to the various  transformations applied. While the profile counts are
precise at the moment they are read, the compiler cannot maintain the
precision of the profiles after some transformations.

* For example, context sensitive profile information is lost as the
counters for collecting profiles are created before the regular
inliner is invoked.

* Optimization passes such as branch lowering, jump threading, loop
transformation passes introduce additional control-flow and the
profiles for these branches are estimated and can be very inaccurate
leading to non-optimal layout of basic blocks.

* Early this year, before we started evaluating BOLT, we worked on
addressing the problem of loss of profile precision from PGO.

* We have found that an iterative PGO solution, where the second
iteration has optimized context sensitive profiles is able to improve
our key benchmarks by 2%, more details below.

* We recently evaluated BOLT, a post link optimizer (PLO) on the same
benchmarks.  BOLT works on the profiles of the final PGO optimized
binary which captures all the context-sensitive and flow-sensitive
information and hence is not prone to the above issues.

* We found that BOLT gives us similar performance boosts, mainly due
to basic block reordering.   BOLT additionally is able to improve the
same benchmarks by 0.5% due to function splitting.

* However BOLT suffers from scalability issues, ~45G of memory and
more than an hour to process a 700M binary. It also needs to update
debug info and requires rigorous testing as it is a stand-alone tool.

* We believe that a PLO that is assisted by the linker to do a large
part of the work is the right direction forward.

* The PLO will focus on the disassembly and analysis and would use the
linker’s assistance at appropriate points to read the binary and write
the modified binary.

* We have some ideas in this space where the PLO can operate with the
assistance of the linker.  Towards this, we are working on a proposal
and prototype for basic block sections.


Iterative PGO experiment:

* Early this year, before we started evaluating BOLT, we worked on
addressing the problem of loss of profile precision from PGO.

* Instrumentation based PGO is known to have context-insensitive
profiles. The profiles after inline transformation are scaled
uniformly based on the call-site count. Early inlining partially
mitigates the problem but it’s not enough.

* The idea here is to have another round of PGO instrumentation after
inline transformation. This profile will only be used by the
optimizations after inline, like basic block layout.

* The key here is that the second pass profile is precise in terms of
context sensitivity and it will not affect the inlining decisions.

* We compared BOLT with this approach and found that with basic block
reordering alone, this approach does as well as BOLT, if not slightly
better.

* We then used BOLT to further optimize this binary and obtained an
additional speedup of 0.5%, because of function splitting which is not
available in the compiler today.  It should be noted that BOLT’s basic
block reordering did not find any additional opportunities for
performance improvement on this iteratively optimized PGO binary.

* Note that BOLT has other optimizations too but the performance
boosts we obtained were only due to BOLT’s basic block reordering
(cache+ algorithm) and function splitting.


A case for a Post Link Optimizer (PLO) assisted by the linker:

* Our experiments with BOLT show that it is able to work on large
binaries, with some caveats and bug fixes, and does manage to squeeze
performance improvements even on very well optimized binaries.

* We believe that a Post Link Optimizer (PLO) like BOLT is very useful
as it provides a framework to do whole program analysis at machine
code level (on information not exposed at IR level) and enables whole
program transformations that requires global ordering (e.g. :
Interprocedural register allocation).

* We have in fact prototyped an interprocedural spill optimization in
BOLT and found non-trivial performance improvements in some of our
micro-benchmarks.

* However, we feel there are shortcomings of BOLT that need to be addressed :

  + Scalability : Time and memory required currently is not within
acceptable limits for widespread deployment.

  + Debug Info:  BOLT requires fixing debug info after the
transformations, it does a good job of it but more needs to be done
here.

  + QA : It is an independent tool and needs rigorous testing.

* We are looking at improving the scalability of BOLT, in fact, we
have been able to reduce the profile conversion time from 40 minutes
to 2 minutes, but there is still a lot to be done.

* We believe that many parts of the PLO can be done with the linker’s
assistance which can mitigate some of the above shortcomings.

* For instance, we are working on a proposal and prototype for basic
block sections which can let the PLO do basic block reordering and
function splitting in a scalable manner with the linker’s help.  We
will share our proposal for this shortly.


Thanks,
Sri





On Tue, Oct 16, 2018 at 11:45 AM, Maksim Panchenko via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>
> We’d like to share how Clang could be made up to 15% faster using code layout optimizations implemented in BOLT – our open-source LLVM-based post-link optimizer. These improvements are achieved on top of PGO and LTO. We thought the results and the practical value will be interesting to the LLVM community. The detailed step-by-step guide of optimizing Clang is available at:
>
> https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md
>
> While we think that there is room for improvement in LLVM’s code layout algorithm, we strongly believe that, to achieve the maximum benefits for large applications, a post-link optimizer is needed. We would like to know what people think about including BOLT into LLVM family of tools.
>
> More details about BOLT are available in our arXiv.org paper at: https://arxiv.org/pdf/1807.06735.pdf
>
> Maksim & Rafael
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>


More information about the llvm-dev mailing list