[llvm-dev] [RFC] Propeller: A framework for Post Link Optimizations

Eli Friedman via llvm-dev llvm-dev at lists.llvm.org
Wed Sep 25 17:02:00 PDT 2019


My biggest question about this architecture is about when Propeller runs basic block reordering within a function.  It seems like a lot of the complexity comes from using the proposed -fbasicblock-sections to generate mangled ELF, and then re-parsing the mangled ELF as a separate step.  I'm not sure that's the right approach, long-term.

Splitting every basic block into its own section introduces overhead, like you described.  And it's likely more complex on non-x86 targets, which have a greater variety of conditional branches.  And the reordering itself involves a bunch of x86 and ELF-specific code.

I'd like to suggest an alternative: instead of performing basic block reordering and function splitting by manipulating the ELF files, you could perform reordering and splitting as part of link-time code generation, as an MIR pass that runs just before the assembly printer.  MIR is almost exactly the form you want for this sort of manipulation: it has basic blocks which correspond closely to the final binary, and a high-level representation of branch instructions.  And it's before the DWARF/CFI emission, so you don't need to worry about fixing them afterwards.  This should take less code overall, and much less target-specific code.  And infrastructure for function splitting would be useful for non-Propeller workflows.
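
To make that concrete, here is a rough, untested sketch of what such a late MIR pass could look like.  The pass name and the hottest-first policy are hypothetical, and MachineBlockFrequencyInfo stands in for the externally collected Propeller profile discussed below; a real pass would be driven by the actual profile and would have to respect EH/funclet placement constraints:

// Illustrative only: a late MachineFunctionPass (run just before the
// AsmPrinter) that reorders blocks within a function, hottest first.
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineBlockFrequencyInfo.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineOperand.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"
#include <algorithm>

using namespace llvm;

namespace {
class HotBlockReorder : public MachineFunctionPass {
public:
  static char ID;
  HotBlockReorder() : MachineFunctionPass(ID) {}

  void getAnalysisUsage(AnalysisUsage &AU) const override {
    AU.addRequired<MachineBlockFrequencyInfo>();
    MachineFunctionPass::getAnalysisUsage(AU);
  }

  bool runOnMachineFunction(MachineFunction &MF) override {
    auto &MBFI = getAnalysis<MachineBlockFrequencyInfo>();

    // Keep the entry block first; order the rest hottest-first.
    SmallVector<MachineBasicBlock *, 16> Blocks;
    for (MachineBasicBlock &MBB : MF)
      if (&MBB != &MF.front())
        Blocks.push_back(&MBB);
    std::stable_sort(Blocks.begin(), Blocks.end(),
                     [&](MachineBasicBlock *A, MachineBasicBlock *B) {
                       return MBFI.getBlockFreq(A).getFrequency() >
                              MBFI.getBlockFreq(B).getFrequency();
                     });

    // Splice the blocks into the new layout order.
    for (MachineBasicBlock *MBB : Blocks)
      MF.splice(MF.end(), MBB);

    // Re-materialize or delete branches so the new fall-throughs are
    // correct, mirroring what MachineBlockPlacement does after layout.
    const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
    for (MachineBasicBlock &MBB : MF) {
      MachineBasicBlock *TBB = nullptr, *FBB = nullptr;
      SmallVector<MachineOperand, 4> Cond;
      if (!MBB.succ_empty() && !TII->analyzeBranch(MBB, TBB, FBB, Cond))
        MBB.updateTerminator();
    }
    return true;
  }
};
} // end anonymous namespace

char HotBlockReorder::ID = 0;
// The pass would be added near the end of the codegen pipeline, e.g. from
// TargetPassConfig::addPreEmitPass().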

There are some minor downsides to this approach I can think of.  You lose a little flexibility, in that you can't mix blocks from different functions together, but you aren't doing that anyway, from your description?  It might take more CPU time to re-do LTO code generation, but that's probably not a big deal for applications where the user is willing to collect two different kinds of PGO profiles for a program.

I understand you can't use a regular PGO profile for this, but it should be possible to pass in a Propeller-PGO profile separately to the compiler, and apply that to the MIR as a late pass in the compiler, separate from any regular PGO profile.  There are probably multiple ways you could do this; you could use basic block symbols, like your current implementation.
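
Purely as an illustration of that side channel, the profile could be as simple as a text file of <function symbol, basic block number, count> triples, keyed the same way the labeled build names its basic block symbols.  Both the format and the reader below are hypothetical; this is not the output of any existing tool:

// Hypothetical side-channel profile reader: one line per basic block,
//   <function symbol> <basic block number> <execution count>
#include <cstdint>
#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

using BlockKey = std::pair<std::string, unsigned>;

std::map<BlockKey, uint64_t> readPropellerProfile(const std::string &Path) {
  std::map<BlockKey, uint64_t> Counts;
  std::ifstream In(Path);
  std::string Line;
  while (std::getline(In, Line)) {
    std::istringstream SS(Line);
    std::string Func;
    unsigned BBNum;
    uint64_t Count;
    if (SS >> Func >> BBNum >> Count)
      Counts[{Func, BBNum}] = Count;  // later consumed by the late MIR pass
  }
  return Counts;
}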

-Eli

-----Original Message-----
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Sriraman Tallam via llvm-dev
Sent: Tuesday, September 24, 2019 4:51 PM
To: llvm-dev at lists.llvm.org
Subject: [EXT] [llvm-dev] [RFC] Propeller: A framework for Post Link Optimizations

Greetings,

We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer
framework, on large google benchmarks and noticed that it improves key
performance metrics of these benchmarks by 2% to 6%, which is pretty impressive
as this is over and above a baseline binary already heavily optimized with
ThinLTO + PGO.  Furthermore, BOLT is also able to improve the performance of
binaries optimized via Context-Sensitive PGO.  While ThinLTO + PGO is also
profile guided and does very aggressive performance optimizations, there is
more room for performance improvement due to profile approximations made while
applying the transformations.  BOLT uses exact profiles from the final binary
and is able to fill the gaps left by ThinLTO + PGO. The performance
improvements due to BOLT come from basic block layout, function reordering and
function splitting.

While BOLT does an excellent job of squeezing extra performance from highly
optimized binaries with optimizations such as code layout, it has these major
issues:

 * It does not take advantage of distributed build systems.
 * It has scalability issues; to rewrite a binary with a ~300M text segment
   size:
    * Memory footprint is 70G.
    * It takes more than 10 minutes to rewrite the binary.

Similar to Full LTO, BOLT’s design is monolithic: it disassembles the
original binary, then optimizes and rewrites the final binary in one process.
This limits the scalability of BOLT, and the memory and time overheads shoot up
quickly for large binaries.

Inspired by the performance gains and to address the scalability issue of BOLT,
we went about designing a scalable infrastructure that can perform BOLT-like
post-link optimizations. In this RFC, we discuss our system, “Propeller”,
which can perform profile guided link time binary optimizations in a scalable
way and is friendly to distributed build systems.  Our system leverages the
existing capabilities of the compiler toolchain and is not a standalone tool.
Like BOLT, our system boosts the performance of optimized binaries via
link-time optimizations using accurate profiles of the binary. We discuss the
Propeller system and show how to do whole-program basic block layout using
Propeller.

Propeller does whole-program basic block layout at link time via basic block
sections.  We have added support for having each basic block in its own
section, which allows the linker to do arbitrary reorderings of basic blocks to
achieve any desired fine-grained code layout, including block layout, function
splitting and function reordering.  Our experiments on large real-world
applications and SPEC with code layout show that Propeller can optimize as
effectively as BOLT, with just 20% of its memory footprint and time overhead.

An LLVM branch with Propeller patches is available in the git repository here:
https://github.com/google/llvm-propeller/  We will upload individual patches of
the various elements for review.  We have also attached a document describing
the Propeller system and experiments in detail:
https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf


Quick Start - How to optimize further with Propeller?

# git clone and build repo

$ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git

$ mkdir $BUILD_DIR && cd $BUILD_DIR

$ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \
   $LLVM_DIR/llvm-propeller/llvm

$ ninja -j25

$ export PATH=$BUILD_DIR/bin:$PATH


# Let’s Propeller-optimize the following program:
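
# main.cc and callee.cc are not reproduced here; a hypothetical stand-in
# consistent with the command lines below (an iteration count plus a value
# that decides which side of a branch in callee() is hot) might look like:

// main.cc (hypothetical stand-in, not from the RFC)
#include <cstdio>
#include <cstdlib>

int callee(int v);  // defined in callee.cc

int main(int argc, char **argv) {
  long iters = argc > 1 ? std::atol(argv[1]) : 1;  // e.g. 1000000
  int v = argc > 2 ? std::atoi(argv[2]) : 0;       // e.g. 2
  long sum = 0;
  for (long i = 0; i < iters; ++i)
    sum += callee(v);
  std::printf("%ld\n", sum);
  return 0;
}

// callee.cc (hypothetical): for a fixed input only one side of the branch
// ever runs, so the profile marks the other side cold and Propeller can
// split it away from the hot path.
int callee(int v) {
  if (v % 2 == 0)
    return v + 1;    // hot when the second argument is even (2 above)
  return 3 * v - 7;  // cold for that input; a splitting candidate
}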


# Step 1: Build the peak optimized binary with an additional flag.

$ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels -fuse-ld=lld

# Step 2: Profile the binary; with these arguments, only one side of the branch is executed.
$ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >&  /dev/null


# Step 3: Convert the profiles using the tool provided
$ $LLVM_DIR/llvm-propeller/create_llvm_prof  --format=propeller \
  --binary=./a.out.labels --profile=perf.data  --out=perf.propeller


# Step 4: Re-optimize with Propeller; repeat Step 1 with the Propeller flag changed.
$ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller -fuse-ld=lld

In Step 4, the optimized bitcode from Step 1 can be reused if it was saved, as
Propeller is active only during the compile backend and link.  The optimized
binary has a different layout of the basic blocks in main, keeping the executed
blocks together and splitting out the cold blocks.

Some of the key points:

+  Added support for basic block sections, similar to function sections, where
each basic block can reside in its own section.

+  Jump instructions need 32-bit relocations and subsequent linker relaxations
after basic block layout.  We would like to add a new relocation type for jump
instructions to make relaxations easier and to guarantee correctness.

+  Added support in the linker to read profiles (PMU LBR) and discover an
optimal basic block layout using the Ex-TSP algorithm described here:
https://arxiv.org/abs/1809.04676 (a simplified layout sketch follows this list).

+  Added support in the linker to re-order basic block sections in any
specified sequence (this can be done with a symbol ordering file).  This requires
linker relaxations to delete and shrink branches across basic blocks.

+  Compared our system to BOLT and shown that it can produce similar
performance improvements but with much lower memory and time
overheads.  Our experiments are on very large warehouse-scale benchmarks and
SPEC 2017.

+  Listed why this cannot be done as part of PGO itself.  Post-link
optimizations are able to transform the binary using accurate profiles, whereas
PGO passes suffer from profile imprecision.

+  Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks
via basic block sections.

+  Discussed CFI handling, which is currently sub-optimal and bloats object
file and binary sizes much more than DebugInfo does, due to the lack of support
for discontiguous address ranges.  We would like to make a case for supporting
discontiguous ranges in CFI, which would make basic block sections much
cheaper.
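
As an aside, here is a drastically simplified, illustrative chain-merging
layout in the spirit of the Ex-TSP objective referenced above: it greedily
makes the heaviest branch edges into fall-throughs.  The real objective also
rewards short forward and backward jumps, and this is not the algorithm
implemented in the Propeller linker:

#include <algorithm>
#include <cstdint>
#include <list>
#include <vector>

struct Edge {
  unsigned Src, Dst;  // basic block ids; block 0 is the entry
  uint64_t Weight;    // profiled taken count
};

std::vector<unsigned> layoutBlocks(unsigned NumBlocks,
                                   std::vector<Edge> Edges) {
  // Every block starts as a singleton chain.
  std::vector<std::list<unsigned>> Chains(NumBlocks);
  std::vector<unsigned> ChainOf(NumBlocks);
  for (unsigned B = 0; B < NumBlocks; ++B) {
    Chains[B] = {B};
    ChainOf[B] = B;
  }

  // Heaviest edges first; merge two chains when the edge can become a
  // fall-through, i.e. Src is a chain tail and Dst is a chain head.
  std::sort(Edges.begin(), Edges.end(),
            [](const Edge &A, const Edge &B) { return A.Weight > B.Weight; });
  for (const Edge &E : Edges) {
    if (E.Dst == 0)  // keep the entry block at the front of its chain
      continue;
    unsigned CS = ChainOf[E.Src], CD = ChainOf[E.Dst];
    if (CS == CD || Chains[CS].back() != E.Src || Chains[CD].front() != E.Dst)
      continue;
    for (unsigned B : Chains[CD])
      ChainOf[B] = CS;
    Chains[CS].splice(Chains[CS].end(), Chains[CD]);
  }

  // Concatenate the surviving chains.  The entry block's chain is never
  // absorbed, so it is emitted first; a real implementation would also order
  // the remaining chains by hotness and split the cold ones away.
  std::vector<unsigned> Order;
  for (const auto &C : Chains)
    for (unsigned B : C)
      Order.push_back(B);
  return Order;
}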

Detailed RFC document here:
https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf

Please let us know what you think,
Thanks
Sri on behalf of the team.