[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Thu Sep 26 12:03:07 PDT 2019

On Wed, Sep 25, 2019 at 3:44 PM Michael Spencer <bigcheesegs at gmail.com> wrote:
>
> On Tue, Sep 24, 2019 at 4:52 PM Sriraman Tallam via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Greetings,
>>
>> We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer
>> framework, on large google benchmarks and noticed that it improves key
>> performance metrics of these benchmarks by 2% to 6%, which is pretty impressive
>> as this is over and above a baseline binaryalready heavily optimized with
>> ThinLTO + PGO.  Furthermore, BOLT is also able to improve the performance of
>> binaries optimized via Context-Sensitive PGO.     While ThinLTO + PGO is also
>> profile guided and does very aggressive performance optimizations, there is
>> more room for performance improvements due to profile approximations while
>> applying the transformations.  BOLT uses exact profiles from the final binary
>> and is able to fill the gaps left by ThinLTO + PGO. The performance
>> improvements due to BOLT come from basic block layout, function reordering and
>> function splitting.
>>
>> While BOLT does an excellent job of squeezing extra performance from highly
>> optimized binaries with optimizations such as code layout, it has these major
>> issues:
>>
>>  * It does not take advantage of distributed build systems.
>>  * It has scalability issues and to rewrite a binary with a ~300M text segment
>> size:
>>  * Memory foot-print is 70G.
>>  * It takes more than 10 minutes to rewrite the binary.
>>
>> Similar to Full LTO, BOLT’s design is monolithic as it disassembles the
>> original binary, optimizes and rewrites the final binary in one process.  This
>> limits the scalability of BOLT and the memory and time overhead shoots up
>> quickly for large binaries.
>>
>> Inspired by the performance gains and to address the scalability issue of BOLT,
>> we went about designing a scalable infrastructure that can perform BOLT-like
>> post-link optimizations. In this RFC, we discuss our system, “Propeller”,
>> which can perform profile guided link time binary optimizations in a scalable
>> way and is friendly to distributed build systems.  Our system leverages the
>> existing capabilities of the compiler tool-chain and is not a stand alone tool.
>> Like BOLT, our system boosts the performance of optimized binaries via
>> link-time optimizations using accurate profiles of the binary. We discuss the
>> Propeller system and show how to do the whole program basic block layout using
>> Propeller.
>>
>> Propeller does whole program basic block layout at link time via basic block
>> sections.  We have added support for having each basic block in its own section
>> which allows the linker to do arbitrary reorderings of basic blocks to achieve
>> any desired fine-grain code layout which includes block layout, function
>> splitting and function reordering.  Our experiments on large real-world
>> applications and SPEC with code layout show that Propeller can optimize as
>> effectively as BOLT, with just 20% of its memory footprint and time overhead.
>>
>> An LLVM branch with propeller patches is available in the git repository here:
>> https://github.com/google/llvm-propeller/  We will upload individual patches of
>> the various elements for review.  We have attached a google doc describing the
>> Propeller system with Experiments in detail,
>> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
>>
>>
>> Quick Start - How to optimize further with Propeller?
>>
>> # git clone and build repo
>>
>> $ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git
>>
>> $ mkdir $BUILD_DIR && cd $BUILD_DIR
>>
>> $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \
>>    $LLVM_DIR/llvm-propeller/llvm
>>
>> $ ninja -j25
>>
>> $ export PATH=$BUILD_DIR/bin:$PATH
>>
>>
>> # Let’s Propeller-optimize the following program:
>>
>>
>> # Step 1: Build the peak optimized binary with an additional flag.
>>
>> $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels -fuse-ld=lld
>>
>> # Step 2: Profile the binary, only one side of the branch is executed.
>> $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >&  /dev/null
>>
>>
>> # Step 3: Convert the profiles using the tool provided
>> $ $LLVM_DIR/llvm-propeller/create_llvm_prof  --format=propeller \
>>   --binary=./a.out.labels --profile=perf.data  --out=perf.propeller
>>
>>
>> # Step 4: Re-Optimize with Propeller, repeat Step 1 with propeller flag changed.
>> $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller -fuse-ld=lld
>>
>> In Step 4, the optimized bit code can be used if it is saved in Step1 as
>> Propeller is active only during compile backend and link.  The optimized binary
>> has a different layout of the basic blocks in main to keep the executed blocks
>> together and split the cold blocks.
>>
>> Some of the key points:
>>
>> +  Added support for basic block sections, similar to function sections, where
>> each basic block can reside in its own section.
>>
>> +  Jump instructions need 32-bit relocations and subsequent linker relaxations
>> after basic block layout.  We would like to add a new relocation type for jump
>> instructions to make it easier for relaxations and guarantee correctness.
>>
>> +  Added support in the linker to read profiles (PMU LBR) and discover optimal
>> basic block layout using the Ex-TSP algorithm described here:
>> https://arxiv.org/abs/1809.04676
>>
>> +  Added support in the linker to re-order basic block sections in any
>> specified sequence (can be done with symbol ordering file).  This requires
>> linker relaxations to delete and shrink branches across basic blocks.
>>
>> +  Compared our system to BOLT  and have shown that our system can produce
>> similar performance improvements as BOLT but with much less memory and time
>> overheads.  Our experiments are on very large warehouse-scale benchmarks and
>> SPEC 2017.
>>
>> +  Listed why this cannot be done as part of PGO itself.  Post Link
>> optimizations are able to transform the binary using accurate profiles and PGO
>> passes suffer from profile imprecision.
>>
>> +  Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks
>> via basic block sections.
>>
>> +  Discussed CFI handling  and is sub-optimal and bloats object file sizes and
>> binary sizes much more than DebugInfo due to lack of support for discontiguous
>> address ranges.  We have talked about this and would like to make a case to
>> support discontiguous ranges with CFI which will make basic block sections much
>> more cheaper.
>>
>> Detailed RFC document here :
>> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
>>
>> Please let us know what you think,
>> Thanks
>> Sri on behalf of the team.
>
>
> The function ordering parts of this look like they are doing the same thing as lld's existing function reordering code, just with a different profile source and different ordering algorithm. Have you compared the performance of just the function reordering bits to what we already have (https://github.com/google/llvm-propeller/blob/plo-dev/lld/ELF/CallGraphSort.cpp)?

We looked at the code and it looks like the function reordering parts
are using the hfsort algorithm, both  CallgraphSort and Propeller.
So, we can merge these.  When callgraphsort option is passed,
Propeller  will invoke the hfsort code to do function layout.  We
prefer doing this as a follow-up patch if that is alright as this
requires more plumbing.

>
> I would expect that whichever ordering algorithm is best would be the best for both profile sources, so I would recommend looking at integrating the function reordering code with the existing code instead of adding a separate one.
>
> - Michael Spencer