[llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations

Wed Sep 25 01:36:07 PDT 2019

Hi Sriraman,

This is an impressive piece of work! The results look really good, and the
document you provided is very thorough. Looking forward to the patches :)

Best,

-- 
Mehdi

On Tue, Sep 24, 2019 at 4:52 PM Sriraman Tallam via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Greetings,
>
> We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer
> framework, on large Google benchmarks and noticed that it improves key
> performance metrics of these benchmarks by 2% to 6%, which is pretty
> impressive as this is over and above a baseline binary already heavily
> optimized with ThinLTO + PGO.  Furthermore, BOLT is also able to improve
> the performance of binaries optimized via Context-Sensitive PGO.  While
> ThinLTO + PGO is also profile guided and does very aggressive performance
> optimizations, there is more room for performance improvements due to
> profile approximations while applying the transformations.  BOLT uses
> exact profiles from the final binary and is able to fill the gaps left by
> ThinLTO + PGO.  The performance improvements due to BOLT come from basic
> block layout, function reordering and function splitting.
>
> While BOLT does an excellent job of squeezing extra performance from highly
> optimized binaries with optimizations such as code layout, it has these
> major issues:
>
>  * It does not take advantage of distributed build systems.
>  * It has scalability issues.  To rewrite a binary with a ~300M text
>    segment:
>     * The memory footprint is 70G.
>     * It takes more than 10 minutes to rewrite the binary.
>
> Similar to Full LTO, BOLT’s design is monolithic as it disassembles the
> original binary, optimizes and rewrites the final binary in one process.
> This limits the scalability of BOLT, and the memory and time overheads
> shoot up quickly for large binaries.
>
> Inspired by the performance gains and to address the scalability issue of
> BOLT, we went about designing a scalable infrastructure that can perform
> BOLT-like post-link optimizations.  In this RFC, we discuss our system,
> “Propeller”, which can perform profile guided link time binary
> optimizations in a scalable way and is friendly to distributed build
> systems.  Our system leverages the existing capabilities of the compiler
> tool-chain and is not a standalone tool.  Like BOLT, our system boosts the
> performance of optimized binaries via link-time optimizations using
> accurate profiles of the binary.  We discuss the Propeller system and show
> how to do the whole program basic block layout using Propeller.
>
> Propeller does whole program basic block layout at link time via basic
> block sections.  We have added support for having each basic block in its
> own section, which allows the linker to do arbitrary reorderings of basic
> blocks to achieve any desired fine-grained code layout, including block
> layout, function splitting and function reordering.  Our experiments on
> large real-world applications and SPEC with code layout show that
> Propeller can optimize as effectively as BOLT, with just 20% of its memory
> footprint and time overhead.
>
> An LLVM branch with Propeller patches is available in the git repository
> here: https://github.com/google/llvm-propeller/  We will upload individual
> patches of the various elements for review.  We have attached a Google doc
> describing the Propeller system with experiments in detail:
> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
>
>
> Quick Start - How to optimize further with Propeller?
>
> # git clone and build repo
>
> $ cd $LLVM_DIR && git clone https://github.com/google/llvm-propeller.git
>
> $ mkdir $BUILD_DIR && cd $BUILD_DIR
>
> $ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \
>    $LLVM_DIR/llvm-propeller/llvm
>
> $ ninja -j25
>
> $ export PATH=$BUILD_DIR/bin:$PATH
>
>
> # Let’s Propeller-optimize the following program:
>
>
> # Step 1: Build the peak optimized binary with an additional flag.
>
> $ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels \
>    -fuse-ld=lld
>
> # Step 2: Profile the binary; only one side of the branch is executed.
> $ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 \
>    >& /dev/null
>
>
> # Step 3: Convert the profiles using the tool provided
> $ $LLVM_DIR/llvm-propeller/create_llvm_prof  --format=propeller \
>   --binary=./a.out.labels --profile=perf.data  --out=perf.propeller
>
>
> # Step 4: Re-optimize with Propeller; repeat Step 1 with the Propeller flag
> # changed.
> $ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller \
>    -fuse-ld=lld
>
> In Step 4, the optimized bitcode can be used if it is saved in Step 1, as
> Propeller is active only during the compiler backend and link steps.  The
> optimized binary has a different layout of the basic blocks in main to
> keep the executed blocks together and split off the cold blocks.
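
To make the hot/cold split concrete, here is a minimal Python sketch that
places the executed blocks first and moves never-executed blocks to the end,
then emits one symbol per line in the style of a linker symbol ordering file.
The block names (`main.bb.0` and so on), the counts, and the helper
`hot_cold_order` are hypothetical and are not taken from the Propeller
patches.

```python
def hot_cold_order(blocks, counts):
    """Return blocks with executed ones first and cold ones last.

    blocks: basic block symbols in their original order (hypothetical names).
    counts: symbol -> execution count from a profile; missing means 0.
    """
    hot = [b for b in blocks if counts.get(b, 0) > 0]
    cold = [b for b in blocks if counts.get(b, 0) == 0]
    return hot + cold


# Hypothetical profile for main: only one side of the branch ever runs,
# so main.bb.1 has no samples at all.
blocks = ["main.bb.0", "main.bb.1", "main.bb.2", "main.bb.3"]
counts = {"main.bb.0": 1000000, "main.bb.2": 1000000, "main.bb.3": 1000000}

order = hot_cold_order(blocks, counts)
# One symbol per line, the shape a symbol ordering file would take.
ordering_file_text = "\n".join(order) + "\n"
print(ordering_file_text, end="")
```

A real layout would also reorder the hot blocks among themselves; this
sketch only shows the split.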
>
> Some of the key points:
>
> +  Added support for basic block sections, similar to function sections,
> where each basic block can reside in its own section.
>
> +  Jump instructions need 32-bit relocations and subsequent linker
> relaxations after basic block layout.  We would like to add a new
> relocation type for jump instructions to make relaxation easier and to
> guarantee correctness.
>
> +  Added support in the linker to read profiles (PMU LBR) and discover
> optimal basic block layout using the Ex-TSP algorithm described here:
> https://arxiv.org/abs/1809.04676
>
> +  Added support in the linker to re-order basic block sections in any
> specified sequence (can be done with symbol ordering file).  This requires
> linker relaxations to delete and shrink branches across basic blocks.
>
> +  Compared our system to BOLT and have shown that our system can produce
> performance improvements similar to BOLT's but with much lower memory and
> time overheads.  Our experiments are on very large warehouse-scale
> benchmarks and SPEC 2017.
>
> +  Listed why this cannot be done as part of PGO itself.  Post Link
> optimizations are able to transform the binary using accurate profiles,
> whereas PGO passes suffer from profile imprecision.
>
> +  Updated DebugInfo and CFI to account for arbitrary ordering of basic
> blocks via basic block sections.
>
> +  Discussed CFI handling, which is sub-optimal and bloats object file and
> binary sizes much more than DebugInfo does, due to the lack of support for
> discontiguous address ranges.  We have talked about this and would like to
> make a case for supporting discontiguous ranges with CFI, which will make
> basic block sections much cheaper.
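
The Ex-TSP objective named in the key points can be illustrated with a small
scoring function. This is a sketch of the scoring idea only, not the linker
implementation: the weights and distance limits are assumed from the cited
paper, and the block names, sizes and edge counts are made up for the
example.

```python
# Assumed constants (taken to match the cited Ex-TSP paper).
FORWARD_WEIGHT = 0.1
BACKWARD_WEIGHT = 0.1
FORWARD_DISTANCE = 1024   # bytes
BACKWARD_DISTANCE = 640   # bytes


def ext_tsp_score(order, sizes, edges):
    """Score a block order: fall-throughs count fully, short jumps partially.

    order: list of block names in layout order.
    sizes: block name -> size in bytes.
    edges: (src, dst) -> execution count of that branch.
    """
    addr, offset = {}, 0
    for block in order:
        addr[block] = offset
        offset += sizes[block]

    score = 0.0
    for (src, dst), count in edges.items():
        src_end = addr[src] + sizes[src]
        dst_start = addr[dst]
        if dst_start == src_end:
            score += count                      # fall-through
        elif dst_start > src_end:
            dist = dst_start - src_end          # forward jump
            if dist < FORWARD_DISTANCE:
                score += count * FORWARD_WEIGHT * (1 - dist / FORWARD_DISTANCE)
        else:
            dist = src_end - dst_start          # backward jump
            if dist < BACKWARD_DISTANCE:
                score += count * BACKWARD_WEIGHT * (1 - dist / BACKWARD_DISTANCE)
    return score


# Hypothetical function with one hot path and one cold block.
sizes = {"entry": 16, "cold": 32, "hot": 16, "exit": 8}
edges = {("entry", "hot"): 1000, ("entry", "cold"): 1,
         ("hot", "exit"): 1000, ("cold", "exit"): 1}

before = ext_tsp_score(["entry", "cold", "hot", "exit"], sizes, edges)
after = ext_tsp_score(["entry", "hot", "exit", "cold"], sizes, edges)
print(before, after)
```

The second order turns both hot branches into fall-throughs, so it scores
higher; a layout algorithm searches for an order that maximizes this score.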
>
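The need to delete and shrink branches after reordering can be sketched as an
iterative relaxation pass. The sketch below assumes x86-64 encodings (a
2-byte `jmp rel8` and a 5-byte `jmp rel32`), considers only a single
unconditional jump at the end of each block, and uses hypothetical block
names and sizes; it is not the lld implementation.

```python
SHORT_JMP = 2   # x86-64 `jmp rel8`
NEAR_JMP = 5    # x86-64 `jmp rel32`


def relax_jumps(order, body_sizes, jump_target):
    """Pick a jump size (0, 2 or 5 bytes) for each block's trailing jump.

    order: block names in final layout order.
    body_sizes: block name -> bytes of the block excluding its trailing jump.
    jump_target: block name -> target block of its trailing jump.
    """
    jump_size = {b: NEAR_JMP for b in jump_target}

    # A jump to the block placed immediately after is a fall-through: delete it.
    for i, block in enumerate(order[:-1]):
        if jump_target.get(block) == order[i + 1]:
            jump_size[block] = 0

    # Shrinking one jump can bring another into rel8 range, so iterate.
    changed = True
    while changed:
        changed = False
        addr, offset = {}, 0
        for block in order:
            addr[block] = offset
            offset += body_sizes[block] + jump_size.get(block, 0)
        for block, target in jump_target.items():
            if jump_size[block] != NEAR_JMP:
                continue
            # Conservative check using the current (5-byte) encoding.
            end = addr[block] + body_sizes[block] + jump_size[block]
            disp = addr[target] - end
            if -128 <= disp <= 127:
                jump_size[block] = SHORT_JMP
                changed = True
    return jump_size


order = ["A", "B", "C", "D", "E"]
body_sizes = {"A": 10, "B": 20, "C": 100, "D": 300, "E": 4}
jump_target = {"A": "B", "B": "D", "C": "B", "D": "A"}
result = relax_jumps(order, body_sizes, jump_target)
print(result)
```

Here the jump in `C` only fits a short encoding after the jump in `B` has
been shrunk, which is why the pass repeats until nothing changes.
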
> Detailed RFC document here:
> https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
>
> Please let us know what you think,
> Thanks
> Sri on behalf of the team.