<div dir="ltr">Hi Sriraman,<div><br></div><div>This is an impressive piece of work! The results look really good, and the document you provided is very thorough. Looking forward to the patches :)</div><div><br></div><div>Best,</div><div><br></div><div>-- </div><div>Mehdi</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Sep 24, 2019 at 4:52 PM Sriraman Tallam via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">Greetings,<br>

<br>

We, at Google, recently evaluated Facebook’s BOLT, a Post Link Optimizer<br>

framework, on large Google benchmarks and noticed that it improves key<br>

performance metrics of these benchmarks by 2% to 6%, which is pretty impressive<br>

as this is over and above a baseline binary already heavily optimized with<br>

ThinLTO + PGO.  Furthermore, BOLT is also able to improve the performance of<br>

binaries optimized via Context-Sensitive PGO.  While ThinLTO + PGO is also<br>

profile guided and does very aggressive performance optimizations, there is<br>

more room for performance improvements due to profile approximations while<br>

applying the transformations.  BOLT uses exact profiles from the final binary<br>

and is able to fill the gaps left by ThinLTO + PGO. The performance<br>

improvements due to BOLT come from basic block layout, function reordering and<br>

function splitting.<br>

<br>

While BOLT does an excellent job of squeezing extra performance from highly<br>

optimized binaries with optimizations such as code layout, it has these major<br>

issues:<br>

<br>

 * It does not take advantage of distributed build systems.<br>

 * It has scalability issues. To rewrite a binary with a ~300M text segment:<br>

    * The memory footprint is 70G.<br>

    * It takes more than 10 minutes to rewrite the binary.<br>

<br>

Similar to Full LTO, BOLT’s design is monolithic: it disassembles the<br>

original binary, optimizes it, and rewrites the final binary in one process.  This<br>

limits the scalability of BOLT, and its memory and time overheads shoot up<br>

quickly for large binaries.<br>

<br>

Inspired by the performance gains and to address the scalability issue of BOLT,<br>

we went about designing a scalable infrastructure that can perform BOLT-like<br>

post-link optimizations. In this RFC, we discuss our system, “Propeller”,<br>

which can perform profile guided link time binary optimizations in a scalable<br>

way and is friendly to distributed build systems.  Our system leverages the<br>

existing capabilities of the compiler tool-chain and is not a standalone tool.<br>

Like BOLT, our system boosts the performance of optimized binaries via<br>

link-time optimizations using accurate profiles of the binary. We discuss the<br>

Propeller system and show how to do the whole program basic block layout using<br>

Propeller.<br>

<br>

Propeller does whole program basic block layout at link time via basic block<br>

sections.  We have added support for having each basic block in its own section<br>

which allows the linker to do arbitrary reorderings of basic blocks to achieve<br>

any desired fine-grain code layout which includes block layout, function<br>

splitting and function reordering.  Our experiments on large real-world<br>

applications and SPEC with code layout show that Propeller can optimize as<br>

effectively as BOLT, with just 20% of its memory footprint and time overhead.<br>

<br>

An LLVM branch with propeller patches is available in the git repository here:<br>

<a href="https://github.com/google/llvm-propeller/" rel="noreferrer" target="_blank">https://github.com/google/llvm-propeller/</a>  We will upload individual patches of<br>

the various elements for review.  We have attached a Google doc describing the<br>

Propeller system and our experiments in detail:<br>

<a href="https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf" rel="noreferrer" target="_blank">https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf</a><br>

<br>

<br>

Quick Start - How to optimize further with Propeller?<br>

<br>

# git clone and build repo<br>

<br>

$ cd $LLVM_DIR && git clone <a href="https://github.com/google/llvm-propeller.git" rel="noreferrer" target="_blank">https://github.com/google/llvm-propeller.git</a><br>

<br>

$ mkdir $BUILD_DIR && cd $BUILD_DIR<br>

<br>

$ cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt" … \<br>

   $LLVM_DIR/llvm-propeller/llvm<br>

<br>

$ ninja -j25<br>

<br>

$ export PATH=$BUILD_DIR/bin:$PATH<br>

<br>

<br>

# Let’s Propeller-optimize the following program:<br>

<br>

<br>

# Step 1: Build the peak optimized binary with an additional flag.<br>

<br>

$ clang++ -O2 main.cc callee.cc -fpropeller-label -o a.out.labels -fuse-ld=lld<br>

<br>

# Step 2: Profile the binary; only one side of the branch is executed.<br>

$ perf record -e cycles:u -j any,u -- ./a.out.labels 1000000 2 >&  /dev/null<br>

<br>

<br>

# Step 3: Convert the profiles using the tool provided<br>

$ $LLVM_DIR/llvm-propeller/create_llvm_prof  --format=propeller \<br>

  --binary=./a.out.labels --profile=perf.data  --out=perf.propeller<br>

<br>

<br>

# Step 4: Re-optimize with Propeller: repeat Step 1 with the Propeller flag changed.<br>

$ clang++ -O2 main.cc callee.cc -fpropeller-optimize=perf.propeller -fuse-ld=lld<br>

<br>

In Step 4, the optimized bitcode can be reused if it was saved in Step 1, as<br>

Propeller is active only during the compiler backend and the link.  The optimized<br>

binary lays out the basic blocks of main differently, keeping the executed<br>

blocks together and splitting out the cold blocks.<br>
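
<br>

As a rough illustration of that last point, here is a minimal sketch of<br>

hot/cold splitting driven by execution counts.  The block names and counts are<br>

hypothetical (they are not taken from the example program), and the function is<br>

an assumption for illustration, not Propeller's actual implementation:<br>

```python
def split_hot_cold(blocks, counts):
    """Partition blocks into (hot, cold), preserving the original order.

    The entry block (index 0) is always kept in the hot part so that the
    function still starts at its entry.
    """
    hot = [b for i, b in enumerate(blocks) if i == 0 or counts.get(b, 0) > 0]
    cold = [b for b in blocks if b not in hot]
    return hot, cold


# Hypothetical blocks of main: only the "then" side of the branch ever ran.
blocks = ["entry", "loop", "then", "else", "exit"]
counts = {"entry": 1, "loop": 1000000, "then": 1000000, "else": 0, "exit": 1}

hot, cold = split_hot_cold(blocks, counts)
print("hot :", hot)
print("cold:", cold)
```

With basic block sections, each name above would correspond to its own section,<br>

so the linker is free to place the cold list far away from the hot list.<br>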

<br>

Some of the key points:<br>

<br>

+  Added support for basic block sections, similar to function sections, where<br>

each basic block can reside in its own section.<br>

<br>

+  Jump instructions need 32-bit relocations and subsequent linker relaxations<br>

after basic block layout.  We would like to add a new relocation type for jump<br>

instructions to make relaxation easier and to guarantee correctness.<br>

<br>

+  Added support in the linker to read profiles (PMU LBR) and discover optimal<br>

basic block layout using the Ex-TSP algorithm described here:<br>

<a href="https://arxiv.org/abs/1809.04676" rel="noreferrer" target="_blank">https://arxiv.org/abs/1809.04676</a><br>

<br>

+  Added support in the linker to re-order basic block sections in any<br>

specified sequence (can be done with symbol ordering file).  This requires<br>

linker relaxations to delete and shrink branches across basic blocks.<br>

<br>

+  Compared our system to BOLT and showed that it produces performance<br>

improvements similar to BOLT's, with much lower memory and time<br>

overheads.  Our experiments are on very large warehouse-scale benchmarks and<br>

SPEC 2017.<br>

<br>

+  Listed why this cannot be done as part of PGO itself.  Post Link<br>

optimizations are able to transform the binary using accurate profiles, whereas PGO<br>

passes suffer from profile imprecision.<br>

<br>

+  Updated DebugInfo and CFI to account for arbitrary ordering of basic blocks<br>

via basic block sections.<br>

<br>

+  Discussed CFI handling, which is sub-optimal and bloats object file sizes and<br>

binary sizes much more than DebugInfo does, due to the lack of support for discontiguous<br>

address ranges.  We have talked about this and would like to make a case for<br>

supporting discontiguous ranges in CFI, which will make basic block sections much<br>

cheaper.<br>
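
<br>

To make the Ex-TSP layout objective from the key points above more concrete,<br>

here is a minimal sketch of a scoring function in that style.  The weights and<br>

distance limits are the values as I recall them from the linked paper, so treat<br>

them as assumptions and check the paper; the block names, sizes and edge counts<br>

are hypothetical, and this is not the code used in the linker:<br>

```python
# Assumed constants (recalled from the Ex-TSP paper, not taken from Propeller).
FORWARD_WEIGHT, FORWARD_DISTANCE = 0.1, 1024
BACKWARD_WEIGHT, BACKWARD_DISTANCE = 0.1, 640


def ext_tsp_score(order, size, edges):
    """Score a block order: fall-throughs count fully, short jumps partially."""
    # Assign each block a start address according to the proposed order.
    start, addr = {}, 0
    for block in order:
        start[block] = addr
        addr += size[block]

    score = 0.0
    for (src, dst), count in edges.items():
        src_end = start[src] + size[src]
        if start[dst] == src_end:
            # dst immediately follows src: the branch becomes a fall-through.
            score += count
        elif start[dst] > src_end:
            d = start[dst] - src_end  # forward jump distance in bytes
            score += count * FORWARD_WEIGHT * max(0.0, 1 - d / FORWARD_DISTANCE)
        else:
            d = src_end - start[dst]  # backward jump distance in bytes
            score += count * BACKWARD_WEIGHT * max(0.0, 1 - d / BACKWARD_DISTANCE)
    return score


# Hypothetical profile: A branches to C far more often than to B.
size = {"A": 16, "B": 16, "C": 16}
edges = {("A", "B"): 1, ("A", "C"): 99}

original = ext_tsp_score(["A", "B", "C"], size, edges)
reordered = ext_tsp_score(["A", "C", "B"], size, edges)
print(original, reordered)
```

Placing C right after A turns the frequent branch into a fall-through, so the<br>

reordered layout scores higher than the original one.<br>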

<br>
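
The linker relaxations mentioned in the key points above (deleting and<br>

shrinking branches after reordering) can be sketched as follows.  This is a<br>

simplified model that assumes every block ends in at most one unconditional<br>

x86-64 jump; the function name, block names and sizes are hypothetical, and it<br>

is not the relaxation code from the Propeller patches:<br>

```python
LONG_JMP = 5   # x86-64 "jmp rel32": opcode E9 plus a 4-byte displacement
SHORT_JMP = 2  # x86-64 "jmp rel8": opcode EB plus a 1-byte displacement


def relax_jumps(order, body_size, jump_target):
    """Return the final size in bytes of each block's trailing jump."""
    # Start pessimistic: every trailing jump uses the long encoding.
    jmp_size = {b: LONG_JMP for b in order if jump_target.get(b) is not None}

    # A jump to the block laid out right after it is a fall-through: delete it.
    for cur, nxt in zip(order, order[1:]):
        if jump_target.get(cur) == nxt:
            jmp_size[cur] = 0

    # Shrinking one jump can bring other targets into range, so iterate
    # until no jump changes.  Sizes only decrease, so this terminates.
    changed = True
    while changed:
        changed = False
        start, addr = {}, 0
        for b in order:
            start[b] = addr
            addr += body_size[b] + jmp_size.get(b, 0)
        for b in order:
            if jmp_size.get(b) != LONG_JMP:
                continue
            jmp_addr = start[b] + body_size[b]
            target = start[jump_target[b]]
            if target > jmp_addr:
                # Forward: the target also moves closer when this jump shrinks.
                disp = target - (jmp_addr + LONG_JMP)
            else:
                # Backward: measured from the end of the short encoding.
                disp = target - (jmp_addr + SHORT_JMP)
            if 127 >= disp >= -128:
                jmp_size[b] = SHORT_JMP
                changed = True
    return jmp_size


order = ["entry", "hot", "cold"]
body_size = {"entry": 10, "hot": 20, "cold": 300}
jump_target = {"entry": "hot", "hot": "entry", "cold": "entry"}
sizes = relax_jumps(order, body_size, jump_target)
print(sizes)
```

In this toy layout the jump from "entry" is deleted, the short backward jump in<br>

"hot" shrinks to two bytes, and the jump in "cold" stays in its long form.<br>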

Detailed RFC document here:<br>

<a href="https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf" rel="noreferrer" target="_blank">https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf</a><br>

<br>

Please let us know what you think,<br>

Thanks<br>

Sri on behalf of the team.<br>


</blockquote></div>