<div dir="ltr">Hello,<div><br></div><div>I wanted to consolidate all the discussions and our final thoughts on the concerns raised. I have attached a document that consolidates them.</div><div><br></div><div>BOLT’s performance gains inspired this work, and we believe BOLT is a great piece of engineering. However, there are build environments where <br>scalability is critical and memory limits per process are tight:<br><br>* Debug Fission, <a href="https://gcc.gnu.org/wiki/DebugFission">https://gcc.gnu.org/wiki/DebugFission</a>, was primarily <br>invented to achieve scalability and better incremental build times while <br>building large binaries with debug information.<br><br>* ThinLTO, <br><a href="http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html">http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html</a>, was <br>primarily invented to make LLVM’s full LTO scalable and keep the memory and <br>time overheads low. ThinLTO has enabled much broader adoption of whole <br>program optimization by making it non-monolithic.<br><br>* For Chromium builds, <br><a href="https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolchain/concurrent_links.gni">https://chromium-review.googlesource.com/c/chromium/src/+/695714/3/build/toolchain/concurrent_links.gni</a>, <br>the linker process memory is set to 10GB with ThinLTO. <br>It was 26GB with full LTO before that, and individual processes will run out <br>of memory beyond that.<br><br>* Here, <br><a href="https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistributedBuildSystemAtGoogleScale.pdf">https://gotocon.com/dl/goto-chicago-2016/slides/AysyluGreenberg_BuildingADistributedBuildSystemAtGoogleScale.pdf</a>, <br>a distributed build system at Google scale <br>is shown where 5 million binary and test builds are performed every day on <br>several thousand machines, each with a limit of 12GB of memory per <br>process and a 15-minute time-out on tests. 
Memory overheads of 35GB (for clang) are <br>well above these thresholds.<br><br>We have developed Propeller, which, like ThinLTO, can be used in such environments <br>to obtain performance gains similar to BOLT’s.<br></div><div><br></div><div>Thanks</div><div>Sri</div><div> </div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Oct 11, 2019 at 11:25 AM Xinliang David Li via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Oct 11, 2019 at 10:46 AM James Y Knight via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Is there much value in deferring the block ordering to link time? That is, does the block layout algorithm need to consider global layout issues when deciding which blocks to put together and which to relegate to the far-away part of the code?<div><br></div><div>Or, could the Propeller-optimized compile step instead split each function into only 2 pieces -- one containing an "optimally-ordered" set of hot blocks from the function, and another containing the cold blocks? The linker would have less flexibility in placement, but maybe it doesn't actually need that flexibility?</div><div><br></div><div>Apologies if this is obvious for those who actually know what they're talking about here. :)</div></div></blockquote><div><br></div><div>It is a fair question. </div><div><br></div><div>We believe the flexibility to do fine-grained layout in a whole-program context is important. 
Post-link optimization is aimed at getting as much performance improvement as possible (it is usually applied on top of ThinLTO+PGO), so the framework is designed to enable it. </div><div><br></div><div>In particular, it allows the linker to stitch together hot basic block traces from different functions. It also allows hot trace duplication across procedure boundaries (a kind of interprocedural tail duplication). Besides, code alignment decisions to minimize branch mispredictions may require global context (e.g., two conflicting branches residing in two different functions). Other micro-architecture-specific optimizations to improve processor front-end throughput may also require global context.</div><div><br></div><div>It is conceivable to have an option to control the level of granularity at the possible cost of performance.</div><div><br></div><div>thanks,</div><div><br></div><div>David</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 2, 2019 at 6:18 PM Rafael Auler <<a href="mailto:rafaelauler@fb.com" target="_blank">rafaelauler@fb.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div lang="EN-US">
<div>
<p class="MsoNormal">You’re correct, except that, in Propeller, CFI duplication happens for every basic block as it operates with the conservative assumption that a block can be put anywhere by the linker. That’s a significant bloat that is not cleaned up later.
So, during link time, if N blocks from the same function are contiguous in the final layout, as it should happen most of the time for any sane BB order, we would have several FDEs for a region that only needs one. The bloat goes to the final binary (a lot
more FDEs, specifically, one FDE per basic block).<br>
<br>
BOLT will only split a function into two parts, and only if it has a profile. Most of the time, a function is not split. It also has an option not to split at all. For internally reordered basic blocks of a given function, it has CFI deduplication logic (it will
interpret and build the CFI states for each block and rewrite the CFIs in a way that uses the minimum number of instructions to encode the states for each block).<u></u><u></u></p>
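<p class="MsoNormal">To make the FDE count above concrete, here is a minimal Python sketch. It is only an illustrative model, not code from Propeller or BOLT: the function name <code>fde_counts</code> and the sample layout are hypothetical, and it ignores how the CFI program itself would have to be rewritten when FDEs are merged. It compares one FDE per basic block with one FDE per contiguous run of blocks that belong to the same function.</p>
<pre>
```python
from itertools import groupby


def fde_counts(layout):
    """Return (FDEs with one per block, FDEs with one per contiguous run).

    layout is a list of (function_name, block_id) pairs in final address order.
    """
    per_block = len(layout)
    # groupby only merges neighbours, so a function split into two
    # separate regions still counts as two runs.
    per_run = sum(1 for _ in groupby(layout, key=lambda entry: entry[0]))
    return per_block, per_run


# Hypothetical final layout: foo's cold block 2 is placed far from its hot blocks.
layout = [("foo", 0), ("foo", 1), ("foo", 3), ("bar", 0), ("bar", 1), ("foo", 2)]
per_block, per_run = fde_counts(layout)
print(per_block, per_run)  # prints "6 3"
```
</pre>
<p class="MsoNormal">In this toy layout, six per-block FDEs cover code that three regions would describe, which is the gap the paragraph above refers to.</p>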
<p class="MsoNormal"><u></u> <u></u></p>
<div style="border-right:none;border-bottom:none;border-left:none;border-top:1pt solid rgb(181,196,223);padding:3pt 0in 0in">
<p class="MsoNormal"><b><span style="font-size:12pt;color:black">From: </span></b><span style="font-size:12pt;color:black">llvm-dev <<a href="mailto:llvm-dev-bounces@lists.llvm.org" target="_blank">llvm-dev-bounces@lists.llvm.org</a>> on behalf of James Y Knight via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>><br>
<b>Reply-To: </b>James Y Knight <<a href="mailto:jyknight@google.com" target="_blank">jyknight@google.com</a>><br>
<b>Date: </b>Wednesday, October 2, 2019 at 1:59 PM<br>
<b>To: </b>Maksim Panchenko <<a href="mailto:maks@fb.com" target="_blank">maks@fb.com</a>><br>
<b>Cc: </b>"<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>" <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>><br>
<b>Subject: </b>Re: [llvm-dev] [RFC] Propeller: A frame work for Post Link Optimizations<u></u><u></u></span></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<div>
<p class="MsoNormal">I'm a bit confused by this subthread -- doesn't BOLT have the exact same CFI bloat issue? From my cursory reading of the Propeller doc, the CFI duplication is _necessary_ to represent discontiguous functions, not anything particular to
the way Propeller happens to generate those discontiguous functions.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">And emitting discontiguous functions is a fundamental goal of this, right?<u></u><u></u></p>
</div>
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<div>
<p class="MsoNormal">On Wed, Oct 2, 2019 at 4:25 PM Maksim Panchenko via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<u></u><u></u></p>
</div>
<blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0in 0in 0in 6pt;margin-left:4.8pt;margin-right:0in">
<div>
<div>
<p class="MsoNormal">Thanks for clarifying. This means once you move to the next basic block (or any other basic<u></u><u></u></p>
<p class="MsoNormal">block in the function) you have to execute an entirely new set of CFI instructions<u></u><u></u></p>
<p class="MsoNormal">except for the common CIE part. While indeed this is not as bad, on average, the overall<u></u><u></u></p>
<p class="MsoNormal">active memory footprint will increase.<u></u><u></u></p>
<p class="MsoNormal"> <u></u><u></u></p>
<p class="MsoNormal">Creating one FDE per basic block means that .eh_frame_hdr, an allocatable section,<u></u><u></u></p>
<p class="MsoNormal">will be bloated too. This will increase the FDE lookup time. I don’t see .eh_frame_hdr<u></u><u></u></p>
<p class="MsoNormal">being mentioned in the proposal.<u></u><u></u></p>
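<p class="MsoNormal">As a rough illustration of the lookup cost: .eh_frame_hdr carries a table sorted by start address that the unwinder binary-searches to find the FDE for a PC. The Python sketch below models that search under simplifying assumptions. The name <code>find_fde</code>, the addresses and the labels are hypothetical, and a real unwinder also checks the address range recorded in the FDE it finds.</p>
<pre>
```python
from bisect import bisect_right


def find_fde(table, pc):
    """Return the label of the last entry starting at or before pc, or None.

    table is a list of (start_address, fde_label) pairs sorted by start_address.
    """
    starts = [start for start, _ in table]
    index = bisect_right(starts, pc) - 1
    if index == -1:
        return None
    return table[index][1]


# Hypothetical tables for the same two functions.
per_function = [(0x1000, "foo"), (0x2000, "bar")]
per_block = [
    (0x1000, "foo.bb0"),
    (0x1040, "foo.bb1"),
    (0x2000, "bar.bb0"),
    (0x2030, "bar.bb1"),
]

print(find_fde(per_function, 0x1044), len(per_function))
print(find_fde(per_block, 0x1044), len(per_block))
```
</pre>
<p class="MsoNormal">The number of search steps grows only logarithmically with the table size, but the table has one entry per FDE, so one FDE per basic block makes this allocatable table proportionally larger.</p>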
<p class="MsoNormal"> <u></u><u></u></p>
<p class="MsoNormal">Maksim<u></u><u></u></p>
<p class="MsoNormal"> <u></u><u></u></p>
<div>
<div>
<p class="MsoNormal">On 10/2/19, 12:20 PM, "Krzysztof Pszeniczny" <<a href="mailto:kpszeniczny@google.com" target="_blank">kpszeniczny@google.com</a>> wrote:<u></u><u></u></p>
</div>
</div>
<div>
<p class="MsoNormal"> <u></u><u></u></p>
</div>
<div>
<div>
<p class="MsoNormal"> <u></u><u></u></p>
</div>
<p class="MsoNormal"> <u></u><u></u></p>
<div>
<div>
<p class="MsoNormal">On Wed, Oct 2, 2019 at 8:41 PM Maksim Panchenko via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<u></u><u></u></p>
</div>
<blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0in 0in 0in 6pt;margin:5pt 0in 5pt 4.8pt">
<p class="MsoNormal">*Pessimization/overhead for stack unwinding used by system-wide profilers and<br>
for exception handling*<br>
<br>
Larger CFI programs put an extra burden on unwinding at runtime as more CFI<br>
(and thus native) instructions have to be executed. This will cause more<br>
overhead for any profiler that records stack traces, and, as you correctly note<br>
in the proposal, for any program that heavily uses exceptions.<u></u><u></u></p>
</blockquote>
<div>
<p class="MsoNormal"> <u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">The number of CFI instructions that have to be executed when unwinding any given stack stays the same. The CFI instructions for a function have to be duplicated in every basic block
section, but when performing unwinding only one such set is executed -- the copy for the current basic block. However, this copy contains precisely the same CFI instructions as the ones that would have to be executed if there were no basic block sections.<u></u><u></u></p>
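<p class="MsoNormal">A small Python model of this argument, under stated assumptions: the CFI ops are made-up strings, blocks are taken in original function order, the PC is assumed to be at the end of a block, and each per-block FDE simply repeats every op needed to reach that block's state. The helper names are hypothetical and this is not how any real tool encodes CFI. It shows that the ops executed for a given block are identical in both schemes, while the total number of ops stored grows.</p>
<pre>
```python
def executed_with_single_fde(cfi_by_block, block_index):
    """CFI ops run for a PC at the end of block_index with one FDE per function."""
    return [op for ops in cfi_by_block[: block_index + 1] for op in ops]


def build_per_block_fdes(cfi_by_block):
    """One FDE per block; each repeats the ops needed to reach that block's state."""
    return [
        [op for ops in cfi_by_block[: index + 1] for op in ops]
        for index in range(len(cfi_by_block))
    ]


# Hypothetical CFI ops, listed per basic block in original function order.
cfi_by_block = [
    ["def_cfa_offset 16", "offset rbp at cfa-16"],
    ["def_cfa_register rbp"],
    [],
    ["def_cfa rsp, 8"],
]

per_block_fdes = build_per_block_fdes(cfi_by_block)
executed_single = executed_with_single_fde(cfi_by_block, 2)
executed_split = per_block_fdes[2]
stored_single = sum(len(ops) for ops in cfi_by_block)
stored_split = sum(len(fde) for fde in per_block_fdes)
print(len(executed_single), len(executed_split), stored_single, stored_split)
# prints "3 3 4 12"
```
</pre>
<p class="MsoNormal">In this model, unwinding from block 2 runs the same three ops either way; only the stored total changes, from 4 ops to 12.</p>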
</div>
<div>
<p class="MsoNormal"> <u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">--<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Krzysztof Pszeniczny<u></u><u></u></p>
</div>
</div>
</div>
</div>
</div>
<p class="MsoNormal">_______________________________________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
<a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=kx31RNFp5lAJejEYwuEQ4Zc5A6GakBit07EY08bIAvc&m=-AXqQmc2_r5LuTxyQRxmJESWGU7DLqvYjOlvwJnas_Q&s=h1mfecKZOhD5a1QaEabyI_nHKF81KAXoYRAgR0lNPvM&e=" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><u></u><u></u></p>
</blockquote>
</div>
</div>
</div>
</div>
</blockquote></div>
</blockquote></div></div>
</blockquote></div>