<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Fri, Oct 19, 2018 at 2:09 PM Maksim Panchenko via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Thanks for sharing your findings, Sri. Implementing and benchmarking<br>
iterative PGO is particularly insightful, and I'm glad the techniques BOLT<br>
uses were able to achieve similar gains and more with function splitting.<br>
<br>
I like the idea of improving BOLT by utilizing information from the linker,<br>
and possible integration with the linker. I assume you mean lld. When<br>
we created BOLT, one of the design goals was to make it work with any<br>
compiler and any linker. </blockquote><div><br></div><div>lld is certainly one of the main linkers we should add support for, but I think we should make the design independent of any particular linker (using common interfaces) from the beginning.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">However, if we can get extra benefit in performance<br>
and/or usability from integrating and exploiting functionality of lld, I'm all<br>
for it. At the same time, keeping the standalone functionality will be useful<br>
in scenarios where re-linking (especially with LTO/ThinLTO) is undesirable<br>
or impossible. We should also think about platforms where lld is<br>
not (yet) available.<br>
<br></blockquote><div><br></div><div>It is possible to make a PLO work in a standalone mode like BOLT while also operating in a mode that involves a PLO driver, various optimization components, and the linker. The latter mode can also make heavy use of compiler-generated annotations. </div><div><br></div><div>Sri and Han plan to send a proposal for discussion in the near future.</div><div><br></div><div>thanks,</div><div><br></div><div>David</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I'm confident that, with the right design approach, we can achieve both goals,<br>
and make the optimizations part of lld while keeping the standalone tool.<br>
<br>
Running into scalability issues when optimizing 700MB of code is<br>
understandable. At the same time, saving memory at processing time wasn't<br>
our top priority, and I'm sure there's lots of room for improvement.<br>
<br>
Further optimization of register allocation sounds like a promising idea.<br>
<br>
Looking forward to your patches and the proposal.<br>
<br>
Maksim<br>
<br>
On 10/18/18, 2:20 PM, "Sriraman Tallam" <<a href="mailto:tmsriram@google.com" target="_blank">tmsriram@google.com</a>> wrote:<br>
<br>
Hi,<br>
<br>
<br>
We, the compiler optimization team at Google, have been working for<br>
the past year on further improving the performance of PGO binaries.<br>
We have also evaluated Facebook’s BOLT in this context and we would<br>
like to share our findings and thoughts.<br>
<br>
TL;DR: Facebook’s BOLT, a Post Link Optimizer (PLO), is able to<br>
improve the performance of our large, well-optimized (PGO + ThinLTO)<br>
benchmark binaries but suffers from scalability issues. We believe<br>
that a PLO tool assisted by the linker would solve most of these<br>
issues, and we are working on proposals and prototypes towards<br>
this.<br>
<br>
<br>
Summary of our findings:<br>
<br>
* While PGO gives huge performance speedups, there are lots of<br>
instances where profile information is either inaccurate or absent due<br>
to the various transformations applied. While the profile counts are<br>
precise at the moment they are read, the compiler cannot maintain the<br>
precision of the profiles after some transformations.<br>
<br>
* For example, context-sensitive profile information is lost because<br>
the counters for collecting profiles are created before the regular<br>
inliner is invoked.<br>
<br>
* Optimization passes such as branch lowering, jump threading, and loop<br>
transformations introduce additional control flow; the profiles for<br>
these branches are estimated and can be very inaccurate, leading to<br>
non-optimal basic block layout.<br>
<br>
* Early this year, before we started evaluating BOLT, we worked on<br>
addressing the problem of loss of profile precision from PGO.<br>
<br>
* We have found that an iterative PGO solution, where the second<br>
iteration has optimized, context-sensitive profiles, is able to improve<br>
our key benchmarks by 2%; more details below.<br>
<br>
* We recently evaluated BOLT, a post-link optimizer (PLO), on the same<br>
benchmarks. BOLT works on profiles of the final PGO-optimized<br>
binary, which capture all the context-sensitive and flow-sensitive<br>
information, and hence it is not prone to the above issues.<br>
<br>
* We found that BOLT gives us similar performance boosts, mainly due<br>
to basic block reordering. BOLT additionally is able to improve the<br>
same benchmarks by 0.5% due to function splitting.<br>
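To make the function-splitting win concrete, here is a minimal sketch (not BOLT's actual code; block names, counts, and the threshold are made up for illustration): blocks that almost never execute are moved to a separate cold region so the hot path packs densely in the instruction cache.

```python
# Hedged sketch of profile-guided hot/cold function splitting.
# Blocks whose execution count falls below a small fraction of the
# function's hottest block are moved to a separate "cold" region.

def split_function(blocks, cold_fraction=0.01):
    """blocks: list of (name, exec_count) in original layout order.
    Returns (hot_layout, cold_layout)."""
    peak = max(count for _, count in blocks)
    threshold = peak * cold_fraction
    hot = [name for name, count in blocks if count >= threshold]
    cold = [name for name, count in blocks if count < threshold]
    return hot, cold

# Hypothetical profile: error handling almost never runs.
blocks = [("entry", 1000), ("fastpath", 990), ("error_handler", 0),
          ("cleanup", 2), ("exit", 1000)]
hot, cold = split_function(blocks)
# hot  -> ['entry', 'fastpath', 'exit']
# cold -> ['error_handler', 'cleanup']
```

The real optimization additionally has to fix up branches between the two regions and keep unwind/debug info consistent, which is where much of BOLT's complexity lies.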
<br>
* However, BOLT suffers from scalability issues: ~45GB of memory and<br>
more than an hour to process a 700MB binary. It also needs to update<br>
debug info and requires rigorous testing as it is a stand-alone tool.<br>
<br>
* We believe that a PLO that is assisted by the linker to do a large<br>
part of the work is the right direction forward.<br>
<br>
* The PLO would focus on disassembly and analysis and would use the<br>
linker’s assistance at appropriate points to read the binary and write<br>
the modified binary.<br>
<br>
* We have some ideas in this space where the PLO can operate with the<br>
assistance of the linker. Towards this, we are working on a proposal<br>
and prototype for basic block sections.<br>
<br>
<br>
Iterative PGO experiment:<br>
<br>
* Early this year, before we started evaluating BOLT, we worked on<br>
addressing the problem of loss of profile precision from PGO.<br>
<br>
* Instrumentation-based PGO is known to produce context-insensitive<br>
profiles. After the inline transformation, profiles are scaled<br>
uniformly based on the call-site count. Early inlining partially<br>
mitigates the problem, but it’s not enough.<br>
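A tiny numeric illustration of why uniform scaling loses context (the callers, counts, and branch rates below are invented): a single counter in a shared function aggregates behavior across all callers, so after inlining, every copy inherits the same averaged ratio.

```python
# Hedged illustration: one counter in f() records its branch across
# ALL callers, so the per-caller behavior is averaged away.

calls_from_a, taken_from_a = 100, 90   # f's branch is hot when called from a()
calls_from_b, taken_from_b = 100, 10   # ...and cold when called from b()
total_calls = calls_from_a + calls_from_b
total_taken = taken_from_a + taken_from_b

# After inlining f into a() and b(), the compiler can only scale the
# aggregate taken-ratio by each call-site count:
scaled_taken_in_a = total_taken / total_calls * calls_from_a  # 50.0
scaled_taken_in_b = total_taken / total_calls * calls_from_b  # 50.0

# Both inlined copies now look 50% taken, though the true rates are
# 90% and 10%. A second round of instrumentation placed after inlining
# gives each inlined copy its own counter and recovers the 90/10 split.
```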
<br>
* The idea here is to have another round of PGO instrumentation after<br>
the inline transformation. This profile is only used by post-inline<br>
optimizations, such as basic block layout.<br>
<br>
* The key here is that the second pass profile is precise in terms of<br>
context sensitivity and it will not affect the inlining decisions.<br>
<br>
* We compared BOLT with this approach and found that with basic block<br>
reordering alone, this approach does as well as BOLT, if not slightly<br>
better.<br>
<br>
* We then used BOLT to further optimize this binary and obtained an<br>
additional speedup of 0.5%, because of function splitting which is not<br>
available in the compiler today. It should be noted that BOLT’s basic<br>
block reordering did not find any additional opportunities for<br>
performance improvement on this iteratively optimized PGO binary.<br>
<br>
* Note that BOLT has other optimizations too but the performance<br>
boosts we obtained were only due to BOLT’s basic block reordering<br>
(cache+ algorithm) and function splitting.<br>
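For readers unfamiliar with profile-guided block layout: BOLT's cache+ algorithm is considerably more sophisticated (it models instruction-cache behavior directly), but a minimal sketch in the classic Pettis-Hansen "greedy chain merging" spirit conveys the core idea of making the hottest branches fall through. All block names and edge counts below are made up.

```python
# Hedged sketch of greedy profile-guided basic block reordering:
# process CFG edges from hottest to coldest, gluing chains together
# whenever the edge can become a fall-through.

def reorder_blocks(blocks, edges):
    """blocks: list of block names; edges: dict (src, dst) -> count.
    Returns one possible layout as a list of block names."""
    chain_of = {b: [b] for b in blocks}   # block -> its chain (shared list)
    for (src, dst), _count in sorted(edges.items(), key=lambda kv: -kv[1]):
        a, b = chain_of[src], chain_of[dst]
        # Merge only if src ends one chain and dst starts a different one:
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a
    # Emit chains in first-seen order, skipping already-placed blocks:
    layout, seen = [], set()
    for blk in blocks:
        for member in chain_of[blk]:
            if member not in seen:
                seen.add(member)
                layout.append(member)
    return layout

edges = {("entry", "hot"): 990, ("entry", "cold"): 10,
         ("hot", "exit"): 990, ("cold", "exit"): 10}
print(reorder_blocks(["entry", "cold", "hot", "exit"], edges))
# ['entry', 'hot', 'exit', 'cold']
```

The hot path entry→hot→exit becomes straight-line code; the cold block sinks to the end, which is also the natural seed for function splitting.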
<br>
<br>
A case for a Post Link Optimizer (PLO) assisted by the linker:<br>
<br>
* Our experiments with BOLT show that it is able to work on large<br>
binaries, with some caveats and bug fixes, and manages to squeeze out<br>
performance improvements even on very well optimized binaries.<br>
<br>
* We believe that a Post Link Optimizer (PLO) like BOLT is very useful<br>
as it provides a framework to do whole-program analysis at the machine<br>
code level (on information not exposed at the IR level) and enables<br>
whole-program transformations that require a global ordering (e.g.,<br>
interprocedural register allocation).<br>
<br>
* We have in fact prototyped an interprocedural spill optimization in<br>
BOLT and found non-trivial performance improvements in some of our<br>
micro-benchmarks.<br>
<br>
* However, we feel there are shortcomings of BOLT that need to be addressed:<br>
<br>
+ Scalability: The time and memory currently required are not within<br>
acceptable limits for widespread deployment.<br>
<br>
+ Debug Info: BOLT requires fixing debug info after its<br>
transformations; it does a good job of this, but more needs to be done<br>
here.<br>
<br>
+ QA: It is an independent tool and needs rigorous testing.<br>
<br>
* We are looking at improving the scalability of BOLT; in fact, we<br>
have been able to reduce the profile conversion time from 40 minutes<br>
to 2 minutes, but there is still a lot to be done.<br>
<br>
* We believe that many parts of the PLO’s work can be done with the<br>
linker’s assistance, which can mitigate some of the above shortcomings.<br>
<br>
* For instance, we are working on a proposal and prototype for basic<br>
block sections which can let the PLO do basic block reordering and<br>
function splitting in a scalable manner with the linker’s help. We<br>
will share our proposal for this shortly.<br>
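Since the full proposal is not yet published, the division of labor can be sketched with a toy model (all section and symbol names below are hypothetical): the compiler emits each basic block as its own section, the PLO analyzes profiles and produces an ordering, and the linker merely places sections in that order, with unlisted (cold) sections sinking to the end.

```python
# Hedged toy model of linker-assisted block layout. The real mechanism
# would be ELF sections plus a linker ordering interface; this stand-in
# only shows how the PLO's decision and the linker's placement separate.

def link_layout(sections, order):
    """sections: per-basic-block section names as emitted by the compiler.
    order: hot-path ordering decided by the PLO.
    Sections not mentioned in `order` sink to the end (the cold part)."""
    listed = [s for s in order if s in sections]
    unlisted = [s for s in sections if s not in order]  # cold blocks
    return listed + unlisted

sections = [".text.foo.entry", ".text.foo.cold_err",
            ".text.foo.hot_loop", ".text.foo.ret"]
order = [".text.foo.entry", ".text.foo.hot_loop", ".text.foo.ret"]
print(link_layout(sections, order))
# ['.text.foo.entry', '.text.foo.hot_loop', '.text.foo.ret', '.text.foo.cold_err']
```

Because the linker already relocates and writes sections as part of its normal job, the PLO avoids re-doing that work itself, which is where much of the memory and time cost of a stand-alone rewriter goes.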
<br>
<br>
Thanks,<br>
Sri<br>
<br>
<br>
<br>
<br>
<br>
On Tue, Oct 16, 2018 at 11:45 AM, Maksim Panchenko via llvm-dev<br>
<<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br>
><br>
> We’d like to share how Clang could be made up to 15% faster using code layout optimizations implemented in BOLT – our open-source LLVM-based post-link optimizer. These improvements are achieved on top of PGO and LTO. We thought the results and the practical value will be interesting to the LLVM community. The detailed step-by-step guide of optimizing Clang is available at:<br>
><br>
> <a href="https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md" rel="noreferrer" target="_blank">https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md</a><br>
><br>
> While we think that there is room for improvement in LLVM’s code layout algorithm, we strongly believe that, to achieve the maximum benefits for large applications, a post-link optimizer is needed. We would like to know what people think about including BOLT into LLVM family of tools.<br>
><br>
> More details about BOLT are available in our arXiv.org paper at: <a href="https://arxiv.org/pdf/1807.06735.pdf" rel="noreferrer" target="_blank">https://arxiv.org/pdf/1807.06735.pdf</a><br>
><br>
> Maksim & Rafael<br>
><br>
><br>
> _______________________________________________<br>
> LLVM Developers mailing list<br>
> <a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
><br>
<br>
<br>
</blockquote></div></div>