[llvm-dev] Making Clang/LLVM faster using code layout optimizations

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Fri Oct 19 14:30:17 PDT 2018


On Fri, Oct 19, 2018 at 2:09 PM Maksim Panchenko via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Thanks for sharing your findings, Sri. Implementing and benchmarking
> iterative PGO is particularly insightful, and I'm glad the techniques BOLT
> uses were able to achieve similar gains and more with function splitting.
>
> I like the idea of improving BOLT by utilizing information from the linker,
> and possible integration with the linker. I assume you mean lld. When
> we created BOLT, one of the design goals was to make it work with any
> compiler and any linker.


lld is certainly one of the main linkers we should add support for, but I
think we should make the design linker-independent (using common
interfaces) from the beginning.


> However, if we can get extra benefit in performance
> and/or usability from integrating and exploiting functionality of lld, I'm
> all
> for it. At the same time, keeping the standalone functionality will be
> useful
> in scenarios where re-linking (especially with LTO/ThinLTO) is undesirable
> or impossible. We should also think about platforms where lld is
> not (yet) available.
>
>
It is possible to make the PLO work in a standalone mode like BOLT, while
also operating in a mode that involves a PLO driver, various optimization
components, and the linker. The latter mode can also make heavy use of
compiler-generated annotations.

Sri and Han plan to send a proposal for discussion in the near future.

thanks,

David




> I'm confident that, with the right design approach, we can achieve both
> goals and make the optimizations part of lld while keeping the
> standalone tool.
>
> Running into scalability issues when optimizing 700MB of code is
> understandable. At the same time, saving memory at processing time wasn't
> our top priority, and I'm sure there's lots of room for improvement.
>
> Further optimization of register allocation sounds like a promising idea.
>
> Looking forward to your patches and the proposal.
>
> Maksim
>
> On 10/18/18, 2:20 PM, "Sriraman Tallam" <tmsriram at google.com> wrote:
>
>     Hi,
>
>
>       We, the compiler optimization team at Google, have been working for
>     the past year on further improving the performance of PGO binaries.
>     We have also evaluated Facebook’s BOLT in this context and we would
>     like to share our findings and thoughts.
>
>     TL;DR: Facebook’s BOLT, a Post Link Optimizer (PLO), is able to
>     improve the performance of our large, well-optimized (PGO + ThinLTO)
>     benchmark binaries, but suffers from scalability issues. We believe
>     that a PLO tool assisted by the linker would solve most of these
>     issues, and we are working on some proposals and prototypes towards
>     this.
>
>
>     Summary of our findings:
>
>     * While PGO gives huge performance speedups, there are many
>     instances where profile information is either inaccurate or absent
>     due to the various transformations applied. While the profile counts
>     are precise at the moment they are read, the compiler cannot
>     maintain their precision through subsequent transformations.
>
>     * For example, context sensitive profile information is lost as the
>     counters for collecting profiles are created before the regular
>     inliner is invoked.
>
>     * Optimization passes such as branch lowering, jump threading, and
>     loop transformations introduce additional control flow; the profiles
>     for these new branches are estimated and can be very inaccurate,
>     leading to non-optimal layout of basic blocks.
>
>     * Early this year, before we started evaluating BOLT, we worked on
>     addressing the problem of loss of profile precision from PGO.
>
>     * We have found that an iterative PGO solution, where the second
>     iteration has optimized, context-sensitive profiles, is able to
>     improve our key benchmarks by 2%; more details below.
>
>     * We recently evaluated BOLT, a post link optimizer (PLO), on the
>     same benchmarks.  BOLT works on profiles of the final PGO-optimized
>     binary, which capture all the context-sensitive and flow-sensitive
>     information, and hence it is not prone to the above issues.
>
>     * We found that BOLT gives us similar performance boosts, mainly due
>     to basic block reordering.  BOLT is additionally able to improve the
>     same benchmarks by 0.5% due to function splitting.
>
>     * However, BOLT suffers from scalability issues: ~45 GB of memory
>     and more than an hour to process a 700 MB binary. It also needs to
>     update debug info and, as a stand-alone tool, requires rigorous
>     testing.
>
>     * We believe that a PLO in which the linker does a large part of the
>     work is the right direction forward.
>
>     * The PLO would focus on disassembly and analysis, and would rely on
>     the linker’s assistance at appropriate points to read the binary and
>     write out the modified binary.
>
>     * We have some ideas in this space where the PLO can operate with the
>     assistance of the linker.  Towards this, we are working on a proposal
>     and prototype for basic block sections.
>
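[Editor's note: to make the basic block reordering discussed above concrete, here is a minimal sketch of greedy, profile-guided chain merging in the spirit of Pettis-Hansen. The block names and edge counts are invented, and this is not BOLT's actual cache+ implementation:]

```python
# Sketch of profile-guided basic block reordering via greedy chain
# merging (Pettis-Hansen style). NOT BOLT's cache+ algorithm; block
# names and edge weights below are made up for illustration.

def reorder_blocks(entry, edges):
    """edges: dict {(src, dst): count}. Returns a block layout that
    greedily turns the hottest CFG edges into fall-throughs."""
    # Start with every block in its own chain.
    blocks = {b for e in edges for b in e}
    chain_of = {b: [b] for b in blocks}
    # Consider edges from hottest to coldest.
    for (src, dst), _ in sorted(edges.items(), key=lambda kv: -kv[1]):
        c1, c2 = chain_of[src], chain_of[dst]
        # Merge only if src ends one chain and dst starts a different
        # one, so the hot edge becomes a fall-through.
        if c1 is not c2 and c1[-1] == src and c2[0] == dst:
            c1.extend(c2)
            for b in c2:
                chain_of[b] = c1
    # Emit the entry block's chain first, then the rest deterministically.
    layout, seen = [], set()
    for b in [entry] + sorted(blocks - {entry}):
        c = chain_of[b]
        if id(c) not in seen:
            seen.add(id(c))
            layout.extend(c)
    return layout

edges = {("A", "B"): 100, ("A", "C"): 1, ("B", "D"): 100, ("C", "D"): 1}
print(reorder_blocks("A", edges))  # hot path A -> B -> D laid out contiguously
```

The real algorithms additionally weigh I-cache and TLB locality, but the core idea, placing the hottest successor immediately after its predecessor, is the same.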
>
>     Iterative PGO experiment:
>
>     * Early this year, before we started evaluating BOLT, we worked on
>     addressing the problem of loss of profile precision from PGO.
>
>     * Instrumentation-based PGO is known to produce context-insensitive
>     profiles: after the inline transformation, the callee’s profile is
>     scaled uniformly based on the call-site count. Early inlining
>     partially mitigates the problem, but it is not enough.
>
>     * The idea here is to have another round of PGO instrumentation
>     after the inline transformation. This profile is only used by
>     post-inline optimizations, such as basic block layout.
>
>     * The key here is that the second-pass profile is precise in terms
>     of context sensitivity, and it does not affect inlining decisions.
>
>     * We compared BOLT with this approach and found that with basic block
>     reordering alone, this approach does as well as BOLT, if not slightly
>     better.
>
>     * We then used BOLT to further optimize this binary and obtained an
>     additional speedup of 0.5% from function splitting, which is not
>     available in the compiler today.  It should be noted that BOLT’s
>     basic block reordering did not find any additional opportunities for
>     performance improvement on this iteratively optimized PGO binary.
>
>     * Note that BOLT has other optimizations too but the performance
>     boosts we obtained were only due to BOLT’s basic block reordering
>     (cache+ algorithm) and function splitting.
>
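[Editor's note: the uniform-scaling problem described above can be illustrated with a toy example; all counts below are invented:]

```python
# Toy illustration (made-up numbers) of why uniformly scaling a callee's
# aggregate profile at each inlined call site loses context sensitivity.
# Suppose bar() has a branch whose behavior differs sharply by caller:
true_counts = {
    "foo1 -> bar": {"calls": 9000, "branch_taken": 9000},  # always taken
    "foo2 -> bar": {"calls": 1000, "branch_taken": 0},     # never taken
}

total_calls = sum(c["calls"] for c in true_counts.values())         # 10000
total_taken = sum(c["branch_taken"] for c in true_counts.values())  # 9000
aggregate_taken_ratio = total_taken / total_calls                   # 0.9

# What uniform scaling predicts for each inlined copy of bar():
for ctx, c in true_counts.items():
    scaled = c["calls"] * aggregate_taken_ratio
    print(f"{ctx}: scaled estimate {scaled:.0f} taken, actual {c['branch_taken']}")
```

The copy inlined into foo2 is predicted to take the branch 90% of the time, when in reality it never does, so the block layout for that copy is pessimized. A second, post-inline instrumentation round measures each inlined copy directly and avoids this.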
>
>     A case for a Post Link Optimizer (PLO) assisted by the linker:
>
>     * Our experiments with BOLT show that it is able to work on large
>     binaries, with some caveats and bug fixes, and does manage to
>     squeeze out performance improvements even from very well optimized
>     binaries.
>
>     * We believe that a Post Link Optimizer (PLO) like BOLT is very
>     useful, as it provides a framework for whole-program analysis at the
>     machine code level (using information not exposed at the IR level)
>     and enables whole-program transformations that require a global
>     ordering (e.g., interprocedural register allocation).
>
>     * We have in fact prototyped an interprocedural spill optimization in
>     BOLT and found non-trivial performance improvements in some of our
>     micro-benchmarks.
>
>     * However, we feel there are shortcomings of BOLT that need to be
>     addressed:
>
>       + Scalability: the time and memory currently required are not
>     within acceptable limits for widespread deployment.
>
>       + Debug info: BOLT has to fix debug info after its
>     transformations; it does a good job of this, but more needs to be
>     done here.
>
>       + QA: it is an independent tool and needs rigorous testing.
>
>     * We are looking at improving the scalability of BOLT, in fact, we
>     have been able to reduce the profile conversion time from 40 minutes
>     to 2 minutes, but there is still a lot to be done.
>
>     * We believe that many parts of the PLO’s work can be done with the
>     linker’s assistance, which would mitigate some of the above
>     shortcomings.
>
>     * For instance, we are working on a proposal and prototype for basic
>     block sections which can let the PLO do basic block reordering and
>     function splitting in a scalable manner with the linker’s help.  We
>     will share our proposal for this shortly.
>
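[Editor's note: as a rough illustration of how basic block sections could let the linker do reordering and function splitting, here is a sketch that derives a section order from per-block execution counts. The section naming scheme and the hotness threshold are hypothetical, not part of the actual proposal:]

```python
# Sketch of linker-assisted function splitting once each basic block
# lives in its own section. Section names and the threshold below are
# hypothetical illustrations, not the real proposal.

def section_order(block_counts, threshold=1):
    """block_counts: {section_name: execution_count}. Returns section
    names with hot blocks first and all cold blocks grouped at the end,
    so cold code lands far from the hot working set."""
    hot = [s for s, c in block_counts.items() if c >= threshold]
    cold = [s for s, c in block_counts.items() if c < threshold]
    # Hottest first within the hot region; keep cold order stable.
    hot.sort(key=lambda s: -block_counts[s])
    return hot + cold

counts = {
    ".text.foo.bb0": 1000,  # entry, hot
    ".text.foo.bb1": 990,   # hot loop body
    ".text.foo.bb2": 0,     # error-handling path, cold
    ".text.bar.bb0": 500,
    ".text.bar.bb1": 0,
}
print("\n".join(section_order(counts)))
```

An ordering like this could, in principle, be handed to the linker (e.g., via a mechanism along the lines of lld's --symbol-ordering-file), so the expensive disassembly and rewriting that BOLT performs today would shrink to computing an order and letting the linker lay the sections out.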
>
>     Thanks,
>     Sri
>
>
>
>
>
>     On Tue, Oct 16, 2018 at 11:45 AM, Maksim Panchenko via llvm-dev
>     <llvm-dev at lists.llvm.org> wrote:
>     >
>     > We’d like to share how Clang could be made up to 15% faster using
>     > code layout optimizations implemented in BOLT, our open-source
>     > LLVM-based post-link optimizer. These improvements are achieved on
>     > top of PGO and LTO. We thought the results and the practical value
>     > would be interesting to the LLVM community. The detailed
>     > step-by-step guide to optimizing Clang is available at:
>     >
>     >
> https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md
>     >
>     > While we think that there is room for improvement in LLVM’s code
>     > layout algorithm, we strongly believe that, to achieve the maximum
>     > benefits for large applications, a post-link optimizer is needed.
>     > We would like to know what people think about including BOLT in
>     > the LLVM family of tools.
>     >
>     > More details about BOLT are available in our arXiv.org paper at:
> https://arxiv.org/pdf/1807.06735.pdf
>     >
>     > Maksim & Rafael
>     >
>     >
>     > _______________________________________________
>     > LLVM Developers mailing list
>     > llvm-dev at lists.llvm.org
>     >
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>     >
>
>
>

