<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Fri, Oct 19, 2018 at 2:09 PM Maksim Panchenko via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Thanks for sharing your findings, Sri. Implementing and benchmarking<br>
iterative PGO is particularly insightful, and I'm glad the techniques BOLT<br>
uses were able to achieve similar gains and more with function splitting.<br>
<br>
I like the idea of improving BOLT by utilizing information from the linker,<br>
and possible integration with the linker. I assume you mean lld. When<br>
we created BOLT, one of the design goals was to make it work with any<br>
compiler and any linker. </blockquote><div><br></div><div>lld is certainly one of the main linkers we should add support for, but I think we should make the design independent of any particular linker (using common interfaces) from the beginning.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">However, if we can get extra benefit in performance<br>
and/or usability from integrating and exploiting functionality of lld, I'm all<br>
for it. At the same time, keeping the standalone functionality will be useful<br>
in scenarios where re-linking (especially with LTO/ThinLTO) is undesirable<br>
or impossible. We should also think about platforms where lld is<br>
not (yet) available.<br>
<br></blockquote><div><br></div><div>It is possible to make a PLO work in a standalone mode like BOLT while also operating in a mode that involves a PLO driver, various optimization components, and the linker. The latter mode can also make heavy use of compiler-generated annotations. </div><div><br></div><div>Sri and Han plan to send a proposal for discussion in the near future.</div><div><br></div><div>thanks,</div><div><br></div><div>David</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I'm confident that, with the right design approach, we can achieve both goals,<br>
and make the optimizations part of lld while keeping the standalone tool.<br>
<br>
Running into scalability issues when optimizing 700MB of code is<br>
understandable. At the same time, saving memory at processing time wasn't<br>
our top priority, and I'm sure there's lots of room for improvement.<br>
<br>
Further optimization of register allocation sounds like a promising idea.<br>
<br>
Looking forward to your patches and the proposal.<br>
<br>
Maksim<br>
<br>
On 10/18/18, 2:20 PM, "Sriraman Tallam" <<a href="mailto:tmsriram@google.com" target="_blank">tmsriram@google.com</a>> wrote:<br>
<br>
Hi,<br>
<br>
<br>
We, the compiler optimization team at Google, have been working for<br>
the past year on further improving the performance of PGO binaries.<br>
We have also evaluated Facebook’s BOLT in this context and we would<br>
like to share our findings and thoughts.<br>
<br>
TL;DR: Facebook’s BOLT, a Post Link Optimizer (PLO), is able to<br>
improve the performance of our large, well-optimized (PGO + ThinLTO)<br>
benchmark binaries but suffers from scalability issues. We believe<br>
that a PLO tool assisted by the linker would solve most of these<br>
issues, and we are working on proposals and prototypes towards<br>
this.<br>
<br>
<br>
Summary of our findings:<br>
<br>
* While PGO gives huge performance speedups, there are lots of<br>
instances where profile information is either inaccurate or absent due<br>
to the various transformations applied. While the profile counts are<br>
precise at the moment they are read, the compiler cannot maintain the<br>
precision of the profiles after some transformations.<br>
<br>
* For example, context-sensitive profile information is lost because<br>
the counters for collecting profiles are created before the regular<br>
inliner is invoked.<br>
<br>
* Optimization passes such as branch lowering, jump threading, and loop<br>
transformations introduce additional control flow; the profiles for<br>
these branches are estimated and can be very inaccurate, leading to<br>
non-optimal basic block layout.<br>
<br>
* Early this year, before we started evaluating BOLT, we worked on<br>
addressing the problem of loss of profile precision from PGO.<br>
<br>
* We have found that an iterative PGO solution, where the second<br>
iteration has optimized, context-sensitive profiles, is able to improve<br>
our key benchmarks by 2%; more details below.<br>
<br>
* We recently evaluated BOLT, a post-link optimizer (PLO), on the same<br>
benchmarks. BOLT works on profiles of the final PGO-optimized<br>
binary, which capture all the context-sensitive and flow-sensitive<br>
information, and hence it is not prone to the above issues.<br>
<br>
* We found that BOLT gives us similar performance boosts, mainly due<br>
to basic block reordering. BOLT additionally is able to improve the<br>
same benchmarks by 0.5% due to function splitting.<br>
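To make the function-splitting win concrete, here is a minimal sketch (not BOLT's actual code; block names, counts, and the threshold are made up for illustration): blocks that almost never execute are moved to a separate cold region so the hot path packs densely in the instruction cache.

```python
# Hedged sketch of profile-guided hot/cold function splitting.
# Blocks whose execution count falls below a small fraction of the
# function's hottest block are moved to a separate "cold" region.

def split_function(blocks, cold_fraction=0.01):
    """blocks: list of (name, exec_count) in original layout order.
    Returns (hot_layout, cold_layout)."""
    peak = max(count for _, count in blocks)
    threshold = peak * cold_fraction
    hot = [name for name, count in blocks if count >= threshold]
    cold = [name for name, count in blocks if count < threshold]
    return hot, cold

# Hypothetical profile: error handling almost never runs.
blocks = [("entry", 1000), ("fastpath", 990), ("error_handler", 0),
          ("cleanup", 2), ("exit", 1000)]
hot, cold = split_function(blocks)
# hot  -> ['entry', 'fastpath', 'exit']
# cold -> ['error_handler', 'cleanup']
```

The real optimization additionally has to fix up branches between the two regions and keep unwind/debug info consistent, which is where much of BOLT's complexity lies.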
<br>
* However, BOLT suffers from scalability issues: ~45GB of memory and<br>
more than an hour to process a 700MB binary. It also needs to update<br>
debug info and requires rigorous testing as it is a stand-alone tool.<br>
<br>
* We believe that a PLO that is assisted by the linker to do a large<br>
part of the work is the right direction forward.<br>
<br>
* The PLO would focus on disassembly and analysis and would use the<br>
linker’s assistance at appropriate points to read the binary and write<br>
the modified binary.<br>
<br>
* We have some ideas in this space where the PLO can operate with the<br>
assistance of the linker. Towards this, we are working on a proposal<br>
and prototype for basic block sections.<br>
<br>
<br>
Iterative PGO experiment:<br>
<br>
* Early this year, before we started evaluating BOLT, we worked on<br>
addressing the problem of loss of profile precision from PGO.<br>
<br>
* Instrumentation-based PGO is known to produce context-insensitive<br>
profiles. After the inline transformation, profiles are scaled<br>
uniformly based on the call-site count. Early inlining partially<br>
mitigates the problem, but it’s not enough.<br>
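A tiny numeric illustration of why uniform scaling loses context (the callers, counts, and branch rates below are invented): a single counter in a shared function aggregates behavior across all callers, so after inlining, every copy inherits the same averaged ratio.

```python
# Hedged illustration: one counter in f() records its branch across
# ALL callers, so the per-caller behavior is averaged away.

calls_from_a, taken_from_a = 100, 90   # f's branch is hot when called from a()
calls_from_b, taken_from_b = 100, 10   # ...and cold when called from b()
total_calls = calls_from_a + calls_from_b
total_taken = taken_from_a + taken_from_b

# After inlining f into a() and b(), the compiler can only scale the
# aggregate taken-ratio by each call-site count:
scaled_taken_in_a = total_taken / total_calls * calls_from_a  # 50.0
scaled_taken_in_b = total_taken / total_calls * calls_from_b  # 50.0

# Both inlined copies now look 50% taken, though the true rates are
# 90% and 10%. A second round of instrumentation placed after inlining
# gives each inlined copy its own counter and recovers the 90/10 split.
```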
<br>
* The idea here is to have another round of PGO instrumentation after<br>
the inline transformation. This profile is only used by post-inline<br>
optimizations, such as basic block layout.<br>
<br>
* The key here is that the second pass profile is precise in terms of<br>
context sensitivity and it will not affect the inlining decisions.<br>
<br>
* We compared BOLT with this approach and found that with basic block<br>
reordering alone, this approach does as well as BOLT, if not slightly<br>
better.<br>
<br>
* We then used BOLT to further optimize this binary and obtained an<br>
additional speedup of 0.5%, because of function splitting which is not<br>
available in the compiler today. It should be noted that BOLT’s basic<br>
block reordering did not find any additional opportunities for<br>
performance improvement on this iteratively optimized PGO binary.<br>
<br>
* Note that BOLT has other optimizations too but the performance<br>
boosts we obtained were only due to BOLT’s basic block reordering<br>
(cache+ algorithm) and function splitting.<br>
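For readers unfamiliar with profile-guided block layout: BOLT's cache+ algorithm is considerably more sophisticated (it models instruction-cache behavior directly), but a minimal sketch in the classic Pettis-Hansen "greedy chain merging" spirit conveys the core idea of making the hottest branches fall through. All block names and edge counts below are made up.

```python
# Hedged sketch of greedy profile-guided basic block reordering:
# process CFG edges from hottest to coldest, gluing chains together
# whenever the edge can become a fall-through.

def reorder_blocks(blocks, edges):
    """blocks: list of block names; edges: dict (src, dst) -> count.
    Returns one possible layout as a list of block names."""
    chain_of = {b: [b] for b in blocks}   # block -> its chain (shared list)
    for (src, dst), _count in sorted(edges.items(), key=lambda kv: -kv[1]):
        a, b = chain_of[src], chain_of[dst]
        # Merge only if src ends one chain and dst starts a different one:
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a
    # Emit chains in first-seen order, skipping already-placed blocks:
    layout, seen = [], set()
    for blk in blocks:
        for member in chain_of[blk]:
            if member not in seen:
                seen.add(member)
                layout.append(member)
    return layout

edges = {("entry", "hot"): 990, ("entry", "cold"): 10,
         ("hot", "exit"): 990, ("cold", "exit"): 10}
print(reorder_blocks(["entry", "cold", "hot", "exit"], edges))
# ['entry', 'hot', 'exit', 'cold']
```

The hot path entry→hot→exit becomes straight-line code; the cold block sinks to the end, which is also the natural seed for function splitting.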
<br>
<br>
A case for a Post Link Optimizer (PLO) assisted by the linker:<br>
<br>
* Our experiments with BOLT show that it is able to work on large<br>
binaries, with some caveats and bug fixes, and manages to squeeze out<br>
performance improvements even on very well optimized binaries.<br>
<br>
* We believe that a Post Link Optimizer (PLO) like BOLT is very useful<br>
as it provides a framework to do whole-program analysis at the machine<br>
code level (on information not exposed at the IR level) and enables<br>
whole-program transformations that require a global ordering (e.g.,<br>
interprocedural register allocation).<br>
<br>
* We have in fact prototyped an interprocedural spill optimization in<br>
BOLT and found non-trivial performance improvements in some of our<br>
micro-benchmarks.<br>
<br>
* However, we feel there are shortcomings of BOLT that need to be addressed:<br>
<br>
+ Scalability: The time and memory currently required are not within<br>
acceptable limits for widespread deployment.<br>
<br>
+ Debug Info: BOLT requires fixing debug info after its<br>
transformations; it does a good job of this, but more needs to be done<br>
here.<br>
<br>
+ QA: It is an independent tool and needs rigorous testing.<br>
<br>
* We are looking at improving the scalability of BOLT; in fact, we<br>
have been able to reduce the profile conversion time from 40 minutes<br>
to 2 minutes, but there is still a lot to be done.<br>
<br>
* We believe that many parts of the PLO’s work can be done with the<br>
linker’s assistance, which can mitigate some of the above shortcomings.<br>
<br>
* For instance, we are working on a proposal and prototype for basic<br>
block sections which can let the PLO do basic block reordering and<br>
function splitting in a scalable manner with the linker’s help. We<br>
will share our proposal for this shortly.<br>
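Since the full proposal is not yet published, the division of labor can be sketched with a toy model (all section and symbol names below are hypothetical): the compiler emits each basic block as its own section, the PLO analyzes profiles and produces an ordering, and the linker merely places sections in that order, with unlisted (cold) sections sinking to the end.

```python
# Hedged toy model of linker-assisted block layout. The real mechanism
# would be ELF sections plus a linker ordering interface; this stand-in
# only shows how the PLO's decision and the linker's placement separate.

def link_layout(sections, order):
    """sections: per-basic-block section names as emitted by the compiler.
    order: hot-path ordering decided by the PLO.
    Sections not mentioned in `order` sink to the end (the cold part)."""
    listed = [s for s in order if s in sections]
    unlisted = [s for s in sections if s not in order]  # cold blocks
    return listed + unlisted

sections = [".text.foo.entry", ".text.foo.cold_err",
            ".text.foo.hot_loop", ".text.foo.ret"]
order = [".text.foo.entry", ".text.foo.hot_loop", ".text.foo.ret"]
print(link_layout(sections, order))
# ['.text.foo.entry', '.text.foo.hot_loop', '.text.foo.ret', '.text.foo.cold_err']
```

Because the linker already relocates and writes sections as part of its normal job, the PLO avoids re-doing that work itself, which is where much of the memory and time cost of a stand-alone rewriter goes.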
<br>
<br>
Thanks,<br>
Sri<br>
<br>
<br>
<br>
<br>
<br>
On Tue, Oct 16, 2018 at 11:45 AM, Maksim Panchenko via llvm-dev<br>
<<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br>
><br>
> We’d like to share how Clang could be made up to 15% faster using code layout optimizations implemented in BOLT – our open-source LLVM-based post-link optimizer. These improvements are achieved on top of PGO and LTO. We thought the results and the practical value will be interesting to the LLVM community. The detailed step-by-step guide of optimizing Clang is available at:<br>
><br>
> <a href="https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md" rel="noreferrer" target="_blank">https://github.com/facebookincubator/BOLT/blob/master/docs/OptimizingClang.md</a><br>
><br>
> While we think that there is room for improvement in LLVM’s code layout algorithm, we strongly believe that, to achieve the maximum benefits for large applications, a post-link optimizer is needed. We would like to know what people think about including BOLT into LLVM family of tools.<br>
><br>
> More details about BOLT are available in our arXiv.org paper at: <a href="https://arxiv.org/pdf/1807.06735.pdf" rel="noreferrer" target="_blank">https://arxiv.org/pdf/1807.06735.pdf</a><br>
><br>
> Maksim & Rafael<br>
><br>
><br>
> _______________________________________________<br>
> LLVM Developers mailing list<br>
> <a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
><br>
<br>
<br>
</blockquote></div></div>