<div dir="ltr">(Reposting this reply from my personal account -- the previous reply to the list got held up.)<div><br></div><div><span class="im" style="font-family:monospace">On Thu, Dec 25, 2014 at 11:55 PM, Adve, Vikram Sadanand<br><<a href="mailto:vadve@illinois.edu">vadve@illinois.edu</a>> wrote:<br></span><span class="im" style="font-family:monospace">> Diego, Teresa, David,<br>><br>> Sorry for my delayed reply; I left for vacation right after sending my message about this.<br>><br>> Diego, it wasn't explicit from your message whether LLVM LTO can handle Firefox-scale programs, which you said GCC can handle.  I assumed that's what you meant, but could you confirm that?  I understand that neither can handle the very large Google applications, but that's probably not a near-term concern for a project like the one Charles is embarking on.<br><br></span><span style="font-family:monospace">Vikram, LLVM can handle Firefox-size programs. Honza wrote two good</span><br style="font-family:monospace"><span style="font-family:monospace">articles about LTO.</span><br style="font-family:monospace"><br style="font-family:monospace"><a href="http://hubicka.blogspot.com/2014/04/linktime-optimization-in-gcc-1-brief.html" target="_blank" style="font-family:monospace">http://hubicka.blogspot.com/2014/04/linktime-optimization-in-gcc-1-brief.html</a><br style="font-family:monospace"><a href="http://hubicka.blogspot.com/2014/04/linktime-optimization-in-gcc-2-firefox.html" target="_blank" style="font-family:monospace">http://hubicka.blogspot.com/2014/04/linktime-optimization-in-gcc-2-firefox.html</a><br style="font-family:monospace"><br style="font-family:monospace"><span style="font-family:monospace">The comparison with LLVM is described in the second article. It took about</span><br style="font-family:monospace"><span style="font-family:monospace">40 min to finish building Firefox with LLVM using LTO and -g. 
The</span><br style="font-family:monospace"><span style="font-family:monospace">following is a quote:</span><br style="font-family:monospace"><br style="font-family:monospace"><span style="font-family:monospace">"This graph shows issues with debug info memory use. LLVM goes up to</span><br style="font-family:monospace"><span style="font-family:monospace">35GB. LLVM developers are also working on debug info merging</span><br style="font-family:monospace"><span style="font-family:monospace">improvements (equivalent to what GCC's type merging is) and the</span><br style="font-family:monospace"><span style="font-family:monospace">situation has improved in last two releases until the current shape.</span><br style="font-family:monospace"><span style="font-family:monospace">Older LLVM checkouts happily run out of 60GB memory & 60GB swap on my</span><br style="font-family:monospace"><span style="font-family:monospace">machine."</span><br style="font-family:monospace"><span class="im" style="font-family:monospace"><br>><br>> I'd be interested to hear more about the LTO design you folks are working on, whenever you're ready to share the details.<br><br></span><span style="font-family:monospace">We will share the details as soon as we can -- possibly some time in Jan 2015.</span><br style="font-family:monospace"><span class="im" style="font-family:monospace"><br>> I read the GCC design docs on LTO, and I'm curious how similar or different your approach will be.  For example, the 3-phase approach of WHOPR is fairly sophisticated (it actually follows closely some research done at Rice U. and IBM on scalable interprocedural analysis, in the same group where Preston did his Ph.D.).<br><br></span><span style="font-family:monospace">At Google, we care mostly about peak optimization performance. Peak</span><br style="font-family:monospace"><span style="font-family:monospace">optimization is basically PGO + CMO. 
For cross-module optimization</span><br style="font-family:monospace"><span style="font-family:monospace">(CMO) to be usable for large applications, a small memory footprint is</span><br style="font-family:monospace"><span style="font-family:monospace">just one aspect; fast build time is equally important. Peak</span><br style="font-family:monospace"><span style="font-family:monospace">optimization is used not only in release builds but in the developer</span><br style="font-family:monospace"><span style="font-family:monospace">workflow too. This means build time with CMO should be as close to O2</span><br style="font-family:monospace"><span style="font-family:monospace">build time as possible. This matters to compiler engineers too</span><br style="font-family:monospace"><span style="font-family:monospace">-- you don't want to wait more than 20 min to hit a breakpoint when</span><br style="font-family:monospace"><span style="font-family:monospace">debugging a compiler problem :)</span><br style="font-family:monospace"><br style="font-family:monospace"><span style="font-family:monospace">For this reason, GCC LTO is not used at Google. Instead, a much more</span><br style="font-family:monospace"><span style="font-family:monospace">scalable solution called LIPO is widely used for CMO:</span><br style="font-family:monospace"><a href="https://gcc.gnu.org/wiki/LightweightIpo" target="_blank" style="font-family:monospace">https://gcc.gnu.org/wiki/LightweightIpo</a><span style="font-family:monospace">. LIPO by design requires PGO.</span><br style="font-family:monospace"><br style="font-family:monospace"><span style="font-family:monospace">While LIPO is scalable, it has its own limitation that prevents the</span><br style="font-family:monospace"><span style="font-family:monospace">compiler from maximizing the benefit of CMO. 
The new design is</span><br style="font-family:monospace"><span style="font-family:monospace">intended to solve this problem, with more aggressive objectives.</span><br style="font-family:monospace"><span style="font-family:monospace">The new design is pretty simple and shares the basic principles of</span><br style="font-family:monospace"><span style="font-family:monospace">LIPO without requiring PGO (though it still works best with PGO). It</span><br style="font-family:monospace"><span style="font-family:monospace">still fits in the LTO framework, so toolchain support changes are</span><br style="font-family:monospace"><span style="font-family:monospace">minimized. For now, without giving details, I can share some of the</span><br style="font-family:monospace"><span style="font-family:monospace">objectives of the new design:</span><br style="font-family:monospace"><br style="font-family:monospace"><span style="font-family:monospace">    * The build should be almost fully parallelizable (at both the process </span></div><div><span style="font-family:monospace">      level and the build-machine node level)</span><br style="font-family:monospace"><span style="font-family:monospace">    * The build should scale to programs of *any/unlimited* size</span><br style="font-family:monospace"><span style="font-family:monospace">      (measured in number of TUs). 
It should handle programs 10x or 100x </span></div><div><span style="font-family:monospace">       the </span><span style="font-family:monospace">size of Firefox.</span><br style="font-family:monospace"><span style="font-family:monospace">    * The build time should be very close to that of a non-LTO build, so that</span><br style="font-family:monospace"><span style="font-family:monospace">      it can be considered for enabling *by default* for O2 or at least O3</span><br style="font-family:monospace"><span style="font-family:monospace">      compilations.</span><br style="font-family:monospace"><span style="font-family:monospace">    * When turned on by default, it can eliminate the need for</span><br style="font-family:monospace"><span style="font-family:monospace">      users to put inline functions in header files (thus greatly</span><br style="font-family:monospace"><span style="font-family:monospace">      improving parsing time)</span><br style="font-family:monospace"><span style="font-family:monospace">    * Most of the benefit of CMO comes from cross-module inlining and</span><br style="font-family:monospace"><span style="font-family:monospace">      cross-module indirect call promotion.  
By default, the design </span></div><div><span style="font-family:monospace">      only </span><span style="font-family:monospace">enables these two, but it is still compatible with any </span></div><div><span style="font-family:monospace">      whole-program </span><span style="font-family:monospace">analysis, which can be turned on with additional options.</span><br style="font-family:monospace"><span class="im" style="font-family:monospace"><br></span>Thanks,</div><div><br></div><div>David</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Dec 25, 2014 at 11:55 PM, Adve, Vikram Sadanand <span dir="ltr"><<a href="mailto:vadve@illinois.edu" target="_blank">vadve@illinois.edu</a>></span> wrote:<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Diego, Teresa, David,<br>

<br>

Sorry for my delayed reply; I left for vacation right after sending my message about this.<br>

<br>

Diego, it wasn't explicit from your message whether LLVM LTO can handle Firefox-scale programs, which you said GCC can handle.  I assumed that's what you meant, but could you confirm that?  I understand that neither can handle the very large Google applications, but that's probably not a near-term concern for a project like the one Charles is embarking on.<br>

<br>

I'd be interested to hear more about the LTO design you folks are working on, whenever you're ready to share the details.  I read the GCC design docs on LTO, and I'm curious how similar or different your approach will be.  For example, the 3-phase approach of WHOPR is fairly sophisticated (it actually follows closely some research done at Rice U. and IBM on scalable interprocedural analysis, in the same group where Preston did his Ph.D.).<br>

<br>

For now, I would like to introduce you all to Charles, so that he has access to people working on this issue, which will probably continue to be a concern for his project.  I have copied you on my reply to him.<br>

<br>

Thanks for the information.<br>

<span class="im HOEnZb"><br>

--Vikram S. Adve<br>

Visiting Professor, Computer Science, EPFL<br>

Professor, Department of Computer Science<br>

University of Illinois at Urbana-Champaign<br>

<a href="mailto:vadve@illinois.edu">vadve@illinois.edu</a><br>

<a href="http://llvm.org" target="_blank">http://llvm.org</a><br>

<br>

<br>

<br>

<br>

</span><div class="HOEnZb"><div class="h5">On Dec 16, 2014, at 3:48 AM, Teresa Johnson <<a href="mailto:tejohnson@google.com">tejohnson@google.com</a>> wrote:<br>

<br>

> On Fri, Dec 12, 2014 at 1:59 PM, Diego Novillo <<a href="mailto:dnovillo@google.com">dnovillo@google.com</a>> wrote:<br>

>> On 12/12/14 15:56, Adve, Vikram Sadanand wrote:<br>

>>><br>

>>> I've been asked how LTO in LLVM compares to equivalent capabilities<br>

>>> in GCC.  How do the two compare in terms of scalability?  And<br>

>>> robustness for large applications?<br>

>><br>

>><br>

>> Neither GCC nor LLVM can handle our (Google) large applications. They're<br>

>> just too massive for the kind of linking done by LTO.<br>

>><br>

>> When we built GCC's LTO, we were trying to address this by creating a<br>

>> partitioned model, where the analysis phase and the codegen phase are split<br>

>> to allow working on partial callgraphs<br>

>> (<a href="http://gcc.gnu.org/wiki/LinkTimeOptimization" target="_blank">http://gcc.gnu.org/wiki/LinkTimeOptimization</a> for details).<br>

>><br>

>> This allows us to split and parallelize the initial bytecode generation and the<br>

>> final optimization/codegen. However, the analysis is still implemented as a<br>

>> single process. We found that we cannot even load summaries, types and<br>

>> symbols in an efficient way.<br>

>><br>

>> It does allow for programs like Firefox to be handled. So, if by "big" you<br>

>> need to handle something of that size, this model can do it.<br>

>><br>

>> With LLVM, I can't even load the IR for one of our large programs on a box<br>

>> with 64GB of RAM.<br>

>><br>

>>> Also, are there any ongoing efforts or plans to improve LTO in LLVM<br>

>>> in the near future?<br>

>><br>

>><br>

>> Yes. We are going to be investing in this area very soon. David and Teresa<br>

>> (CC'd) will have details.<br>

><br>

> Still working out the details, but we are investigating a solution<br>

> that is scalable to very large programs. We'll share the design in the<br>

> near future when we have more details worked out so that we can get<br>

> feedback.<br>

><br>

> Thanks!<br>

> Teresa<br>

><br>

>><br>

>><br>

>> Diego.<br>

><br>

><br>

><br>

> --<br>

> Teresa Johnson | Software Engineer | <a href="mailto:tejohnson@google.com">tejohnson@google.com</a> | <a href="tel:408-460-2413" value="+14084602413">408-460-2413</a><br>

<br>

<br>

_______________________________________________<br>

LLVM Developers mailing list<br>

<a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

</div></div></blockquote></div></div>