<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, May 19, 2015 at 4:09 PM, Nick Lewycky <span dir="ltr"><<a href="mailto:nlewycky@google.com" target="_blank">nlewycky@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><span class=""><div class="gmail_quote">On 13 May 2015 at 11:44, Teresa Johnson <span dir="ltr"><<a href="mailto:tejohnson@google.com" target="_blank">tejohnson@google.com</a>></span> wrote:<br></div></span><div class="gmail_quote"><span class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">I've included below an RFC for implementing ThinLTO in LLVM, looking<br>

forward to feedback and questions.<br></blockquote><div><br></div></span><div>Thanks! I have to admit up front that I haven't read through the whole thread, but I have a couple of comments. Overall this looks like a really nice design and an unusually thorough RFC!</div><div><div class="h5"><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><br></blockquote><div><br></div></div></div><div>This is different from llvm's current LTO approach ("big bang LTO", where we combine all TUs into a single big Module and then optimize and codegen it). It sounds like there are two goals here, multi-machine parallelism and reducing memory usage (by splitting the Module out to multiple machines), and most of the interesting logic goes into deciding where to split a Module.</div><div><br></div><div>I think ThinLTO was designed under the assumption that we would not be able to fit a large program into memory on a single machine (or that even if we could, we wouldn't be able to compile quickly enough by employing multi-core parallelism). This is in contrast to previously considered approaches of improving big bang LTO to handle very large programs through changes to the IR, in-memory representation, on-disk representation and threading. Starting with the assumption that we will need multiple machines, ThinLTO looks like an excellent design. 
I just wanted to call out that design requirement and how it's different from how llvm has thought about LTO in the past.</div></div></div></div></blockquote><div><br></div><div>ThinLTO is designed to be the Corolla, while LTO will continue to be the Mercedes :)</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><div class="h5"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

The function index/summary will later be added as a special ELF<br>

section alongside the .llvmbc sections.<br></blockquote><div><br></div></div></div><div>We've historically pushed back on adding ELF because it doesn't add any new information that isn't present in the .bc file, and we care a lot about minimizing I/O time (I recall an encoding change in the bitcode format shrinking .bc files 10% which led to a big improvement in LTO times for Darwin).</div></div></div></div></blockquote><div><br></div><div>For LTO, which is already highly stressed, even a small increase in I/O matters a lot.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><br></div><div>There's a few practical matters about what needs to be in this ELF symbol table; what about symbols that we reference, instead of just those we define? </div></div></div></div></blockquote><div><br></div><div>Use UNDEF symbols, or match what the plugin does.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>what about the sizes of symbols we define? </div></div></div></div></blockquote><div><br></div><div>In the ELF wrapper, the function is 'defined' in the summary section. 
Its offset and size are the summary entry's offset and size.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>what about the case where llvm codegen ends up defining (or referencing) a function that isn't mentioned in the IR (a common example is emitting a call to memcpy for argument lowering)?</div></div></div></div></blockquote><div><br></div><div>We don't expect the symtab generated for the IR to match that of the final object; for instance, dead function elimination can happen.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div> If you have a set of tools in mind, we can make the ELF accurate enough to work with those tools, but it's not clear to me how to make it work for fully general ELF-expecting programs without doing full codegen into the file (IIRC, this is what GCC does). Are 'ar', 'nm' and 'ld' the only programs?</div></div></div></div></blockquote><div><br></div><div>Also ranlib and objcopy.</div><div><br></div><div>GCC always wraps IR in an ELF wrapper, even when it does not generate a fat object. However, GCC's IR-only ELF files have a customized symtab section.</div><div><br></div><div>ICC generates an ELF file for the IR-only case, with a full ELF symtab generated. It supports fat object files too.</div><div><br></div><div>HP's aCC also generates an ELF wrapper for intermediate files, with a full ELF symtab.  
</div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><br></div><div>Finally, suppose you get into a situation where you implement ThinLTO with the elf wrappers and then examine the compile time, memory usage, file size and I/O, and find that ThinLTO isn't performing as well as we like. The next question is going to be "well, what if we removed that extra I/O time, file size (copying time) and memory usage from having that ELF wrapper"? That's why I think of a .bc-only version as being the ideal version, and that having ELF wrapping is a good idea for supporting legacy programs as needed.</div></div></div></div></blockquote><div><br></div><div>I like the best of both worlds. </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><div class="h5"><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><br></blockquote></div></div><div>I have an idea for a future version.</div><div><br></div><div>Give passes the ability to write their own summary data at compile time, and to read them in the backends. Merge these summaries in the link, then after splitting send the merged summaries to each backend regardless of whether it imports the function body. For instance, dead argument elimination could summarize which functions ignore which arguments (either entirely, or locally except for which arguments in which callees). Receiving a full graph of this is smaller than the full implementations of the functions, and yet would allow each backend to do an analysis of the full graph. Function A's body is in this backend, and A calls B whose body is not available to this backend. 
The summary would include that the first argument to B is dead, so we can optimize away the chain of computation leading to it in A. (I think a more compelling example will be alias analysis, but it would make for a messier example.)</div><span class="HOEnZb"><font color="#888888"><div><br></div></font></span></div></div></div></blockquote><div><br></div><div>Yes -- that is what we had thought about doing for LIPO -- callgraphs and whole-program alias information are good candidates.  So what you describe is what we'd like to do for ThinLTO. Those global analyses are more expensive than the fast indexing, but can be controlled with knobs.</div><div><br></div><div>thanks,</div><div><br></div><div>David</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="HOEnZb"><font color="#888888"><div></div><div>Nick</div></font></span><div><div class="h5"><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">e. ThinLTO importing support:<br>

<br>

Support for the mechanics of importing functions from other modules,<br>

which can go in gradually as a set of patches since it will be off by<br>

default. Separate patches can include:<br>

<br>

- BitcodeReader changes to use function index to import/deserialize<br>

single function of interest (small changes, leverages existing lazy<br>

streamer support).<br>

<br>

- Minor LTOModule changes to pass the ThinLTO function to import and<br>

its index into the bitcode reader.<br>

<br>

- Marking of imported functions (for use in ThinLTO-specific symbol<br>

linking and global DCE, for example). This can be in-memory initially,<br>

but IR support may be required in order to support streaming bitcode<br>

out and back in again after importing.<br>

<br>

- ModuleLinker changes to do ThinLTO-specific symbol linking and<br>

static promotion when necessary. The linkage type of imported<br>

functions changes to AvailableExternallyLinkage, for example. Statics<br>

must be promoted in certain cases, and renamed in consistent ways.<br>

<br>

- GlobalDCE changes to support removing imported functions that were<br>

not inlined (very small changes to existing pass logic).<br>
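<br>
As a rough illustration of the importing mechanics listed above, here is a<br>
minimal Python model (not LLVM code). The Function record, the<br>
".promoted." naming scheme and all helper names are assumptions made up for<br>
this sketch; only the ideas (available_externally linkage on import,<br>
consistent renaming of promoted statics, dropping imports that were not<br>
inlined) come from the list.<br>

```python
from dataclasses import dataclass, replace


@dataclass
class Function:
    name: str
    module: str                 # defining module
    linkage: str = "external"   # "external", "internal" (static) or "available_externally"
    imported: bool = False      # in-memory marker for imported copies
    callees: tuple = ()


def promoted_name(fn):
    # Hypothetical scheme: every module derives the same name from
    # (original name, defining module), so the renaming is consistent.
    return f"{fn.name}.promoted.{fn.module}"


def promote_static(fn):
    """Give a static a globally unique name and external linkage."""
    return replace(fn, name=promoted_name(fn), linkage="external")


def import_function(fn, index):
    """Copy `fn` for use in another module, rewriting references to statics."""
    callees = tuple(
        promoted_name(index[c]) if index[c].linkage == "internal" else c
        for c in fn.callees
    )
    name = promoted_name(fn) if fn.linkage == "internal" else fn.name
    return replace(fn, name=name, callees=callees,
                   linkage="available_externally", imported=True)


def drop_uninlined_imports(functions, inlined):
    """Model the GlobalDCE step: imported bodies that were not inlined go away."""
    return [f for f in functions if not f.imported or f.name in inlined]
```

The marker is kept in memory only here; the list notes that real IR support<br>
may be needed to survive streaming bitcode out and back in.<br>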

<br>

<br>

f. ThinLTO Import Driver SCC pass:<br>

<br>

Adds Transforms/IPO/ThinLTO.cpp with framework for doing ThinLTO via<br>

an SCC pass, enabled only under -fthinlto options. The pass includes<br>

utilizing the thin archive (global function index/summary), import<br>

decision heuristics, invocation of LTOModule/ModuleLinker routines<br>

that perform the import, and any necessary callgraph updates and<br>

verification.<br>

<br>

<br>

g. Backend Driver:<br>

<br>

For a single node build, the gold plugin can simply write a makefile<br>

and fork the parallel backend instances directly via parallel make.<br>
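<br>
A minimal sketch of what "write a makefile" could look like, in Python<br>
rather than plugin code. The backend command name "thinlto-backend" and its<br>
argument layout are placeholders invented for this example, not a real tool<br>
or real flags.<br>

```python
def write_backend_makefile(ir_to_object, backend_cmd="thinlto-backend"):
    """Return makefile text with one independent rule per IR file.

    `ir_to_object` maps each IR file to the object file it should produce.
    `backend_cmd` is a placeholder command name, not a real tool.
    """
    objects = " ".join(ir_to_object.values())
    lines = [".PHONY: all", f"all: {objects}", ""]
    for ir_file, obj_file in ir_to_object.items():
        # Each object depends only on its own IR file, so make can run
        # the backend instances in parallel.
        lines.append(f"{obj_file}: {ir_file}")
        lines.append(f"\t{backend_cmd} {ir_file} -o {obj_file}")
        lines.append("")
    return "\n".join(lines)
```

The generated file could then be handed to a parallel make invocation<br>
(make -j N -f FILE) to fork the backend instances.<br>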

<br>

<br>

3. Stage 3: ThinLTO Tuning and Enhancements<br>

----------------------------------------------------------------<br>

<br>

This refers to the patches that are not required for ThinLTO to work,<br>

but rather to improve compile time, memory, run-time performance and<br>

usability.<br>

<br>

<br>

a. Lazy Debug Metadata Linking:<br>

<br>

The prototype implementation included lazy importing of module-level<br>

metadata during the ThinLTO pass finalization (i.e. after all function<br>

importing is complete). This actually applies to all module-level<br>

metadata, not just debug, although it is the largest. This can be<br>

added as a separate set of patches, with changes to BitcodeReader,<br>

ValueMapper, and ModuleLinker.<br>

<br>

<br>

b. Import Tuning:<br>

<br>

Tuning the import strategy will be an iterative process that will<br>

continue to be refined over time. It involves several different types<br>

of changes: adding support for recording additional metrics in the<br>

function summary, such as profile data and optional heavier-weight IPA<br>

analyses, and tuning the import heuristics based on the summary and<br>

callsite context.<br>

<br>

<br>

c. Combined Function Map Pruning:<br>

<br>

The combined function map can be pruned of functions that are unlikely<br>

to benefit from being imported. For example, during the phase-2 thin<br>

archive plugin step we can safely omit large and (with profile data)<br>

cold functions, which are unlikely to benefit from being inlined.<br>

Additionally, all but one copy of comdat functions can be suppressed.<br>
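<br>
A small Python sketch of such pruning, under stated assumptions: the<br>
Summary fields, the size and coldness thresholds, and the reading that<br>
large functions and cold functions are each dropped on their own are all<br>
illustrative choices, not part of the proposal.<br>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Summary:
    name: str
    module: str
    size: int                          # e.g. an instruction count
    call_count: Optional[int] = None   # None when there is no profile data
    comdat: bool = False


def prune(entries, max_size=100, cold_threshold=10):
    """Drop large functions, cold functions, and duplicate comdat copies."""
    kept = []
    seen_comdats = set()
    for e in entries:
        if e.size > max_size:
            continue  # too large to be a likely inlining candidate
        if e.call_count is not None and e.call_count < cold_threshold:
            continue  # profile data says the function is cold
        if e.comdat:
            if e.name in seen_comdats:
                continue  # keep only the first copy of a comdat function
            seen_comdats.add(e.name)
        kept.append(e)
    return kept
```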

<br>

<br>

d. Distributed Build System Integration:<br>

<br>

For a distributed build system, the gold plugin should write the<br>

parallel backend invocations into a makefile, including the mapping<br>

from the IR file to the real object file path, and exit. Additional<br>

work needs to be done in the distributed build system itself to<br>

distribute and dispatch the parallel backend jobs to the build<br>

cluster.<br>

<br>

<br>

e. Dependence Tracking and Incremental Compiles:<br>

<br>

In order to support build systems that stage from local disks or<br>

network storage, the plugin will optionally support computation of<br>

dependent sets of IR files that each module may import from. This can<br>

be computed from profile data, if it exists, or from the symbol table<br>

and heuristics if not. These dependence sets also enable support for<br>

incremental backend compiles.<br>
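<br>
One way to picture the symbol-table variant is the Python sketch below. It<br>
assumes each module's defined and referenced symbols are already known,<br>
considers direct references only, and ignores the heuristics and profile<br>
data mentioned above; the function names are invented for the example.<br>

```python
def dependence_sets(defined, referenced):
    """Map each module to the other modules that define symbols it references."""
    owner = {sym: mod for mod, syms in defined.items() for sym in syms}
    return {
        mod: {owner[s] for s in syms if s in owner and owner[s] != mod}
        for mod, syms in referenced.items()
    }


def needs_rebuild(deps, changed):
    """Modules whose backend compile must rerun after `changed` modules change."""
    return {mod for mod, d in deps.items() if mod in changed or d & changed}
```

With these sets, a build system only needs to stage the files in a module's<br>
dependence set, and only reruns backends whose inputs changed.<br>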

<span><font color="#888888"><br>

<br>

<br>

--<br>

Teresa Johnson | Software Engineer | <a href="mailto:tejohnson@google.com" target="_blank">tejohnson@google.com</a> | <a href="tel:408-460-2413" value="+14084602413" target="_blank">408-460-2413</a><br>

<br>

_______________________________________________<br>

LLVM Developers mailing list<br>

<a href="mailto:LLVMdev@cs.uiuc.edu" target="_blank">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

</font></span></blockquote></div></div></div><br></div></div>


<br></blockquote></div><br></div></div>