<div dir="ltr"><div class="gmail_extra"><div><div class="gmail_signature">On Tue, Aug 1, 2017 at 1:57 PM, Justin Bogner <span dir="ltr"><<a href="mailto:mail@justinbogner.com" target="_blank">mail@justinbogner.com</a>></span> wrote:<br></div></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="gmail-HOEnZb"><div class="gmail-h5">Rafael Avila de Espindola via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> writes:<br>

> Michael Spencer via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> writes:<br>

><br>

>> I've recently implemented profile guided section layout in llvm + lld using<br>

>> the Call-Chain Clustering (C³) heuristic from<br>

>> <a href="https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf" rel="noreferrer" target="_blank">https://research.fb.com/wp-<wbr>content/uploads/2017/01/<wbr>cgo2017-hfsort-final1.pdf</a><br>

>> . In the programs I've tested it on I've gotten from 0% to 5% performance<br>

>> improvement over standard PGO with zero cases of slowdowns and up to 15%<br>

>> reduction in ITLB misses.<br>

>><br>

>><br>

>> There are three parts to this implementation.<br>

>><br>

>> The first is a new llvm pass which uses branch frequency info to get counts<br>

>> for each call instruction and then adds a module flags metatdata table of<br>

>> function -> function edges along with their counts.<br>

>><br>

>> The second takes the module flags metadata and writes it into a<br>

>> .note.llvm.callgraph section in the object file. This currently just dumps<br>

>> it as text, but could save space by reusing the string table.<br>

>><br>

>> The last part is in lld. It reads the .note.llvm.callgraph data from each<br>

>> object file and merges them into a single table. It then builds a call<br>

>> graph based on the profile data then iteratively merges the hottest call<br>

>> edges using the C³ heuristic as long as it would not create a cluster<br>

>> larger than the page size. All clusters are then sorted by a density metric<br>

>> to further improve locality.<br>

><br>

> Since the branch frequency info is in a llvm specific format, it makes<br>

> sense for llvm to read it instead of expecting lld to do it again. Since<br>

> .o files is how the compiler talks to the linker, it also makes sense<br>

> for llvm to record the required information there.<br>

><br>

> In the same way, since the linker is the first place with global<br>

> knowledge, it makes sense for it to be the one that implements a section<br>

> ordering heuristic instead of just being told by some other tool, which<br>

> would complicate the build.<br>

><br>

> However, do we need to start with instrumentation? The original paper<br>

> uses sampling with good results and current intel cpus can record every<br>

> branch in a program.<br>

<br>

</div></div>This already works without instrumentation. You can probably try it out<br>

as is with profiles generated with linux perf using the create_llvm_prof<br>

tool from the autofdo work: <a href="https://github.com/google/autofdo" rel="noreferrer" target="_blank">https://github.com/google/<wbr>autofdo</a></blockquote><div><br></div><div>I'm pretty sure by "start with instrumentation" he means start with the restrictions it imposes on having to traffic the data through the object file.</div><div><br></div><div>- Michael Spencer<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

<span class="gmail-im gmail-HOEnZb"><br>

> I would propose starting with just an lld patch that reads the call<br>

> graph from a file. The format would be very similar to what you propose,<br>

> just weight,caller,callee.<br>

><br>

> In a another patch we can then look at instrumentation: Why it is more<br>

> convenient for some uses and what performance advantage it might have.<br>

><br>

> I have written a small tool that usesr intel_bts and 'perf script' to<br>

> construct the callgraph. I am giving it a try with your lld patch and<br>

> will hopefully post results today.<br>

><br>

> Cheers,<br>

> Rafael<br>

</span><div class="gmail-HOEnZb"><div class="gmail-h5">> ______________________________<wbr>_________________<br>

> LLVM Developers mailing list<br>

> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-dev</a><br>

</div></div></blockquote></div><br></div></div>