<div dir="ltr"><div class="gmail_extra"><div><div class="gmail_signature">On Mon, Jul 31, 2017 at 2:20 PM, Rafael Avila de Espindola <span dir="ltr"><<a href="mailto:rafael.espindola@gmail.com" target="_blank">rafael.espindola@gmail.com</a>></span> wrote:<br></div></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-">Michael Spencer via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> writes:<br>

<br>

> I've recently implemented profile guided section layout in llvm + lld using<br>

> the Call-Chain Clustering (C³) heuristic from<br>

> <a href="https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf" rel="noreferrer" target="_blank">https://research.fb.com/wp-<wbr>content/uploads/2017/01/<wbr>cgo2017-hfsort-final1.pdf</a><br>

> . In the programs I've tested it on I've gotten from 0% to 5% performance<br>

> improvement over standard PGO with zero cases of slowdowns and up to 15%<br>

> reduction in ITLB misses.<br>

><br>

><br>

> There are three parts to this implementation.<br>

><br>

> The first is a new llvm pass which uses branch frequency info to get counts<br>

> for each call instruction and then adds a module flags metatdata table of<br>

> function -> function edges along with their counts.<br>

><br>

> The second takes the module flags metadata and writes it into a<br>

> .note.llvm.callgraph section in the object file. This currently just dumps<br>

> it as text, but could save space by reusing the string table.<br>

><br>

> The last part is in lld. It reads the .note.llvm.callgraph data from each<br>

> object file and merges them into a single table. It then builds a call<br>

> graph based on the profile data then iteratively merges the hottest call<br>

> edges using the C³ heuristic as long as it would not create a cluster<br>

> larger than the page size. All clusters are then sorted by a density metric<br>

> to further improve locality.<br>

<br>

</span>Since the branch frequency info is in a llvm specific format, it makes<br>

sense for llvm to read it instead of expecting lld to do it again. Since<br>

.o files is how the compiler talks to the linker, it also makes sense<br>

for llvm to record the required information there.<br>

<br>

In the same way, since the linker is the first place with global<br>

knowledge, it makes sense for it to be the one that implements a section<br>

ordering heuristic instead of just being told by some other tool, which<br>

would complicate the build.<br>

<br>

However, do we need to start with instrumentation? The original paper<br>

uses sampling with good results and current intel cpus can record every<br>

branch in a program.<br>

<br>

I would propose starting with just an lld patch that reads the call<br>

graph from a file. The format would be very similar to what you propose,<br>

just weight,caller,callee.<br>

<br>

In a another patch we can then look at instrumentation: Why it is more<br>

convenient for some uses and what performance advantage it might have.<br>

<br>

I have written a small tool that usesr intel_bts and 'perf script' to<br>

construct the callgraph. I am giving it a try with your lld patch and<br>

will hopefully post results today.<br>

<br>

Cheers,<br>

Rafael<br>

</blockquote></div><br></div><div class="gmail_extra">I'm fine with approaching this from lld first, as the input reading is the simplest part of this. I started from instrumentation based because I started with a target program that already had the infrastructure setup to do it.</div><div class="gmail_extra"><br></div><div class="gmail_extra">- Michael Spencer<br></div></div>