[llvm-dev] [RFC] Profile guided section layout

Mon Jul 31 14:20:45 PDT 2017

Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes:

> I've recently implemented profile guided section layout in llvm + lld using
> the Call-Chain Clustering (C³) heuristic from
> https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
> . In the programs I've tested it on I've gotten from 0% to 5% performance
> improvement over standard PGO with zero cases of slowdowns and up to 15%
> reduction in ITLB misses.
>
>
> There are three parts to this implementation.
>
> The first is a new llvm pass which uses branch frequency info to get counts
> for each call instruction and then adds a module flags metatdata table of
> function -> function edges along with their counts.
>
> The second takes the module flags metadata and writes it into a
> .note.llvm.callgraph section in the object file. This currently just dumps
> it as text, but could save space by reusing the string table.
>
> The last part is in lld. It reads the .note.llvm.callgraph data from each
> object file and merges them into a single table. It then builds a call
> graph based on the profile data then iteratively merges the hottest call
> edges using the C³ heuristic as long as it would not create a cluster
> larger than the page size. All clusters are then sorted by a density metric
> to further improve locality.

Since the branch frequency info is in a llvm specific format, it makes
sense for llvm to read it instead of expecting lld to do it again. Since
.o files is how the compiler talks to the linker, it also makes sense
for llvm to record the required information there.

In the same way, since the linker is the first place with global
knowledge, it makes sense for it to be the one that implements a section
ordering heuristic instead of just being told by some other tool, which
would complicate the build.

However, do we need to start with instrumentation? The original paper
uses sampling with good results and current intel cpus can record every
branch in a program.

I would propose starting with just an lld patch that reads the call
graph from a file. The format would be very similar to what you propose,
just weight,caller,callee.

In a another patch we can then look at instrumentation: Why it is more
convenient for some uses and what performance advantage it might have.

I have written a small tool that usesr intel_bts and 'perf script' to
construct the callgraph. I am giving it a try with your lld patch and
will hopefully post results today.

Cheers,
Rafael