[llvm-dev] [RFC] Profile guided section layout

Mon Jul 31 16:18:04 PDT 2017

On Jul 31, 2017 2:21 PM, "Rafael Avila de Espindola via llvm-dev" <
llvm-dev at lists.llvm.org> wrote:

Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes:

> I've recently implemented profile guided section layout in llvm + lld
using
> the Call-Chain Clustering (C³) heuristic from
> https://research.fb.com/wp-content/uploads/2017/01/cgo2017-
hfsort-final1.pdf
> . In the programs I've tested it on I've gotten from 0% to 5% performance
> improvement over standard PGO with zero cases of slowdowns and up to 15%
> reduction in ITLB misses.
>
>
> There are three parts to this implementation.
>
> The first is a new llvm pass which uses branch frequency info to get
counts
> for each call instruction and then adds a module flags metatdata table of
> function -> function edges along with their counts.
>
> The second takes the module flags metadata and writes it into a
> .note.llvm.callgraph section in the object file. This currently just dumps
> it as text, but could save space by reusing the string table.
>
> The last part is in lld. It reads the .note.llvm.callgraph data from each
> object file and merges them into a single table. It then builds a call
> graph based on the profile data then iteratively merges the hottest call
> edges using the C³ heuristic as long as it would not create a cluster
> larger than the page size. All clusters are then sorted by a density
metric
> to further improve locality.

Since the branch frequency info is in a llvm specific format, it makes
sense for llvm to read it instead of expecting lld to do it again. Since
.o files is how the compiler talks to the linker, it also makes sense
for llvm to record the required information there.

In the same way, since the linker is the first place with global
knowledge, it makes sense for it to be the one that implements a section
ordering heuristic instead of just being told by some other tool, which
would complicate the build.

However, do we need to start with instrumentation? The original paper
uses sampling with good results and current intel cpus can record every
branch in a program.

Both sample and instrumentation are turned into the same profile metadata
in the IR which is what feeds BranchFrequencyInfo. So it should be
naturally agnostic to which method is used.

-- Sean Silva

I would propose starting with just an lld patch that reads the call
graph from a file. The format would be very similar to what you propose,
just weight,caller,callee.

In a another patch we can then look at instrumentation: Why it is more
convenient for some uses and what performance advantage it might have.

I have written a small tool that usesr intel_bts and 'perf script' to
construct the callgraph. I am giving it a try with your lld patch and
will hopefully post results today.

Cheers,
Rafael
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170731/f8bd7efe/attachment.html>