[llvm-dev] [RFC] Profile guided section layout

Mon Jul 31 20:32:05 PDT 2017

On Mon, Jul 31, 2017 at 2:20 PM, Rafael Avila de Espindola <
rafael.espindola at gmail.com> wrote:

> Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
> > I've recently implemented profile guided section layout in llvm + lld
> using
> > the Call-Chain Clustering (C³) heuristic from
> > https://research.fb.com/wp-content/uploads/2017/01/
> cgo2017-hfsort-final1.pdf
> > . In the programs I've tested it on I've gotten from 0% to 5% performance
> > improvement over standard PGO with zero cases of slowdowns and up to 15%
> > reduction in ITLB misses.
> >
> >
> > There are three parts to this implementation.
> >
> > The first is a new llvm pass which uses branch frequency info to get
> counts
> > for each call instruction and then adds a module flags metatdata table of
> > function -> function edges along with their counts.
> >
> > The second takes the module flags metadata and writes it into a
> > .note.llvm.callgraph section in the object file. This currently just
> dumps
> > it as text, but could save space by reusing the string table.
> >
> > The last part is in lld. It reads the .note.llvm.callgraph data from each
> > object file and merges them into a single table. It then builds a call
> > graph based on the profile data then iteratively merges the hottest call
> > edges using the C³ heuristic as long as it would not create a cluster
> > larger than the page size. All clusters are then sorted by a density
> metric
> > to further improve locality.
>
> Since the branch frequency info is in a llvm specific format, it makes
> sense for llvm to read it instead of expecting lld to do it again. Since
> .o files is how the compiler talks to the linker, it also makes sense
> for llvm to record the required information there.
>
> In the same way, since the linker is the first place with global
> knowledge, it makes sense for it to be the one that implements a section
> ordering heuristic instead of just being told by some other tool, which
> would complicate the build.
>
> However, do we need to start with instrumentation? The original paper
> uses sampling with good results and current intel cpus can record every
> branch in a program.
>
> I would propose starting with just an lld patch that reads the call
> graph from a file. The format would be very similar to what you propose,
> just weight,caller,callee.
>
> In a another patch we can then look at instrumentation: Why it is more
> convenient for some uses and what performance advantage it might have.
>
> I have written a small tool that usesr intel_bts and 'perf script' to
> construct the callgraph. I am giving it a try with your lld patch and
> will hopefully post results today.
>
> Cheers,
> Rafael
>

I'm fine with approaching this from lld first, as the input reading is the
simplest part of this. I started from instrumentation based because I
started with a target program that already had the infrastructure setup to
do it.

- Michael Spencer
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170731/e2bd02cb/attachment.html>