[llvm-dev] [RFC] Profile guided section layout

Tue Aug 1 13:57:02 PDT 2017

Rafael Avila de Espindola via llvm-dev <llvm-dev at lists.llvm.org> writes:
> Michael Spencer via llvm-dev <llvm-dev at lists.llvm.org> writes:
>
>> I've recently implemented profile guided section layout in llvm + lld using
>> the Call-Chain Clustering (C³) heuristic from
>> https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
>> . In the programs I've tested it on I've gotten from 0% to 5% performance
>> improvement over standard PGO with zero cases of slowdowns and up to 15%
>> reduction in ITLB misses.
>>
>>
>> There are three parts to this implementation.
>>
>> The first is a new llvm pass which uses branch frequency info to get counts
>> for each call instruction and then adds a module flags metatdata table of
>> function -> function edges along with their counts.
>>
>> The second takes the module flags metadata and writes it into a
>> .note.llvm.callgraph section in the object file. This currently just dumps
>> it as text, but could save space by reusing the string table.
>>
>> The last part is in lld. It reads the .note.llvm.callgraph data from each
>> object file and merges them into a single table. It then builds a call
>> graph based on the profile data then iteratively merges the hottest call
>> edges using the C³ heuristic as long as it would not create a cluster
>> larger than the page size. All clusters are then sorted by a density metric
>> to further improve locality.
>
> Since the branch frequency info is in a llvm specific format, it makes
> sense for llvm to read it instead of expecting lld to do it again. Since
> .o files is how the compiler talks to the linker, it also makes sense
> for llvm to record the required information there.
>
> In the same way, since the linker is the first place with global
> knowledge, it makes sense for it to be the one that implements a section
> ordering heuristic instead of just being told by some other tool, which
> would complicate the build.
>
> However, do we need to start with instrumentation? The original paper
> uses sampling with good results and current intel cpus can record every
> branch in a program.

This already works without instrumentation. You can probably try it out
as is with profiles generated with linux perf using the create_llvm_prof
tool from the autofdo work: https://github.com/google/autofdo

> I would propose starting with just an lld patch that reads the call
> graph from a file. The format would be very similar to what you propose,
> just weight,caller,callee.
>
> In a another patch we can then look at instrumentation: Why it is more
> convenient for some uses and what performance advantage it might have.
>
> I have written a small tool that usesr intel_bts and 'perf script' to
> construct the callgraph. I am giving it a try with your lld patch and
> will hopefully post results today.
>
> Cheers,
> Rafael
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev