Automatic PGO - Initial implementation (1/N)

Diego Novillo dnovillo at google.com
Thu Sep 26 13:29:38 PDT 2013


On Thu, Sep 26, 2013 at 4:04 PM, Evan Cheng <evan.cheng at apple.com> wrote:
>
> On Sep 26, 2013, at 4:55 AM, Diego Novillo <dnovillo at google.com> wrote:
>
>> On Wed, Sep 25, 2013 at 8:54 PM, Evan Cheng <evan.cheng at apple.com> wrote:
>>
>>> Hmm. Scalar transformation seems *wrong* but you are right it can't be an analysis pass. A couple of ideas:
>>>
>>> 1. Can we implement it not as a pass but a utility that clients can use?
>>> 2. Move it to lib/Transform/Instrumentation?
>>
>> One hard requirement for this feature is to be enabled as a regular
>> compiler option. Users are expecting this interface:
>>
>> $ clang -O2 -fauto-profile foo.cc -o foo
>>
>> (the actual name of the flag is irrelevant, of course)
>>
>> I am not sure what you mean by utility in the context of LLVM. Can I
>> implement utilities so that the above interface works?
>
> $ clang -O2 -fauto-profile foo.cc -o foo
>
> What does this do? Is it annotating the IR for one source file? Or is this producing an annotated executable? If it's the former, then I see your point.

The -fauto-profile option will read the profile information and emit
metadata in the IR for all the functions it sees in the translation
unit.  For example, if line 10 of function foo() was executed 10,000
times, all the IR instructions associated with line 10 will get the
annotation 'autoprofile.samples 10000' (or some such).

This information is then used by the analysis routines when computing
block and edge weights. Instructions that have a large fraction of
collected samples will automatically mark CFG paths as hot. The side
effect, then, is that the optimizers will be able to build more
accurate cost models, which translates to better optimization
decisions.

So, no instrumentation is generated.  The profile information comes
entirely from an external profiling source that observes the original
binary as it executes.

The whole auto profile optimization cycle looks like this:

1- Compile the application with minimal debug information (just line
tables should be fine) and the usual optimization flags.
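
For example (just a sketch; assuming -gline-tables-only, which tells
clang to emit only the line tables, counts as "minimal debug
information"):

$ clang -O2 -gline-tables-only foo.cc -o foo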

2- Run the binary in its production environment alongside a profiling
source (e.g., run it under perf or have oprofile enabled or ...
anything that produces profile info).
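
For instance, on Linux this could be as simple as sampling with perf
(just an illustration; any profiler that can attribute samples back to
addresses or line numbers would do):

$ perf record ./foo        # writes the collected samples to perf.data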

3- Run a tool that converts the external profile information into a
format that -fauto-profile understands. The current implementation
understands flat profiles like:

symbol table
2
_Z7computei
main
_Z7computei:230803:0:9
1: 0
2: 0
3: 3
4: 0
5: 10142
6: 9956
7: 69748
9: 5
10: 0
[ ...]

This very simplistic profile says that two functions were sampled:
main and compute. In compute(), there were a total of ~231,000
samples, some of which were mapped to specific lines in the function.
In particular, line 7 was sampled ~70,000 times, which makes it very
hot relative to the rest of the function.

Also note that there is some degree of lossiness in the profile: not
all the samples were mapped to line numbers. That's OK; this method is
known to be lossy and produces inferior results to traditional
instrumentation.
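
Just to illustrate the shape of this step (the tool name and arguments
below are hypothetical), the conversion might look something like:

$ perf2autoprof perf.data ./foo > foo.autoprof   # hypothetical converter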

4- Go back to #1 and add -fauto-profile to the build. The collected
samples get translated into IR annotations, which (in turn) feed the
computation of branch/edge weights and every other analysis that
relies on execution frequency. This is where the benefits of the
profile information are incorporated into the optimization decisions.
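
Using the original example, and assuming the same flags as step #1,
the rebuild would look something like this (how the profile file from
step #3 is located is a separate detail not shown here):

$ clang -O2 -gline-tables-only -fauto-profile foo.cc -o foo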

Steps #1 and #2 are the usual steps done during application
development. The auto profile framework adds step #3.

> I don't really have a strong opinion. It just seems a bit off to call this a scalar transformation pass since that term
> usually means some kind of optimization pass.

Sure. This pass does not perform an optimization in the traditional
sense. I'm indifferent to where we put it, as long as it can be
enabled with a compiler flag.


Thanks.  Diego.


