[llvm-dev] RFC: PGO Late instrumentation for LLVM

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Tue Aug 11 13:25:06 PDT 2015


On Tue, Aug 11, 2015 at 9:38 AM, Justin Bogner <mail at justinbogner.com> wrote:
> Rong Xu via llvm-dev <llvm-dev at lists.llvm.org> writes:
>> Instrumentation based Profile Guided Optimization (PGO) is a compiler
>> technique that leverages important program runtime information, such as
>> precise edge counts and frequent value information, to make frequently
>> executed code run faster. It's proven to be one of the most effective ways to
>> improve program performance.
>>
>> An important design point of PGO is to decide where to place the
>> instrumentation. In current LLVM PGO, the instrumentation is done in the Clang
>> Front-End (will be referred as FE based instrumentation). Doing this early in
>> the front-end gives better coverage information as the there are more precise
>> line number information. The resulted profile also has relatively high
>> tolerance to compiler changes. All the compiler change after the
>> instrumentation point will not lead to mismatched IR that invalidates the
>> profile.
>>
>> On the other hand, doing this too early gives the compiler fewer opportunities
>> to optimize the code before instrumentation. It has significant impact on
>> instrumentation runtime performance. In addition, it tends to produce larger
>> binary and  profile data file.  Our internal C++ benchmarks shows that FE
>> based instrumentation degrades the performance (compared to non-instrumented
>> version) by 58.3%, and in some extreme cases, the application speed/throughput
>> decreases by 95% (21x runtime slowdown).
>
> While I agree that we can and should improve the overhead of profiling
> instrumentation, I think that there's a lot of low hanging fruit in
> improving the frontend instrumentation that makes some of the
> comparisons below misleading.
>
> The cost of maintaining two separate ways of profiling is fairly high,

I think the initial bring up cost may be high, but the long term
maintenance cost may not -- I expect it to be fairly stable.

>
>
> As Sean pointed out, this problem might be solveable in other ways.
> Beyond the naive approach of simply ignoring small functions with a
> certain hotness that he suggested, the frontend instrumentation could be
> easily tweaked to do better in these cases in a couple of ways.
>
> 1. It's pretty easy to recognize classes of function in the frontend. It
>    would be trivial to optionally ignore getter-style functions or
>    simple constructors, or whatever. This loses some information that's
>    useful for the coverage use case, but not in a very harmful way.

There are many limitations though:

1) functions that can be trivially cleaned up (CFG simplified) after inlining
2) Iwrapper functions which calls another function when exception
handling is on


>
> 2. These will generally be dominated by a counter update in the calling
>    function. We could insert a pass at some point that recognizes redundant
>    calls to instrprof.increment and combines them. This combine could
>    even keep the fidelity of all of the counters by creating extra
>    symbolic counters that map to multiple updates, if we're willing to
>    pay the space cost.

This sounds complicated. How can the pass determine which updates are
really redundant?  It does not know which are for true inline
instances and which are for potential out-of-line instances.

>
>>    * Non-optimal placement of the count updates;
>
> A lot of this has quite a bit of room for improvement in the frontend
> instrumentation. More on this below.
>
>>    * A third factor is related value profiling (to be turned on in the
>>      future). Small and hot callee functions taking function pointer
>>      (callbacks) can incur overhead due to indirect call target
>>      profiling.
>
> Can you elaborate on this? I can't see how a feature we don't have yet
> can contribute to the slowdown, and I don't really understand how the
> late instrumentation would help with whatever the hypothetical problem
> is.
>

The following code is not uncommon:

void callback(void);

template <typename CallBack_> void runner (CallBack_ cbk) {

   cbk(...);
}

void test()
{
     runner(callback);
}

If runner is instantiated with 'callback', there will be an trivially
eliminatable indirect calls in it -- this is similar to other CFG
cleanup that can be done with pre-cleanup passes.

Value Profiling is not in LLVM yet, but we know the work is pending.


>
>> If bar() should be always inlined, .L__llvm_profile_counters_bar(%rip) is
>> redundant -- the counter won't help downstream optimizations. On the other
>> hand, if bar() is a large function and may not be suitable to be inlined for
>> all callsites, this counter updated is necessary in order to produce more
>> accurate profile data for the out-of-line instance of the callee.
>>
>> If foo() is a hot function, the overhead of updating two counters can be
>> significant. This is especially bad for C++ program where there are tons of
>> small inline functions.
>
> On the other hand, unless foo is bar's only caller, you're losing
> information by dropping one of these counters. How do you determine when
> this transformation is safe?

That is the point of splitting inlining into 1) pre-profiling inlining
and 2) regular post profile inlining.  Pre-inlining does not require
profile data and is safe to do.   profile-use and profile-instr will
share the same pipeline, so what profile-use sees after pre-inlining
is a complete
picture of the out of inline instance's profile.  Note that regular
inlining can also inline functions after instrumentation is done, but
those inline copies will still update the original counters just like
the FE based instrumentation.

Note that the above serves the optimizer perfectly, but not necessary
satisfies coverage test's requirement which is FE instrumentation's
strength.

>
>> There is another missing opportunity in FE based instrumentation. The small
>> functions’ control flow can usually be simplified when they are inlined into
>> caller contexts. Once the control flow is simplified, many counter updates can
>> therefore be eliminated. This is only possible for a middle end based late
>> instrumenter. Defining a custom clean-up pass to remove redundant counter
>> update is unrealistic and cannot be done in a sane way without destroying the
>> profile integrity of neither the out-of-line nor inline instances of the
>> callee.
>
> Wait, but doing it this way destroys the profile integrity of one of
> those anyway - it isn't updating all of the counters. How is doing this
> in a separate pass any different?
>

See above, it is does not destroy profile integrity of the
out-of-inline instances. For inline instances, since their control
flows are now integrated with the caller's CFG, it allows a more
globally optimal edge selection and profile update placement.



>> The direction of the branch inside bar will be totally opposite between two
>> different callsites in ‘caller’. Without pre-inlining, the branch probability
>> will be 50-50 which will be useless for later optimizations. With
>> pre-inlining, the profile will have the perfect branch count for each
>> callsite. The positive performance impact of context sensitive profiling due
>> to pre-inlining has been observed in real world large C++ programs. Supporting
>> context sensitive profiling is another way to solve this, but it will
>> introduce large additional runtime/memory overhead.
>
> So this a pretty interesting transformation. Do you have any data on
> exactly how large the overhead of doing this with a context sensitive
> approach would be, or how much extra complexity would entail?

I would expect the cost of doing so is quite high -- it is not
uncommon to have some callees to have hundreds of static callsites, so
even one level context sensitivity can incur high cost. The caller
code also needs to setup the contexts in order for the callee to
update the right counter set. Profile-use also becomes more
complicated.


e data
>> racing issue is orthogonal to the problem we try to solve here.
>
> I suspect most of these heuristics can be applied to the frontend
> approach today. User annotations are visible to the frontend by
> definition, so we can pretty easily modify our counter placement based
> on those.

The static prediction logic needs to be duplicated in Clang or any
other FEs in order to support this.


thanks,

David

> Using previous profiles introduces a need to record somewhere
> why we made an "abnormal" decision, but that problem exists with the
> late instrumentation as well.
>
>> 2. High Level Design
>>
>> We propose to perform a pre-profile inline pass before the PGO instrumentation
>> pass. Since the instrumentation pass is after inine, it has to be done in the
>> middle-end.
>>
>> (1) The pre-inline pass
>> We will invoke a pre-inline pass before the instrumentation. When PGO is on,
>> the inlining will be split into two passes:
>>    * A pre-inline pass that is scheduled before the profile instrumentation/
>> annotation
>>    * A post-inline pass which is the regular inline pass after instrumentation
>> /annotation
>> By design, all beneficial callsites without requiring profile data should be
>> inlined in the pre-inline pass. It includes all callsites that will shrink
>> code size after inlining. All the remaining callsites will be left to the
>> regular inline pass when profile data is available.
>>
>> After pre-inline, a CFG based profile instrumentation/annotation will be done.
>> A minimum weight spanning tree (MST) in CFG is first computed, then only the
>> edges not in the MST will be instrumented. The counter update instructions are
>> placed in the basic blocks.
>>
>> (2) Minimum Spanning Tree (MST) based instrumentation
>> A native way of instrumentation is to insert a count update for every edge in
>> CFG which will result in  too many redundant updates that makes the runtime
>> very slow. Knuth [1] proposed a minimum spanning tree based method: given a
>> CFG, first compute a spanning tree. All edges that not in the MST will be
>> instrumented. In the profile use compilation, the counters are populated (from
>> the leaf of the spanning tree) to all the edges. Knuth proved this method
>> inserts the minimum number of instrumentations. MST based method only
>> guarantees the number static instrumentation are minimized, not the dynamic
>> instance of instrumentation. To reduce the number of dynamic instrumentation,
>> edges of potentially high counts will be put into MST first so that they will
>> have less chance to be instrumented.
>>
>> 3. Experimental Results
>>
>> 3.1 Measurement of the efficiency of instrumentation
>> Other than the runtime of the instrumented binaries, a more direct measurement
>> of the instrumentation overhead is the the sum of the raw profile count
>> values. Note that regardless what kind of instrumentations are used, the raw
>> profile count should be able to reconstruct all the edge count values for the
>> whole program. All raw profile value are obtained via incrementing the counter
>> variable value by one. The sum of the raw profile count value is roughly the
>> dynamic instruction count of the instrumented code. The lower of the value,
>> the more efficient of the instrumentation.
>>
>> 3.2 LLVM instrumentations runtime for SPEC2006 C/C++ programs and SPEC2K eon
>> The performance speedup is computed by (FE_instrumentation_runtime /
>> ME_instrumentation_runtime - 1)
>>
>> We run the experiments on all C/C++ programs in SPEC2006 and 252.eon from
>> SPEC2000. For C programs, except for one outlier 456.hmmer, there are small
>> ups and downs across different programs. Late instrumentation improves hmmer a
>> lot, but it is probably due to unrelated loop optimizations (90% of runtime
>> spent in one loop nest).
>>
>> For C++ programs, the performance impact of late instrumentation is very
>> large, which is as expected. The following table shows the result.  For some
>> C++ program, the time speedup is huge. For example, in  483.xalancbmk, late
>> instrumentation speeds up performance by ~60%.  Among all the SPEC C++
>> programs, only 444.namd is an outlier -- it uses a lot of macros and is a very
>> C like program.
>>
>> Program           Speedup
>> 471.omnetpp       16.03%
>> 473.astar          5.00%
>> 483.xalancbmk     58.57%
>> 444.namd          -0.90%
>> 447.dealII        60.47%
>> 450.soplex         8.20%
>> 453.povray        11.34%
>> 252.eon           35.33%
>> -------------------------
>> Geomean           21.01%
>>
>> 3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs
>> We also collect some statistic of the profiles generated by FE based
>> instrumentation and late instrumentation, namely, the following information:
>>    1. the number of functions that being instrumented,
>>    2. the result profile file size,
>>    3. the sum of raw count values that was mentioned earlier -- we used it to
>> measure the efficiency of the instrumentation.
>> Next table shows the ratios of the each metrics by late instrumentation for
>> the C++ programs, with FE based instrumentation as the base : column (1) shows
>> the ratios of instrumented functions; column (2) shows the ratios of the
>> profile file size; column (3) shows the ratios of the sum of raw count values.
>>
>>                 (1)       (2)       (3)
>> 471.omnetpp    85.36%   110.26%    46.52%
>> 473.astar      64.86%    72.72%    63.13%
>> 483.xalancbmk  51.83%    56.11%    35.77%
>> 444.namd       75.36%    82.82%    85.77%
>> 447.dealII     43.42%    46.46%    26.75%
>> 450.soplex     71.80%    87.54%    51.19%
>> 453.povray     78.68%    83.57%    64.37%
>> 252.eon        72.06%    91.22%    30.02%
>> ----------------------------------------
>> Geomean        66.50%    76.36%    47.01%
>>
>> For FE based instrumentation, profile count variables generated for the dead
>> functions will not be removed (like __llvm_prf_names, __llvm_prf_data,  and
>> __llvm_prf_cnts) from the data/text segment, nor in the result profile. There
>> is a recent patch that removes these unused data for COMDAT functions, but
>> that patch won’t touch regular functions. This is the main reason for the
>> larger number of instrumented functions and larger profile file size for the
>> FE based instrumentation. The reduction of the sum of raw count values is
>> mainly due to the elimination of redundant profile updates enabled by the
>> pre-inlining.
>>
>> For C programs, we observe similar improvement in the profile size (geomean
>> ratio of 73.75%) and smaller improvement in the number of instrumented
>> functions (geomean ratio of 87.49%) and the sum of raw count values (geomean
>> of 82.76%).
>>
>> 3.4 LLVM instrumentations runtime performance for Google internal C/C++
>> benchmarks
>>
>> We also use Google internal benchmarks (mostly typical C++ applications) to
>> measure the relative performance b/w FE based instrumentation and late
>> instrumentation.  The following table shows the speedup of late
>> instrumentation vs FE based instrumentation. Note that C++benchmark01 is a
>> very large multi-threaded C++ program. Late instrumentation sees 4x speedup.
>> Larger than 3x speedups are also seen in many other programs.
>>
>> C++_bencharmk01    416.98%
>> C++_bencharmk02      6.29%
>> C++_bencharmk03     22.39%
>> C++_bencharmk04     28.05%
>> C++_bencharmk05      2.00%
>> C++_bencharmk06    675.89%
>> C++_bencharmk07    359.19%
>> C++_bencharmk08    395.03%
>> C_bencharmk09       15.11%
>> C_bencharmk10        5.47%
>> C++_bencharmk11      5.73%
>> C++_bencharmk12      2.31%
>> C++_bencharmk13     87.73%
>> C++_bencharmk14      7.22%
>> C_bencharmk15       -0.51%
>> C++_bencharmk16     59.15%
>> C++_bencharmk17    358.82%
>> C++_bencharmk18    861.36%
>> C++_bencharmk19     29.62%
>> C++_bencharmk20     11.82%
>> C_bencharmk21        0.53%
>> C++_bencharmk22     43.10%
>> ---------------------------
>> Geomean             83.03%
>>
>> 3.5 Statistics of LLVM profiles for Google internal benchmarks
>>
>> The following shows the profile statics using Google internal benchmarks.
>>                          (1)       (2)       (3)
>> C++_bencharmk01         36.84%    40.29%     2.32%
>> C++_bencharmk02         39.20%    40.49%    42.39%
>> C++_bencharmk03         39.37%    40.65%    23.24%
>> C++_bencharmk04         39.13%    40.68%    17.70%
>> C++_bencharmk05         36.58%    38.27%    51.08%
>> C++_bencharmk06         29.50%    27.87%     2.87%
>> C++_bencharmk07         29.50%    27.87%     1.73%
>> C++_bencharmk08         29.50%    27.87%     4.17%
>> C_bencharmk09           53.95%    68.00%    11.18%
>> C_bencharmk10           53.95%    68.00%    31.74%
>> C++_bencharmk11         36.40%    37.07%    46.12%
>> C++_bencharmk12         38.44%    41.90%    73.59%
>> C++_bencharmk13         39.28%    42.72%    29.56%
>> C++_bencharmk14         38.59%    42.20%    13.42%
>> C_bencharmk15           57.45%    48.50%    66.99%
>> C++_bencharmk16         36.86%    42.18%    16.53%
>> C++_bencharmk17         37.82%    39.77%    13.68%
>> C++_bencharmk18         37.82%    39.77%     7.96%
>> C++_bencharmk19         37.52%    40.46%     1.85%
>> C++_bencharmk20         32.37%    30.44%    19.69%
>> C_bencharmk21           37.63%    40.42%    88.81%
>> C++_bencharmk22         36.28%    36.92%    21.62%
>> --------------------------------------------------
>> Geomean                 38.22%    39.96%    15.58%
>>
>> 4. Implementation Details:
>>
>> We need to add new option(s) for the alternative PGO instrumentation pass in
>> the middle end. It can in one of the following forms:
>>
>>    (1) Complete new options that are on par with current PGO options:
>> -fprofile-late-instr-generate[=<profile_file>]? for PGO Instrumentation, and
>> -fprofile-late-instr-use[=<profile_file>]? for PGO USE.
>>    (2) Or, late instrumentation can be turned on with an additional option
>> -fprofile-instr-late with current PGO options. I. e. -fprofile-instr-late
>> -fprofile-instr-generate[=<profile_file>]? for PGO instrumentation, and
>> -fprofile-instr-late -fprofile-instr-use[=<profile_file>]? for PGO use.
>>    (3) Alternatively to (2), only keep -fprofile-instr-late option in PGO
>> instrumentation. Adding a magic tag in profile so that FE based profile and
>> late instrumented profile can be automatically detected by profile loader In
>> PGO use compilation. This requires a slight profile format change.
>>
>> In our prototype implementation, two new passes are added in the beginning of
>> PassManagerBuilder::populateModulePassManager(), namely PreProfileInlinerPass
>> and PGOLateInstrumentationPass.
>>
>> 4.1 Pre-inline pass:
>>
>> It is controlled by back-end option "-preinline" and "-disable-preinline". If
>> the user specifies any llvm option of "-fprofile-late-instr-{generate|use},
>> option "-mllvm -preinline" will be automatically inserted in the driver.. To
>> disable the pre-inliner when late instrumentation is enabled, use option
>> "-mllvm -disable-preinline".
>>
>> For now, only minimum tuning is done for the pre-inliner, which simply adjusts
>> the inline threshold: If -Oz is specified, the threshold is set to 25.
>> Otherwise, it is 75.
>>
>> The following clean up passes are added to PassManager, right after the
>> PreProfileInline pass:
>>   createEarlyCSEPass()
>>   createJumpThreadingPass()
>>   createCorrelatedValuePropagationPass()
>>   createCFGSimplificationPass()
>>   createInstructionCombiningPass()
>>   createGVNPass(DisableGVNLoadPRE)
>>   createPeepholePASS()
>> Some of them might not be necessary.
>>
>> 4.2 Late Instrumentation Pass:
>> The late instrumentation is right after the pre-inline pass and it's cleanup
>> passes. It is controlled by opt option "-pgo-late-instr-gen" and
>> "-pgo-late-instr-use". For "-pgo-late-instr-use" option, the driver will
>> provide the profile name.
>> For "-pgo-late-instr-gen", a pass that calls createInstrProfilingPass() is
>> also added to PassManager to lower the instrumentation intrinsics
>>
>> PGOLateInstrumeatnion is a module pass that applies the instrumentation to
>> each function by class PGOLateInstrumentationFunc. For each function, perform
>> the following steps:
>>    1. First collect all the CFG edges. Assign an estimated weight to each
>> edge. Critical edges and back-edges are assigned to high value of weights. One
>> fake node and a few fake edges (from the fake node to the entry node, and from
>> all the exit nodes to the fake node) are also added to the worklist.
>>    2. Construct the MST. The edges with the higher weight will be put to MST
>> first, unless it forms a cycle.
>>    3. Traverse the CFG and compute the CFG hash using CRC32 of the index of
>> each BB.
>> The above three steps are the same for profile-generate and profile-use
>> compilation.
>>
>> In the next step, for profile-generation compilation, all the edges that not
>> in the MST are instrumented. If this is a critical edge, split the edge first.
>> The actual instrumentation is to generate Intrinsic::instrprof_increment() in
>> the instrumented BB. This intrinsic will be lowed by pass
>> createInstrProfilingPass().
>>
>> In the next step, for profile-generation compilation, all the edges that not
>> in the MST are instrumented. If this is a critical edge, split the edge first.
>> The actual instrumentation is to generate Intrinsic::instrprof_increment() in
>> the instrumented BB. This intrinsic will be lowed by pass
>> createInstrProfilingPass().
>>
>> For -fprofile-use compilation, first read in the counters and the CFG hash
>> from the profile file. If the CFG hash matches, populate the counters to all
>> the edges in reverse topological order of the MST. Once having all the edge
>> counts, set the branch weights metadata for the IR having multiple branches.
>> Also apply the cold/hot function attributes based on function level counts.
>>
>> 4.3 Profile Format:
>>
>> The late instrumentation profile is mostly the same as the one from front-end
>> instrument-ion. The difference is
>>    * Function checksums are different.
>>    * Function entry counts are no longer available.
>> For llvm-profdata utility, options -lateinstr needs to be used to
>> differentiate FE based and late instrumentation profiles, unless a magic tag
>> is added to the profile.
>
> If we do this, we'll probably want to add something to the file to
> differentiate the profiles, otherwise the error messages will be pretty
> bad.
>
>> 5. References:
>> [1] Donald E. Knuth, Francis R. Stevenson. Optimal measurement of points for
>> program frequency counts. BIT Numerical Mathematics 1973, Volume 13, Issue 3,
>> pp 313-322
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org         http://llvm.cs.uiuc.edu
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


More information about the llvm-dev mailing list