[llvm-dev] RFC: PGO Late instrumentation for LLVM

Fri Aug 7 22:56:05 PDT 2015

Accidentally sent to uiuc server.

On Fri, Aug 7, 2015 at 10:49 PM, Sean Silva <chisophugis at gmail.com> wrote:

> Can you compare your results with another approach: simply do not
> instrument the top 1% hottest functions (by function entry count)? If this
> simple approach provides most of the benefits (my measurements on one
> codebase I tested show that it would eliminate over 97% of total function
> counts), we may be able to use a simpler approach.
>
> The biggest thing I notice about this proposal is that although the focus
> of the proposal is on reducing profiling overhead, it requires using a
> different pass pipeline during *both* the "instr-generate" compilation and
> the "instr-use" compilation. Therefore it is more than simply a reduction
> in profiling overhead -- it can also change performance even ignoring
> adding the profiling information in the IR. I think it would be interesting
> to test the modified pass pipeline even in the absence of profile
> information to better understand its effects in isolation (for example,
> maybe it would be good to add these passes during -O3 regardless of whether
> we are doing PGO).
>
> -- Sean Silva
>
> On Fri, Aug 7, 2015 at 4:54 PM, Rong Xu <xur at google.com> wrote:
>
>> Instrumentation based Profile Guided Optimization (PGO) is a compiler
>> technique that leverages important program runtime information, such as
>> precise edge counts and frequent value information, to make frequently
>> executed code run faster. It's proven to be one of the most effective ways
>> to improve program performance.
>>
>> An important design point of PGO is to decide where to place the
>> instrumentation. In current LLVM PGO, the instrumentation is done in the
>> Clang Front-End (will be referred as FE based instrumentation). Doing this
>> early in the front-end gives better coverage information as the there are
>> more precise line number information. The resulted profile also has
>> relatively high tolerance to compiler changes. All the compiler change
>> after the instrumentation point will not lead to mismatched IR that
>> invalidates the profile.
>>
>> On the other hand, doing this too early gives the compiler fewer
>> opportunities to optimize the code before instrumentation. It has
>> significant impact on instrumentation runtime performance. In addition, it
>> tends to produce larger binary and  profile data file.  Our internal C++
>> benchmarks shows that FE based instrumentation degrades the performance
>> (compared to non-instrumented version) by 58.3%, and in some extreme cases,
>> the application speed/throughput decreases by 95% (21x runtime slowdown).
>>
>> Running the instrumented binary too slow is not desirable in PGO for the
>> following reasons:
>>    * This slows down already lengthy build time. In the worst case, the
>> instrumented binary is so slow that it fails to run a representative
>> workload, because slow execution can leads to more time-outs in many server
>> programs. Other typical issues include: text size too big, failure to link
>> instrumented binaries, memory usage exceeding system limits.
>>    * Slow runtime affects the program behavior. Real applications
>> sometimes monitor the program runtime and have different execution path
>> when the program run too slow. This would defeat the underlying assumption
>> of PGO and make it less effective.
>>
>> This work proposes an option to turn on middle-end based instrumentation
>> (new) that aims to speed up instrumentation runtime. The new
>> instrumentation is referred to as ME based instrumentation in this
>> document. Our experimental results show that ME instrumentation can speed
>> up the instrumentation by 80% on average for typical C++ programs. Here are
>> the two main design objectives:
>>    * Co-existing with FE Instrumenter: We do not propose to replace the
>> FE based instrumentation because FE based instrumentation has its
>> advantages and applications. User can choose which phase to do
>> instrumentation via command line options.
>>    * PGO Runtime Support Sharing: The ME instrumenter will completely
>> re-use the existing PGO’s runtime support.
>>
>> 1. FE Based Instrumentation Runtime Overhead Analysis
>>
>> Instrumented binaries are expected to run slower due to instrumentation
>> code inserted. With FE based instrumentation, the overhead is especially
>> high and runtime slowdown can be unacceptable in many cases.  Further
>> analysis shows that there are 3  important factors  contributing to FE
>> instrumentation slowdown :
>>    * [Main] Redundant counter updates of inlined functions. C++ programs
>> can introduce large abstraction penalties by using lots of small inline
>> functions (assignment operators, getters, setters, ctors/dtors etc).
>> Overhead of instrumenting those small functions can be very large, making
>> training runs too slow and in some cases to usable;
>>    * Non-optimal placement of the count updates;
>>    * A third factor is related value profiling (to be turned on in the
>> future). Small and hot callee functions taking function pointer (callbacks)
>> can incur  overhead due to indirect call target profiling.
>>
>>
>> 1.1 Redundant Counter Update
>>
>> If checking the assembly of the instrumented binary generated by current
>> LLVM implementation, we can find many sequence of consecutive 'incq'
>> instructions that updating difference counters in the same basic block. As
>> an example that extracted from real binary:
>>   ...
>>  incq   0xa91d80(%rip)        # 14df4b8
>> <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8>
>>  incq   0xa79011(%rip)        # 14c6750
>> <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm>
>>  incq   0xa79442(%rip)        # 14c6b88
>> <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv>
>>  incq   0x9c288b(%rip)        # 140ffd8
>> <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl>
>>  ...
>>
>> From profile use point of view, many of these counter update are
>> redundant. Considering the following example:
>> void bar(){
>>  sum++;
>> }
>> void foo() {
>>  bar();
>> }
>>
>> FE based instrumentation needs to insert counter update for the only BB
>> of the bar().
>> bar:                                    # @bar
>> # BB#0:                                 # %entry
>>        incq    .L__llvm_profile_counters_bar(%rip)
>>        incl    sum(%rip)
>>        retq
>>
>> It also need to insert the update the BB in function foo().  After
>> inlining bar to foo(), the code is:
>> foo:                                    # @foo
>> # BB#0:                                 # %entry
>>        incq    .L__llvm_profile_counters_foo(%rip)
>>        incq    .L__llvm_profile_counters_bar(%rip)
>>        incl    sum(%rip)
>>        retq
>>
>> If bar() should be always inlined, .L__llvm_profile_counters_bar(%rip) is
>> redundant -- the counter won't help downstream optimizations. On the other
>> hand, if bar() is a large function and may not be suitable to be inlined
>> for all callsites, this counter updated is necessary in order to produce
>> more accurate profile data for the out-of-line instance of the callee.
>>
>> If foo() is a hot function, the overhead of updating two counters can be
>> significant. This is especially bad for C++ program where there are tons of
>> small inline functions.
>>
>> There is another missing opportunity in FE based instrumentation. The
>> small functions’ control flow can usually be simplified when they are
>> inlined into caller contexts. Once the control flow is simplified, many
>> counter updates can therefore be eliminated. This is only possible for a
>> middle end based late instrumenter. Defining a custom clean-up pass to
>> remove redundant counter update is unrealistic and cannot be done in a sane
>> way without destroying the profile integrity of neither the out-of-line nor
>> inline instances of the callee.
>>
>> A much simpler and cleaner solution is to do a pre-inline pass to inline
>> all the trivial inlines before instrumentation.  In addition to removing
>> the unnecessary count updates for the inline instances,  another advantage
>> of pre-inline is to  provide context sensitive profile for these small
>> inlined functions. This context senstive profile can further improve the
>> PGO based optimizations. Here is a contrived example:
>> void bar (int n) {
>>   if (n&1)
>>     do_sth1();
>>   else
>>     do_sth2();
>> }
>>
>> void caller() {
>>   int s = 1;
>>   for (; s<100; s+=2)
>>     bar(s);
>>
>>   for (s = 102; s< 200; s+=2)
>>     bar(s);
>> }
>>
>> The direction of the branch inside bar will be totally opposite between
>> two different callsites in ‘caller’. Without pre-inlining, the branch
>> probability will be 50-50 which will be useless for later optimizations.
>> With pre-inlining, the profile will have the perfect branch count for each
>> callsite. The positive performance impact of context sensitive profiling
>> due to pre-inlining has been observed in real world large C++ programs.
>> Supporting context sensitive profiling is another way to solve this, but it
>> will introduce large additional runtime/memory overhead.
>>
>>
>> 1.2 Non-optimal placement of count update
>>
>> Another much smaller showdown factor is the placement of the counter
>> updates. Current front-end based instrumentation applies the
>> instrumentation to each front-end lexical construct. It also minimizes the
>> number of static instrumentations. Note that it always instruments the
>> entry count of the CFG. This may result in higher dynamic instruction
>> counts. For example,
>>      BB0
>>      | 100
>>     BB1
>> 90 /   \ 10
>>   BB2  BB3
>> 90 \   / 10
>>     BB4
>> Like the the above example, FE based instrumentation will always insert
>> count update in BB0.  The dynamic instrumentation count will be either 110
>> (Instrument bb0->bb1 and bb1->bb2) or 190 (bb0->bb1 and bb1->bb3). A better
>> instrumentation is to instrument (bb1->bb2 and bb1->bb3) where the dynamic
>> instrumentation count is 100.
>>
>> Our experimental shows that the optimal placement based on edge hotness
>> can improve instrumented code performance by about 10%.  While it’s hard to
>> find the optimal placement of count update,  compiler heuristics can be
>> used the get the better placement. These heuristics  can be based on static
>> profile prediction or user annotations (like __buildin_expect) to  estimate
>> the relative edge hotness and put instrumentations on the less hot edges.
>> The initial late instrumentation has not fully implemented this placement
>> strategy yet.  With that implemented, we expect even better results than
>> what is reported here. For real world programs, another major source of the
>> slowdown is the data racing and false sharing of the counter update for
>> highly threaded programs. Pre-inlining can alleviate this problem as the
>> counters in the inline instances are not longer shared. But the complete
>> solution to the data racing issue is orthogonal to the problem we try to
>> solve here.
>>
>>
>> 2. High Level Design
>>
>> We propose to perform a pre-profile inline pass before the PGO
>> instrumentation pass. Since the instrumentation pass is after inine, it has
>> to be done in the middle-end.
>>
>> (1) The pre-inline pass
>> We will invoke a pre-inline pass before the instrumentation. When PGO is
>> on, the inlining will be split into two passes:
>>    * A pre-inline pass that is scheduled before the profile
>> instrumentation/annotation
>>    * A post-inline pass which is the regular inline pass after
>> instrumentation/annotation
>> By design, all beneficial callsites without requiring profile data should
>> be inlined in the pre-inline pass. It includes all callsites that will
>> shrink code size after inlining. All the remaining callsites will be left
>> to the regular inline pass when profile data is available.
>>
>> After pre-inline, a CFG based profile instrumentation/annotation will be
>> done. A minimum weight spanning tree (MST) in CFG is first computed, then
>> only the edges not in the MST will be instrumented. The counter update
>> instructions are placed in the basic blocks.
>>
>> (2) Minimum Spanning Tree (MST) based instrumentation
>> A native way of instrumentation is to insert a count update for every
>> edge in CFG which will result in  too many redundant updates that makes the
>> runtime very slow. Knuth [1] proposed a minimum spanning tree based method:
>> given a CFG, first compute a spanning tree. All edges that not in the MST
>> will be instrumented. In the profile use compilation, the counters are
>> populated (from the leaf of the spanning tree) to all the edges. Knuth
>> proved this method inserts the minimum number of instrumentations. MST
>> based method only guarantees the number static instrumentation are
>> minimized, not the dynamic instance of instrumentation. To reduce the
>> number of dynamic instrumentation, edges of potentially high counts will be
>> put into MST first so that they will have less chance to be instrumented.
>>
>>
>> 3. Experimental Results
>>
>> 3.1 Measurement of the efficiency of instrumentation
>> Other than the runtime of the instrumented binaries, a more direct
>> measurement of the instrumentation overhead is the the sum of the raw
>> profile count values. Note that regardless what kind of instrumentations
>> are used, the raw profile count should be able to reconstruct all the edge
>> count values for the whole program. All raw profile value are obtained via
>> incrementing the counter variable value by one. The sum of the raw profile
>> count value is roughly the dynamic instruction count of the instrumented
>> code. The lower of the value, the more efficient of the instrumentation.
>>
>>
>> 3.2 LLVM instrumentations runtime for SPEC2006 C/C++ programs and SPEC2K
>> eon
>> The performance speedup is computed by (FE_instrumentation_runtime /
>> ME_instrumentation_runtime - 1)
>>
>> We run the experiments on all C/C++ programs in SPEC2006 and 252.eon from
>> SPEC2000. For C programs, except for one outlier 456.hmmer, there are small
>> ups and downs across different programs. Late instrumentation improves
>> hmmer a lot, but it is probably due to unrelated loop optimizations (90% of
>> runtime spent in one loop nest).
>>
>> For C++ programs, the performance impact of late instrumentation is very
>> large, which is as expected. The following table shows the result.  For
>> some C++ program, the time speedup is huge. For example, in  483.xalancbmk,
>> late instrumentation speeds up performance by ~60%.  Among all the SPEC C++
>> programs, only 444.namd is an outlier -- it uses a lot of macros and is a
>> very C like program.
>>
>> Program           Speedup
>> 471.omnetpp       16.03%
>> 473.astar          5.00%
>> 483.xalancbmk     58.57%
>> 444.namd          -0.90%
>> 447.dealII        60.47%
>> 450.soplex         8.20%
>> 453.povray        11.34%
>> 252.eon           35.33%
>> -------------------------
>> Geomean           21.01%
>>
>> 3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs
>> We also collect some statistic of the profiles generated by FE based
>> instrumentation and late instrumentation, namely, the following information:
>>    1. the number of functions that being instrumented,
>>    2. the result profile file size,
>>    3. the sum of raw count values that was mentioned earlier -- we used
>> it to measure the efficiency of the instrumentation.
>> Next table shows the ratios of the each metrics by late instrumentation
>> for the C++ programs, with FE based instrumentation as the base : column
>> (1) shows the ratios of instrumented functions; column (2) shows the ratios
>> of the profile file size; column (3) shows the ratios of the sum of raw
>> count values.
>>
>>                 (1)       (2)       (3)
>> 471.omnetpp    85.36%   110.26%    46.52%
>> 473.astar      64.86%    72.72%    63.13%
>> 483.xalancbmk  51.83%    56.11%    35.77%
>> 444.namd       75.36%    82.82%    85.77%
>> 447.dealII     43.42%    46.46%    26.75%
>> 450.soplex     71.80%    87.54%    51.19%
>> 453.povray     78.68%    83.57%    64.37%
>> 252.eon        72.06%    91.22%    30.02%
>> ----------------------------------------
>> Geomean        66.50%    76.36%    47.01%
>>
>>
>> For FE based instrumentation, profile count variables generated for the
>> dead functions will not be removed (like __llvm_prf_names, __llvm_prf_data,
>>  and __llvm_prf_cnts) from the data/text segment, nor in the result
>> profile. There is a recent patch that removes these unused data for COMDAT
>> functions, but that patch won’t touch regular functions. This is the main
>> reason for the larger number of instrumented functions and larger profile
>> file size for the FE based instrumentation. The reduction of the sum of raw
>> count values is mainly due to the elimination of redundant profile updates
>> enabled by the pre-inlining.
>>
>> For C programs, we observe similar improvement in the profile size
>> (geomean ratio of 73.75%) and smaller improvement in the number of
>> instrumented functions (geomean ratio of 87.49%) and the sum of raw count
>> values (geomean of 82.76%).
>>
>>
>> 3.4 LLVM instrumentations runtime performance for Google internal C/C++
>> benchmarks
>>
>> We also use Google internal benchmarks (mostly typical C++ applications)
>> to measure the relative performance b/w FE based instrumentation and late
>> instrumentation.  The following table shows the speedup of late
>> instrumentation vs FE based instrumentation. Note that C++benchmark01 is a
>> very large multi-threaded C++ program. Late instrumentation sees 4x
>> speedup. Larger than 3x speedups are also seen in many other programs.
>>
>> C++_bencharmk01    416.98%
>> C++_bencharmk02      6.29%
>> C++_bencharmk03     22.39%
>> C++_bencharmk04     28.05%
>> C++_bencharmk05      2.00%
>> C++_bencharmk06    675.89%
>> C++_bencharmk07    359.19%
>> C++_bencharmk08    395.03%
>> C_bencharmk09       15.11%
>> C_bencharmk10        5.47%
>> C++_bencharmk11      5.73%
>> C++_bencharmk12      2.31%
>> C++_bencharmk13     87.73%
>> C++_bencharmk14      7.22%
>> C_bencharmk15       -0.51%
>> C++_bencharmk16     59.15%
>> C++_bencharmk17    358.82%
>> C++_bencharmk18    861.36%
>> C++_bencharmk19     29.62%
>> C++_bencharmk20     11.82%
>> C_bencharmk21        0.53%
>> C++_bencharmk22     43.10%
>> ---------------------------
>> Geomean             83.03%
>>
>>
>> 3.5 Statistics of LLVM profiles for Google internal benchmarks
>>
>> The following shows the profile statics using Google internal benchmarks.
>>                          (1)       (2)       (3)
>> C++_bencharmk01         36.84%    40.29%     2.32%
>> C++_bencharmk02         39.20%    40.49%    42.39%
>> C++_bencharmk03         39.37%    40.65%    23.24%
>> C++_bencharmk04         39.13%    40.68%    17.70%
>> C++_bencharmk05         36.58%    38.27%    51.08%
>> C++_bencharmk06         29.50%    27.87%     2.87%
>> C++_bencharmk07         29.50%    27.87%     1.73%
>> C++_bencharmk08         29.50%    27.87%     4.17%
>> C_bencharmk09           53.95%    68.00%    11.18%
>> C_bencharmk10           53.95%    68.00%    31.74%
>> C++_bencharmk11         36.40%    37.07%    46.12%
>> C++_bencharmk12         38.44%    41.90%    73.59%
>> C++_bencharmk13         39.28%    42.72%    29.56%
>> C++_bencharmk14         38.59%    42.20%    13.42%
>> C_bencharmk15           57.45%    48.50%    66.99%
>> C++_bencharmk16         36.86%    42.18%    16.53%
>> C++_bencharmk17         37.82%    39.77%    13.68%
>> C++_bencharmk18         37.82%    39.77%     7.96%
>> C++_bencharmk19         37.52%    40.46%     1.85%
>> C++_bencharmk20         32.37%    30.44%    19.69%
>> C_bencharmk21           37.63%    40.42%    88.81%
>> C++_bencharmk22         36.28%    36.92%    21.62%
>> --------------------------------------------------
>> Geomean                 38.22%    39.96%    15.58%
>>
>>
>> 4. Implementation Details:
>>
>> We need to add new option(s) for the alternative PGO instrumentation pass
>> in the middle end. It can in one of the following forms:
>>
>>    (1) Complete new options that are on par with current PGO options:
>> -fprofile-late-instr-generate[=<profile_file>]? for PGO Instrumentation,
>> and -fprofile-late-instr-use[=<profile_file>]? for PGO USE.
>>    (2) Or, late instrumentation can be turned on with an additional
>> option -fprofile-instr-late with current PGO options. I. e.
>> -fprofile-instr-late -fprofile-instr-generate[=<profile_file>]? for PGO
>> instrumentation, and -fprofile-instr-late
>> -fprofile-instr-use[=<profile_file>]? for PGO use.
>>    (3) Alternatively to (2), only keep -fprofile-instr-late option in PGO
>> instrumentation. Adding a magic tag in profile so that FE based profile and
>> late instrumented profile can be automatically detected by profile loader
>> In PGO use compilation. This requires a slight profile format change.
>>
>> In our prototype implementation, two new passes are added in the
>> beginning of PassManagerBuilder::populateModulePassManager(), namely
>> PreProfileInlinerPass and PGOLateInstrumentationPass.
>>
>>
>> 4.1 Pre-inline pass:
>>
>> It is controlled by back-end option "-preinline" and
>> "-disable-preinline". If the user specifies any llvm option of
>> "-fprofile-late-instr-{generate|use}, option "-mllvm -preinline" will be
>> automatically inserted in the driver.. To disable the pre-inliner when late
>> instrumentation is enabled, use option "-mllvm -disable-preinline".
>>
>> For now, only minimum tuning is done for the pre-inliner, which simply
>> adjusts the inline threshold: If -Oz is specified, the threshold is set to
>> 25. Otherwise, it is 75.
>>
>> The following clean up passes are added to PassManager, right after the
>> PreProfileInline pass:
>>   createEarlyCSEPass()
>>   createJumpThreadingPass()
>>   createCorrelatedValuePropagationPass()
>>   createCFGSimplificationPass()
>>   createInstructionCombiningPass()
>>   createGVNPass(DisableGVNLoadPRE)
>>   createPeepholePASS()
>> Some of them might not be necessary.
>>
>> 4.2 Late Instrumentation Pass:
>> The late instrumentation is right after the pre-inline pass and it's
>> cleanup passes. It is controlled by opt option "-pgo-late-instr-gen" and
>> "-pgo-late-instr-use". For "-pgo-late-instr-use" option, the driver will
>> provide the profile name.
>> For "-pgo-late-instr-gen", a pass that calls createInstrProfilingPass()
>> is also added to PassManager to lower the instrumentation intrinsics
>>
>> PGOLateInstrumeatnion is a module pass that applies the instrumentation
>> to each function by class PGOLateInstrumentationFunc. For each function,
>> perform the following steps:
>>    1. First collect all the CFG edges. Assign an estimated weight to each
>> edge. Critical edges and back-edges are assigned to high value of weights.
>> One fake node and a few fake edges (from the fake node to the entry node,
>> and from all the exit nodes to the fake node) are also added to the
>> worklist.
>>    2. Construct the MST. The edges with the higher weight will be put to
>> MST first, unless it forms a cycle.
>>    3. Traverse the CFG and compute the CFG hash using CRC32 of the index
>> of each BB.
>> The above three steps are the same for profile-generate and profile-use
>> compilation.
>>
>> In the next step, for profile-generation compilation, all the edges that
>> not in the MST are instrumented. If this is a critical edge, split the edge
>> first. The actual instrumentation is to generate
>> Intrinsic::instrprof_increment() in the instrumented BB. This intrinsic
>> will be lowed by pass createInstrProfilingPass().
>>
>> In the next step, for profile-generation compilation, all the edges that
>> not in the MST are instrumented. If this is a critical edge, split the edge
>> first. The actual instrumentation is to generate
>> Intrinsic::instrprof_increment() in the instrumented BB. This intrinsic
>> will be lowed by pass createInstrProfilingPass().
>>
>> For -fprofile-use compilation, first read in the counters and the CFG
>> hash from the profile file. If the CFG hash matches, populate the counters
>> to all the edges in reverse topological order of the MST. Once having all
>> the edge counts, set the branch weights metadata for the IR having multiple
>> branches. Also apply the cold/hot function attributes based on function
>> level counts.
>>
>>
>> 4.3 Profile Format:
>>
>> The late instrumentation profile is mostly the same as the one from
>> front-end instrument-ion. The difference is
>>    * Function checksums are different.
>>    * Function entry counts are no longer available.
>> For llvm-profdata utility, options -lateinstr needs to be used to
>> differentiate FE based and late instrumentation profiles, unless a magic
>> tag is added to the profile.
>>
>>
>> 5. References:
>> [1] Donald E. Knuth, Francis R. Stevenson. Optimal measurement of points
>> for program frequency counts. BIT Numerical Mathematics 1973, Volume 13,
>> Issue 3, pp 313-322
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150807/0104ecd7/attachment-0001.html>