[llvm-dev] RFC: PGO Late instrumentation for LLVM
Sean Silva via llvm-dev
llvm-dev at lists.llvm.org
Fri Aug 7 22:56:05 PDT 2015
Accidentally sent to uiuc server.
On Fri, Aug 7, 2015 at 10:49 PM, Sean Silva <chisophugis at gmail.com> wrote:
> Can you compare your results with another approach: simply do not
> instrument the top 1% hottest functions (by function entry count)? If this
> simple approach provides most of the benefits (my measurements on one
> codebase I tested show that it would eliminate over 97% of total function
> counts), we may be able to use a simpler approach.
>
> The biggest thing I notice about this proposal is that although the focus
> of the proposal is on reducing profiling overhead, it requires using a
> different pass pipeline during *both* the "instr-generate" compilation and
> the "instr-use" compilation. Therefore it is more than simply a reduction
> in profiling overhead -- it can also change performance even ignoring
> adding the profiling information in the IR. I think it would be interesting
> to test the modified pass pipeline even in the absence of profile
> information to better understand its effects in isolation (for example,
> maybe it would be good to add these passes during -O3 regardless of whether
> we are doing PGO).
>
> -- Sean Silva
>
> On Fri, Aug 7, 2015 at 4:54 PM, Rong Xu <xur at google.com> wrote:
>
>> Instrumentation based Profile Guided Optimization (PGO) is a compiler
>> technique that leverages important program runtime information, such as
>> precise edge counts and frequent value information, to make frequently
>> executed code run faster. It's proven to be one of the most effective ways
>> to improve program performance.
>>
>> An important design point of PGO is deciding where to place the
>> instrumentation. In current LLVM PGO, the instrumentation is done in the
>> Clang front-end (referred to below as FE based instrumentation). Doing
>> this early in the front-end gives better coverage information, as more
>> precise line number information is available. The resulting profile also
>> has relatively high tolerance to compiler changes: compiler changes after
>> the instrumentation point will not lead to mismatched IR that invalidates
>> the profile.
>>
>> On the other hand, doing this too early gives the compiler fewer
>> opportunities to optimize the code before instrumentation, which has a
>> significant impact on instrumentation runtime performance. In addition, it
>> tends to produce a larger binary and profile data file. Our internal C++
>> benchmarks show that FE based instrumentation degrades performance
>> (compared to the non-instrumented version) by 58.3%, and in some extreme
>> cases the application speed/throughput decreases by 95% (a 21x runtime
>> slowdown).
>>
>> Running the instrumented binary too slowly is undesirable in PGO for the
>> following reasons:
>> * It slows down an already lengthy build. In the worst case, the
>> instrumented binary is so slow that it fails to run a representative
>> workload, because slow execution can lead to more time-outs in many server
>> programs. Other typical issues include: text size too big, failure to link
>> instrumented binaries, and memory usage exceeding system limits.
>> * Slow runtime affects program behavior. Real applications sometimes
>> monitor the program runtime and take different execution paths when the
>> program runs too slowly. This defeats the underlying assumption of PGO and
>> makes it less effective.
>>
>> This work proposes an option to turn on a new middle-end based
>> instrumentation that aims to speed up instrumentation runtime. The new
>> instrumentation is referred to as ME based instrumentation in this
>> document. Our experimental results show that ME instrumentation can speed
>> up instrumentation runtime by 80% on average for typical C++ programs.
>> Here are the two main design objectives:
>> * Co-exist with the FE instrumenter: we do not propose to replace FE
>> based instrumentation, because it has its own advantages and applications.
>> Users can choose the instrumentation phase via command line options.
>> * Share the PGO runtime support: the ME instrumenter will completely
>> re-use the existing PGO runtime support.
>>
>> 1. FE Based Instrumentation Runtime Overhead Analysis
>>
>> Instrumented binaries are expected to run slower due to the inserted
>> instrumentation code. With FE based instrumentation, the overhead is
>> especially high, and the runtime slowdown can be unacceptable in many
>> cases. Further analysis shows that there are 3 important factors
>> contributing to the FE instrumentation slowdown:
>> * [Main] Redundant counter updates of inlined functions. C++ programs
>> can introduce large abstraction penalties by using lots of small inline
>> functions (assignment operators, getters, setters, ctors/dtors, etc.).
>> The overhead of instrumenting those small functions can be very large,
>> making training runs too slow and in some cases not usable;
>> * Non-optimal placement of the count updates;
>> * A third factor is related to value profiling (to be turned on in the
>> future). Small, hot callee functions taking function pointers (callbacks)
>> can incur overhead due to indirect call target profiling.
>>
>>
>> 1.1 Redundant Counter Update
>>
>> Examining the assembly of an instrumented binary generated by the
>> current LLVM implementation, we find many sequences of consecutive 'incq'
>> instructions updating different counters in the same basic block. Here is
>> an example extracted from a real binary:
>> ...
>> incq 0xa91d80(%rip) # 14df4b8
>> <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8>
>> incq 0xa79011(%rip) # 14c6750
>> <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm>
>> incq 0xa79442(%rip) # 14c6b88
>> <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv>
>> incq 0x9c288b(%rip) # 140ffd8
>> <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl>
>> ...
>>
>> From the profile-use point of view, many of these counter updates are
>> redundant. Consider the following example:
>> void bar() {
>>   sum++;
>> }
>> void foo() {
>>   bar();
>> }
>>
>> FE based instrumentation inserts a counter update for the only BB of
>> bar():
>> bar: # @bar
>> # BB#0: # %entry
>> incq .L__llvm_profile_counters_bar(%rip)
>> incl sum(%rip)
>> retq
>>
>> It also needs to insert an update for the BB in function foo(). After
>> inlining bar() into foo(), the code is:
>> foo: # @foo
>> # BB#0: # %entry
>> incq .L__llvm_profile_counters_foo(%rip)
>> incq .L__llvm_profile_counters_bar(%rip)
>> incl sum(%rip)
>> retq
>>
>> If bar() is always inlined, .L__llvm_profile_counters_bar(%rip) is
>> redundant -- the counter won't help downstream optimizations. On the other
>> hand, if bar() is a large function that may not be suitable for inlining
>> at all callsites, this counter update is necessary in order to produce
>> more accurate profile data for the out-of-line instance of the callee.
>>
>> If foo() is a hot function, the overhead of updating two counters can be
>> significant. This is especially bad for C++ programs, where there are tons
>> of small inline functions.
>>
>> There is another missed opportunity in FE based instrumentation: a small
>> function's control flow can usually be simplified when it is inlined into
>> the caller's context. Once the control flow is simplified, many counter
>> updates can be eliminated. This is only possible for a middle-end based
>> late instrumenter. Defining a custom clean-up pass to remove redundant
>> counter updates is unrealistic and cannot be done in a sane way without
>> destroying the profile integrity of either the out-of-line or the inline
>> instances of the callee.
>>
>> A much simpler and cleaner solution is to run a pre-inline pass that
>> inlines all the trivial inlines before instrumentation. In addition to
>> removing the unnecessary count updates for the inline instances, another
>> advantage of pre-inlining is that it provides context sensitive profiles
>> for these small inlined functions. The context sensitive profile can
>> further improve PGO based optimizations. Here is a contrived example:
>> void bar(int n) {
>>   if (n & 1)
>>     do_sth1();
>>   else
>>     do_sth2();
>> }
>>
>> void caller() {
>>   int s = 1;
>>   for (; s < 100; s += 2)
>>     bar(s);
>>
>>   for (s = 102; s < 200; s += 2)
>>     bar(s);
>> }
>>
>> The direction of the branch inside bar() will be completely opposite at
>> the two different callsites in caller(). Without pre-inlining, the branch
>> probability will be 50-50, which is useless for later optimizations. With
>> pre-inlining, the profile will have the perfect branch count for each
>> callsite. The positive performance impact of context sensitive profiling
>> due to pre-inlining has been observed in real world large C++ programs.
>> Supporting context sensitive profiling directly is another way to solve
>> this, but it would introduce large additional runtime/memory overhead.
>>
>>
>> 1.2 Non-optimal placement of count update
>>
>> Another much smaller showdown factor is the placement of the counter
>> updates. Current front-end based instrumentation applies the
>> instrumentation to each front-end lexical construct. It also minimizes the
>> number of static instrumentations. Note that it always instruments the
>> entry count of the CFG. This may result in higher dynamic instruction
>> counts. For example,
>>          BB0
>>           | 100
>>          BB1
>>      90 /    \ 10
>>       BB2    BB3
>>      90 \    / 10
>>          BB4
>> In the above example, FE based instrumentation will always insert a count
>> update in BB0. The dynamic instrumentation count will be either 190
>> (instrumenting bb0->bb1 and bb1->bb2) or 110 (instrumenting bb0->bb1 and
>> bb1->bb3). A better choice is to instrument bb1->bb2 and bb1->bb3, where
>> the dynamic instrumentation count is 100.
>>
>> Our experiments show that optimal placement based on edge hotness can
>> improve instrumented code performance by about 10%. While it's hard to
>> find the optimal placement of count updates, compiler heuristics can be
>> used to get a better placement. These heuristics can be based on static
>> profile prediction or user annotations (like __builtin_expect) to estimate
>> relative edge hotness and put instrumentation on the less hot edges. The
>> initial late instrumentation has not fully implemented this placement
>> strategy yet. With that implemented, we expect even better results than
>> what is reported here. For real world programs, another major source of
>> slowdown is data races and false sharing of counter updates in highly
>> threaded programs. Pre-inlining can alleviate this problem, as the
>> counters in the inline instances are no longer shared. But a complete
>> solution to the data race issue is orthogonal to the problem we try to
>> solve here.
>>
>>
>> 2. High Level Design
>>
>> We propose to perform a pre-profile inline pass before the PGO
>> instrumentation pass. Since the instrumentation pass runs after inlining,
>> it has to be done in the middle-end.
>>
>> (1) The pre-inline pass
>> We will invoke a pre-inline pass before the instrumentation. When PGO is
>> on, the inlining will be split into two passes:
>> * A pre-inline pass that is scheduled before the profile
>> instrumentation/annotation
>> * A post-inline pass which is the regular inline pass after
>> instrumentation/annotation
>> By design, all callsites that are beneficial to inline without profile
>> data should be inlined in the pre-inline pass. This includes all callsites
>> that shrink code size after inlining. All the remaining callsites are left
>> to the regular inline pass, which runs when profile data is available.
>>
>> After pre-inlining, a CFG based profile instrumentation/annotation is
>> done: a minimum weight spanning tree (MST) of the CFG is computed first,
>> and then only the edges not in the MST are instrumented. The counter
>> update instructions are placed in the basic blocks.
>>
>> (2) Minimum Spanning Tree (MST) based instrumentation
>> A naive way of instrumenting is to insert a count update for every edge
>> in the CFG, which results in many redundant updates that make the runtime
>> very slow. Knuth [1] proposed a spanning tree based method: given a CFG,
>> first compute a spanning tree; all edges not in the tree are instrumented.
>> In the profile-use compilation, the counters are propagated (from the
>> leaves of the spanning tree) to all the edges. Knuth proved this method
>> inserts the minimum number of instrumentation points. The MST based method
>> only guarantees that the number of static instrumentation points is
>> minimized, not the number of dynamic instrumentation executions. To reduce
>> the dynamic count, edges with potentially high counts are put into the MST
>> first so that they have less chance of being instrumented.
>>
>>
>> 3. Experimental Results
>>
>> 3.1 Measurement of the efficiency of instrumentation
>> Other than the runtime of the instrumented binaries, a more direct
>> measurement of the instrumentation overhead is the sum of the raw profile
>> count values. Note that regardless of what kind of instrumentation is
>> used, the raw profile counts should be able to reconstruct all the edge
>> count values for the whole program. Each raw profile count is obtained by
>> incrementing a counter variable by one, so the sum of the raw profile
>> count values is roughly the dynamic instruction count of the
>> instrumentation code. The lower the value, the more efficient the
>> instrumentation.
>>
>>
>> 3.2 LLVM instrumentations runtime for SPEC2006 C/C++ programs and SPEC2K
>> eon
>> The performance speedup is computed as (FE_instrumentation_runtime /
>> ME_instrumentation_runtime - 1).
>>
>> We ran the experiments on all C/C++ programs in SPEC2006 and on 252.eon
>> from SPEC2000. For the C programs, except for the outlier 456.hmmer, there
>> are small ups and downs across the different programs. Late
>> instrumentation improves hmmer a lot, but this is probably due to
>> unrelated loop optimizations (90% of the runtime is spent in one loop
>> nest).
>>
>> For C++ programs, the performance impact of late instrumentation is very
>> large, as expected. The following table shows the results. For some C++
>> programs, the speedup is huge. For example, in 483.xalancbmk, late
>> instrumentation speeds up performance by ~60%. Among all the SPEC C++
>> programs, only 444.namd is an outlier -- it uses a lot of macros and is a
>> very C-like program.
>>
>> Program Speedup
>> 471.omnetpp 16.03%
>> 473.astar 5.00%
>> 483.xalancbmk 58.57%
>> 444.namd -0.90%
>> 447.dealII 60.47%
>> 450.soplex 8.20%
>> 453.povray 11.34%
>> 252.eon 35.33%
>> -------------------------
>> Geomean 21.01%
>>
>> 3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs
>> We also collected some statistics on the profiles generated by FE based
>> instrumentation and late instrumentation, namely:
>> 1. the number of functions being instrumented,
>> 2. the resulting profile file size,
>> 3. the sum of raw count values mentioned earlier -- we use it to
>> measure the efficiency of the instrumentation.
>> The next table shows the ratio of each metric for late instrumentation
>> relative to FE based instrumentation for the C++ programs: column (1)
>> shows the ratio of instrumented functions; column (2) shows the ratio of
>> the profile file size; column (3) shows the ratio of the sum of raw count
>> values.
>>
>> (1) (2) (3)
>> 471.omnetpp 85.36% 110.26% 46.52%
>> 473.astar 64.86% 72.72% 63.13%
>> 483.xalancbmk 51.83% 56.11% 35.77%
>> 444.namd 75.36% 82.82% 85.77%
>> 447.dealII 43.42% 46.46% 26.75%
>> 450.soplex 71.80% 87.54% 51.19%
>> 453.povray 78.68% 83.57% 64.37%
>> 252.eon 72.06% 91.22% 30.02%
>> ----------------------------------------
>> Geomean 66.50% 76.36% 47.01%
>>
>>
>> For FE based instrumentation, profile count variables generated for dead
>> functions (like __llvm_prf_names, __llvm_prf_data, and __llvm_prf_cnts)
>> are not removed from the data/text segment, nor from the resulting
>> profile. There is a recent patch that removes this unused data for COMDAT
>> functions, but that patch does not touch regular functions. This is the
>> main reason for the larger number of instrumented functions and the larger
>> profile file size with FE based instrumentation. The reduction in the sum
>> of raw count values is mainly due to the elimination of redundant profile
>> updates enabled by pre-inlining.
>>
>> For C programs, we observe similar improvement in the profile size
>> (geomean ratio of 73.75%) and smaller improvement in the number of
>> instrumented functions (geomean ratio of 87.49%) and the sum of raw count
>> values (geomean of 82.76%).
>>
>>
>> 3.4 LLVM instrumentations runtime performance for Google internal C/C++
>> benchmarks
>>
>> We also use Google internal benchmarks (mostly typical C++ applications)
>> to measure the relative performance between FE based instrumentation and
>> late instrumentation. The following table shows the speedup of late
>> instrumentation vs. FE based instrumentation. Note that C++_benchmark01 is
>> a very large multi-threaded C++ program; late instrumentation sees a 4x
>> speedup there. Speedups larger than 3x are also seen in many other
>> programs.
>>
>> C++_benchmark01 416.98%
>> C++_benchmark02 6.29%
>> C++_benchmark03 22.39%
>> C++_benchmark04 28.05%
>> C++_benchmark05 2.00%
>> C++_benchmark06 675.89%
>> C++_benchmark07 359.19%
>> C++_benchmark08 395.03%
>> C_benchmark09 15.11%
>> C_benchmark10 5.47%
>> C++_benchmark11 5.73%
>> C++_benchmark12 2.31%
>> C++_benchmark13 87.73%
>> C++_benchmark14 7.22%
>> C_benchmark15 -0.51%
>> C++_benchmark16 59.15%
>> C++_benchmark17 358.82%
>> C++_benchmark18 861.36%
>> C++_benchmark19 29.62%
>> C++_benchmark20 11.82%
>> C_benchmark21 0.53%
>> C++_benchmark22 43.10%
>> ---------------------------
>> Geomean 83.03%
>>
>>
>> 3.5 Statistics of LLVM profiles for Google internal benchmarks
>>
>> The following table shows the profile statistics for the Google internal
>> benchmarks.
>> (1) (2) (3)
>> C++_benchmark01 36.84% 40.29% 2.32%
>> C++_benchmark02 39.20% 40.49% 42.39%
>> C++_benchmark03 39.37% 40.65% 23.24%
>> C++_benchmark04 39.13% 40.68% 17.70%
>> C++_benchmark05 36.58% 38.27% 51.08%
>> C++_benchmark06 29.50% 27.87% 2.87%
>> C++_benchmark07 29.50% 27.87% 1.73%
>> C++_benchmark08 29.50% 27.87% 4.17%
>> C_benchmark09 53.95% 68.00% 11.18%
>> C_benchmark10 53.95% 68.00% 31.74%
>> C++_benchmark11 36.40% 37.07% 46.12%
>> C++_benchmark12 38.44% 41.90% 73.59%
>> C++_benchmark13 39.28% 42.72% 29.56%
>> C++_benchmark14 38.59% 42.20% 13.42%
>> C_benchmark15 57.45% 48.50% 66.99%
>> C++_benchmark16 36.86% 42.18% 16.53%
>> C++_benchmark17 37.82% 39.77% 13.68%
>> C++_benchmark18 37.82% 39.77% 7.96%
>> C++_benchmark19 37.52% 40.46% 1.85%
>> C++_benchmark20 32.37% 30.44% 19.69%
>> C_benchmark21 37.63% 40.42% 88.81%
>> C++_benchmark22 36.28% 36.92% 21.62%
>> --------------------------------------------------
>> Geomean 38.22% 39.96% 15.58%
>>
>>
>> 4. Implementation Details:
>>
>> We need to add new option(s) for the alternative PGO instrumentation pass
>> in the middle end. It can be in one of the following forms:
>>
>> (1) Completely new options on par with the current PGO options:
>> -fprofile-late-instr-generate[=<profile_file>]? for PGO instrumentation,
>> and -fprofile-late-instr-use[=<profile_file>]? for PGO use.
>> (2) Or, late instrumentation can be turned on with an additional
>> option -fprofile-instr-late alongside the current PGO options, i.e.
>> -fprofile-instr-late -fprofile-instr-generate[=<profile_file>]? for PGO
>> instrumentation, and -fprofile-instr-late
>> -fprofile-instr-use[=<profile_file>]? for PGO use.
>> (3) Alternatively to (2), only keep the -fprofile-instr-late option for
>> PGO instrumentation, and add a magic tag to the profile so that FE based
>> and late instrumented profiles can be automatically detected by the
>> profile loader in the PGO use compilation. This requires a slight profile
>> format change.
>>
>> In our prototype implementation, two new passes are added at the
>> beginning of PassManagerBuilder::populateModulePassManager(), namely
>> PreProfileInlinerPass and PGOLateInstrumentationPass.
>>
>>
>> 4.1 Pre-inline pass:
>>
>> It is controlled by the back-end options "-preinline" and
>> "-disable-preinline". If the user specifies any of the
>> "-fprofile-late-instr-{generate|use}" options, "-mllvm -preinline" will be
>> automatically inserted by the driver. To disable the pre-inliner when late
>> instrumentation is enabled, use "-mllvm -disable-preinline".
>>
>> For now, only minimal tuning has been done for the pre-inliner; it simply
>> adjusts the inline threshold: if -Oz is specified, the threshold is set to
>> 25; otherwise, it is 75.
>>
>> The following clean-up passes are added to the PassManager, right after
>> the PreProfileInline pass:
>> createEarlyCSEPass()
>> createJumpThreadingPass()
>> createCorrelatedValuePropagationPass()
>> createCFGSimplificationPass()
>> createInstructionCombiningPass()
>> createGVNPass(DisableGVNLoadPRE)
>> createPeepholePass()
>> Some of them might not be necessary.
>>
>> 4.2 Late Instrumentation Pass:
>> The late instrumentation pass runs right after the pre-inline pass and
>> its cleanup passes. It is controlled by the opt options
>> "-pgo-late-instr-gen" and "-pgo-late-instr-use". For the
>> "-pgo-late-instr-use" option, the driver will provide the profile name.
>> For "-pgo-late-instr-gen", a pass that calls createInstrProfilingPass()
>> is also added to the PassManager to lower the instrumentation intrinsics.
>>
>> PGOLateInstrumentation is a module pass that applies the instrumentation
>> to each function via the class PGOLateInstrumentationFunc. For each
>> function, it performs the following steps:
>> 1. Collect all the CFG edges and assign an estimated weight to each
>> edge. Critical edges and back-edges are assigned high weights. One fake
>> node and a few fake edges (from the fake node to the entry node, and from
>> all the exit nodes to the fake node) are also added to the worklist.
>> 2. Construct the MST. Edges with higher weight are put into the MST
>> first, unless doing so would form a cycle.
>> 3. Traverse the CFG and compute the CFG hash using a CRC32 of the
>> index of each BB.
>> The above three steps are the same for profile-generate and profile-use
>> compilation.
>>
>> In the next step, for the profile-generate compilation, all the edges
>> not in the MST are instrumented. If an edge is a critical edge, it is
>> split first. The actual instrumentation generates an
>> Intrinsic::instrprof_increment() in the instrumented BB. This intrinsic
>> is lowered by the pass created by createInstrProfilingPass().
>>
>>
>> For the -fprofile-use compilation, first read the counters and the CFG
>> hash from the profile file. If the CFG hash matches, propagate the
>> counters to all the edges in reverse topological order of the MST. Once
>> all the edge counts are available, set the branch weights metadata for IR
>> instructions with multiple targets. Also apply the cold/hot function
>> attributes based on function-level counts.
>>
>>
>> 4.3 Profile Format:
>>
>> The late instrumentation profile is mostly the same as the one from
>> front-end instrumentation. The differences are:
>> * Function checksums are different.
>> * Function entry counts are no longer available.
>> For the llvm-profdata utility, the option -lateinstr needs to be used to
>> differentiate FE based and late instrumentation profiles, unless a magic
>> tag is added to the profile.
>>
>>
>> 5. References:
>> [1] Donald E. Knuth, Francis R. Stevenson. Optimal measurement of points
>> for program frequency counts. BIT Numerical Mathematics 1973, Volume 13,
>> Issue 3, pp 313-322
>>
>>
>