[llvm-dev] RFC: PGO Late instrumentation for LLVM
Sean Silva via llvm-dev
llvm-dev at lists.llvm.org
Sun Aug 9 18:23:20 PDT 2015
On Sat, Aug 8, 2015 at 6:31 AM, Xinliang David Li <davidxl at google.com>
wrote:
> On Fri, Aug 7, 2015 at 10:56 PM, Sean Silva <chisophugis at gmail.com> wrote:
> > Accidentally sent to uiuc server.
> >
> >
> > On Fri, Aug 7, 2015 at 10:49 PM, Sean Silva <chisophugis at gmail.com>
> wrote:
> >>
> >> Can you compare your results with another approach: simply do not
> >> instrument the top 1% hottest functions (by function entry count)? If this
> >> simple approach provides most of the benefits (my measurements on one
> >> codebase I tested show that it would eliminate over 97% of total function
> >> counts), we may be able to use a simpler approach.
>
> For a static compiler, this is not possible. It also seems to defeat the
> purpose of PGO -- the hottest functions are those which need profile
> guidance the most.
>
In the program I looked at, the top 1% were just trivial getters,
constructors, and the like. We should already be "getting these right".
Stuff like:
class Foo {
  ...
  Foo(int bar) : m_bar(bar) {}
  int getBar() { return m_bar; }
  ...
};
Are the results different for your codebases? Have you tried something like
simply not instrumenting the hottest 1% or 0.5% of functions? (Maybe
restrict the instrumentation skipping to functions of just a single BB with
fewer than, say, 10 instructions.)
Rong's approach is quite sophisticated; I'm just interested in getting a
sanity check against a "naive" approach to see how much the sophisticated
approach is buying us.
>
> >>
> >> The biggest thing I notice about this proposal is that although the focus
> >> of the proposal is on reducing profiling overhead, it requires using a
> >> different pass pipeline during *both* the "instr-generate" compilation and
> >> the "instr-use" compilation. Therefore it is more than simply a reduction
> >> in profiling overhead -- it can also change performance even ignoring
> >> adding the profiling information in the IR. I think it would be
> >> interesting to test the modified pass pipeline even in the absence of
> >> profile information to better understand its effects in isolation (for
> >> example, maybe it would be good to add these passes during -O3 regardless
> >> of whether we are doing PGO).
> >>
>
> The pipeline change is very PGO specific. I expect it to have very little
> impact on regular compilations:
> 1) LLVM's bottom-up inliner is already iterative.
> 2) The performance impact (on the instrumented build) can be as large as
> 4x -- which is unlikely for any non-PGO pipeline change.
>
With respect to adding extra passes, I'm actually more concerned about the
non-instrumented build, for which Rong did not show any data. For example,
will users find their program is X% faster with ("ME") PGO than with no PGO,
when really (X/2)% of that is due simply to the extra passes and not to any
profile guidance? If so, we should prefer to use the extra passes during
regular -O3 builds. Conversely, if their program is X% faster with ("ME")
PGO but the extra passes are making the program (X/2)% slower, then users
could be seeing a (3X/2)% speedup instead. My only concern is having two
variables change simultaneously; I think that instrumenting after some
amount of cleanup has been done makes a lot of sense.
Could Rong's proposal be made to work within the existing pipeline, but
doing the instrumentation after a subset of the existing pass pipeline has
been run?
-- Sean Silva
>
> LLVM already supports running SCC passes iteratively, so an experiment
> like this will be easy to do -- the data can be collected.
>
> thanks,
>
> David
>
> >> -- Sean Silva
> >>
> >> On Fri, Aug 7, 2015 at 4:54 PM, Rong Xu <xur at google.com> wrote:
> >>>
> >>> Instrumentation based Profile Guided Optimization (PGO) is a compiler
> >>> technique that leverages important program runtime information, such as
> >>> precise edge counts and frequent value information, to make frequently
> >>> executed code run faster. It's proven to be one of the most effective
> >>> ways to improve program performance.
> >>>
> >>> An important design point of PGO is to decide where to place the
> >>> instrumentation. In current LLVM PGO, the instrumentation is done in the
> >>> Clang front-end (referred to below as FE based instrumentation). Doing
> >>> this early in the front-end gives better coverage information, as there
> >>> is more precise line number information. The resulting profile also has
> >>> relatively high tolerance to compiler changes: compiler changes after
> >>> the instrumentation point will not lead to mismatched IR that
> >>> invalidates the profile.
> >>>
> >>> On the other hand, doing this too early gives the compiler fewer
> >>> opportunities to optimize the code before instrumentation. This has a
> >>> significant impact on instrumentation runtime performance. In addition,
> >>> it tends to produce a larger binary and a larger profile data file. Our
> >>> internal C++ benchmarks show that FE based instrumentation degrades
> >>> performance (compared to the non-instrumented version) by 58.3%, and in
> >>> some extreme cases, the application speed/throughput decreases by 95%
> >>> (21x runtime slowdown).
> >>>
> >>> An instrumented binary that runs too slowly is undesirable in PGO for
> >>> the following reasons:
> >>> * It slows down an already lengthy build. In the worst case, the
> >>> instrumented binary is so slow that it fails to run a representative
> >>> workload, because slow execution can lead to more time-outs in many
> >>> server programs. Other typical issues include: text size too big,
> >>> failure to link instrumented binaries, and memory usage exceeding system
> >>> limits.
> >>> * A slow runtime affects the program behavior. Real applications
> >>> sometimes monitor the program runtime and take different execution paths
> >>> when the program runs too slowly. This would defeat the underlying
> >>> assumption of PGO and make it less effective.
> >>>
> >>> This work proposes an option to turn on a new, middle-end based
> >>> instrumentation that aims to speed up the instrumentation runtime. The
> >>> new instrumentation is referred to as ME based instrumentation in this
> >>> document. Our experimental results show that ME instrumentation can
> >>> speed up the instrumentation runtime by 80% on average for typical C++
> >>> programs. Here are the two main design objectives:
> >>> * Co-existence with the FE instrumenter: We do not propose to replace
> >>> FE based instrumentation, because FE based instrumentation has its own
> >>> advantages and applications. Users can choose which phase does the
> >>> instrumentation via command line options.
> >>> * PGO runtime support sharing: The ME instrumenter will completely
> >>> re-use the existing PGO runtime support.
> >>>
> >>> 1. FE Based Instrumentation Runtime Overhead Analysis
> >>>
> >>> Instrumented binaries are expected to run slower due to the inserted
> >>> instrumentation code. With FE based instrumentation, the overhead is
> >>> especially high, and the runtime slowdown can be unacceptable in many
> >>> cases. Further analysis shows that there are 3 important factors
> >>> contributing to the FE instrumentation slowdown:
> >>> * [Main] Redundant counter updates of inlined functions. C++ programs
> >>> can introduce large abstraction penalties by using lots of small inline
> >>> functions (assignment operators, getters, setters, ctors/dtors, etc).
> >>> The overhead of instrumenting those small functions can be very large,
> >>> making training runs too slow and in some cases unusable;
> >>> * Non-optimal placement of the count updates;
> >>> * A third factor is related to value profiling (to be turned on in the
> >>> future). Small and hot callee functions taking function pointers
> >>> (callbacks) can incur overhead due to indirect call target profiling.
> >>>
> >>>
> >>> 1.1 Redundant Counter Update
> >>>
> >>> Checking the assembly of an instrumented binary generated by the
> >>> current LLVM implementation, we can find many sequences of consecutive
> >>> 'incq' instructions updating different counters in the same basic block.
> >>> Here is an example extracted from a real binary:
> >>> ...
> >>> incq 0xa91d80(%rip) # 14df4b8
> >>> <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8>
> >>> incq 0xa79011(%rip) # 14c6750
> >>> <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm>
> >>> incq 0xa79442(%rip) # 14c6b88
> >>> <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv>
> >>> incq 0x9c288b(%rip) # 140ffd8
> >>> <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl>
> >>> ...
> >>>
> >>> From the profile-use point of view, many of these counter updates are
> >>> redundant. Consider the following example:
> >>> void bar() {
> >>>   sum++;
> >>> }
> >>> void foo() {
> >>>   bar();
> >>> }
> >>>
> >>> FE based instrumentation needs to insert a counter update for the only
> >>> BB of bar():
> >>> bar: # @bar
> >>> # BB#0: # %entry
> >>> incq .L__llvm_profile_counters_bar(%rip)
> >>> incl sum(%rip)
> >>> retq
> >>>
> >>> It also needs to insert an update for the BB in function foo(). After
> >>> inlining bar() into foo(), the code is:
> >>> foo: # @foo
> >>> # BB#0: # %entry
> >>> incq .L__llvm_profile_counters_foo(%rip)
> >>> incq .L__llvm_profile_counters_bar(%rip)
> >>> incl sum(%rip)
> >>> retq
> >>>
> >>> If bar() is always inlined, the .L__llvm_profile_counters_bar(%rip)
> >>> update is redundant -- the counter won't help downstream optimizations.
> >>> On the other hand, if bar() is a large function that may not be suitable
> >>> for inlining at all callsites, this counter update is necessary in order
> >>> to produce more accurate profile data for the out-of-line instance of
> >>> the callee.
> >>>
> >>> If foo() is a hot function, the overhead of updating two counters can be
> >>> significant. This is especially bad for C++ programs, where there are
> >>> tons of small inline functions.
> >>>
> >>> There is another missed opportunity in FE based instrumentation. The
> >>> small functions' control flow can usually be simplified when they are
> >>> inlined into caller contexts. Once the control flow is simplified, many
> >>> counter updates can be eliminated. This is only possible for a
> >>> middle-end based late instrumenter. Defining a custom clean-up pass to
> >>> remove redundant counter updates is unrealistic: it cannot be done in a
> >>> sane way without destroying the profile integrity of either the
> >>> out-of-line or the inline instances of the callee.
> >>>
> >>> A much simpler and cleaner solution is to run a pre-inline pass that
> >>> inlines all the trivial inlines before instrumentation. In addition to
> >>> removing the unnecessary count updates for the inline instances, another
> >>> advantage of pre-inlining is that it provides context sensitive profiles
> >>> for these small inlined functions. This context sensitive profile can
> >>> further improve the PGO based optimizations. Here is a contrived example:
> >>> void bar(int n) {
> >>>   if (n & 1)
> >>>     do_sth1();
> >>>   else
> >>>     do_sth2();
> >>> }
> >>>
> >>> void caller() {
> >>>   int s = 1;
> >>>   for (; s < 100; s += 2)
> >>>     bar(s);
> >>>
> >>>   for (s = 102; s < 200; s += 2)
> >>>     bar(s);
> >>> }
> >>>
> >>> The direction of the branch inside bar will be completely opposite at
> >>> the two different callsites in ‘caller’. Without pre-inlining, the
> >>> branch probability will be 50-50, which is useless for later
> >>> optimizations. With pre-inlining, the profile will have the perfect
> >>> branch count for each callsite. The positive performance impact of
> >>> context sensitive profiling enabled by pre-inlining has been observed in
> >>> real world large C++ programs. Supporting full context sensitive
> >>> profiling is another way to solve this, but it would introduce large
> >>> additional runtime/memory overhead.
> >>>
> >>>
> >>> 1.2 Non-optimal placement of count update
> >>>
> >>> Another, much smaller slowdown factor is the placement of the counter
> >>> updates. Current front-end based instrumentation applies the
> >>> instrumentation to each front-end lexical construct. It also minimizes
> >>> the number of static instrumentation points, and it always instruments
> >>> the entry count of the CFG. This may result in higher dynamic
> >>> instruction counts. For example:
> >>>        BB0
> >>>         |  100
> >>>        BB1
> >>>   90  /   \  10
> >>>     BB2   BB3
> >>>   90  \   /  10
> >>>        BB4
> >>> In the above example, FE based instrumentation will always insert a
> >>> count update in BB0. The dynamic instrumentation count will be either
> >>> 110 (instrumenting bb0->bb1 and bb1->bb3) or 190 (instrumenting bb0->bb1
> >>> and bb1->bb2). A better instrumentation is to instrument bb1->bb2 and
> >>> bb1->bb3, where the dynamic instrumentation count is 100.
> >>>
> >>> Our experiments show that optimal placement based on edge hotness can
> >>> improve instrumented code performance by about 10%. While it's hard to
> >>> find the optimal placement of count updates, compiler heuristics can be
> >>> used to get a better placement. These heuristics can be based on static
> >>> profile prediction or user annotations (like __builtin_expect) to
> >>> estimate the relative edge hotness and put instrumentation on the less
> >>> hot edges. The initial late instrumentation has not fully implemented
> >>> this placement strategy yet. With that implemented, we expect even
> >>> better results than what is reported here. For real world programs,
> >>> another major source of slowdown is data races and false sharing of the
> >>> counter updates in highly threaded programs. Pre-inlining can alleviate
> >>> this problem, as the counters in the inline instances are no longer
> >>> shared. But a complete solution to the data race issue is orthogonal to
> >>> the problem we are trying to solve here.
> >>>
> >>>
> >>> 2. High Level Design
> >>>
> >>> We propose to perform a pre-profile inline pass before the PGO
> >>> instrumentation pass. Since the instrumentation pass runs after
> >>> inlining, it has to be done in the middle-end.
> >>>
> >>> (1) The pre-inline pass
> >>> We will invoke a pre-inline pass before the instrumentation. When PGO is
> >>> on, inlining is split into two passes:
> >>> * A pre-inline pass that is scheduled before the profile
> >>> instrumentation/annotation
> >>> * A post-inline pass, which is the regular inline pass after
> >>> instrumentation/annotation
> >>> By design, all callsites that are beneficial to inline without profile
> >>> data should be inlined in the pre-inline pass. This includes all
> >>> callsites where inlining shrinks code size. All the remaining callsites
> >>> will be left to the regular inline pass, when profile data is available.
> >>>
> >>> After pre-inlining, a CFG based profile instrumentation/annotation will
> >>> be done. A minimum weight spanning tree (MST) of the CFG is first
> >>> computed, then only the edges not in the MST are instrumented. The
> >>> counter update instructions are placed in the basic blocks.
> >>>
> >>> (2) Minimum Spanning Tree (MST) based instrumentation
> >>> A naive way of instrumenting is to insert a count update for every edge
> >>> in the CFG, which results in many redundant updates that make the
> >>> runtime very slow. Knuth [1] proposed a minimum spanning tree based
> >>> method: given a CFG, first compute a spanning tree; all edges that are
> >>> not in the MST are instrumented. In the profile-use compilation, the
> >>> counters are propagated (from the leaves of the spanning tree) to all
> >>> the edges. Knuth proved that this method inserts the minimum number of
> >>> instrumentation points. The MST based method only guarantees that the
> >>> number of static instrumentation points is minimized, not the dynamic
> >>> instrumentation count. To reduce the dynamic instrumentation count,
> >>> edges with potentially high counts are put into the MST first, so that
> >>> they have less chance of being instrumented, as sketched below.
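> >>>
> >>> As a minimal sketch of this selection step (illustrative only, not the
> >>> prototype code: the Edge struct and function name below are made up),
> >>> here is a Kruskal-style construction with a union-find that considers
> >>> hot edges first and instruments exactly the edges left out of the tree:
> >>>
> >>> #include <algorithm>
> >>> #include <cstdint>
> >>> #include <numeric>
> >>> #include <vector>
> >>>
> >>> struct Edge { unsigned Src, Dst; uint64_t Weight; };
> >>>
> >>> // Returns the indices of the edges to instrument: those NOT taken into
> >>> // the spanning tree. Hot edges are considered first, so they tend to
> >>> // join the tree and escape instrumentation.
> >>> std::vector<unsigned> selectInstrumentedEdges(unsigned NumBBs,
> >>>                                               const std::vector<Edge> &Edges) {
> >>>   std::vector<unsigned> Order(Edges.size()), Parent(NumBBs);
> >>>   std::iota(Order.begin(), Order.end(), 0u);
> >>>   std::iota(Parent.begin(), Parent.end(), 0u);
> >>>   std::sort(Order.begin(), Order.end(), [&](unsigned A, unsigned B) {
> >>>     return Edges[A].Weight > Edges[B].Weight;   // hottest first
> >>>   });
> >>>   auto Find = [&](unsigned X) {                 // union-find root lookup
> >>>     while (Parent[X] != X) X = Parent[X] = Parent[Parent[X]];
> >>>     return X;
> >>>   };
> >>>   std::vector<unsigned> ToInstrument;
> >>>   for (unsigned I : Order) {
> >>>     unsigned A = Find(Edges[I].Src), B = Find(Edges[I].Dst);
> >>>     if (A == B)
> >>>       ToInstrument.push_back(I);  // would close a cycle: gets a counter
> >>>     else
> >>>       Parent[A] = B;              // joins the spanning tree: no counter
> >>>   }
> >>>   return ToInstrument;
> >>> }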
> >>>
> >>>
> >>> 3. Experimental Results
> >>>
> >>> 3.1 Measurement of the efficiency of instrumentation
> >>> Other than the runtime of the instrumented binaries, a more direct
> >>> measurement of the instrumentation overhead is the sum of the raw
> >>> profile count values. Note that regardless of what kind of
> >>> instrumentation is used, the raw profile counts must be able to
> >>> reconstruct all the edge count values for the whole program. All raw
> >>> profile values are obtained by incrementing a counter variable by one,
> >>> so the sum of the raw profile count values is roughly the dynamic
> >>> instruction count of the instrumentation code. The lower the value, the
> >>> more efficient the instrumentation.
> >>>
> >>>
> >>> 3.2 LLVM instrumentation runtime for SPEC2006 C/C++ programs and
> >>> SPEC2K eon
> >>> The performance speedup is computed as (FE_instrumentation_runtime /
> >>> ME_instrumentation_runtime - 1); for example, a speedup of 100% means
> >>> the FE instrumented binary takes twice as long to run as the ME
> >>> instrumented one.
> >>>
> >>> We ran the experiments on all C/C++ programs in SPEC2006 and on 252.eon
> >>> from SPEC2000. For C programs, except for one outlier, 456.hmmer, there
> >>> are small ups and downs across different programs. Late instrumentation
> >>> improves hmmer a lot, but that is probably due to unrelated loop
> >>> optimizations (90% of its runtime is spent in one loop nest).
> >>>
> >>> For C++ programs, the performance impact of late instrumentation is
> >>> very large, as expected. The following table shows the results. For
> >>> some C++ programs, the speedup is huge. For example, in 483.xalancbmk,
> >>> late instrumentation speeds up performance by ~60%. Among all the SPEC
> >>> C++ programs, only 444.namd is an outlier -- it uses a lot of macros and
> >>> is a very C-like program.
> >>>
> >>> Program Speedup
> >>> 471.omnetpp 16.03%
> >>> 473.astar 5.00%
> >>> 483.xalancbmk 58.57%
> >>> 444.namd -0.90%
> >>> 447.dealII 60.47%
> >>> 450.soplex 8.20%
> >>> 453.povray 11.34%
> >>> 252.eon 35.33%
> >>> -------------------------
> >>> Geomean 21.01%
> >>>
> >>> 3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs
> >>> We also collected some statistics on the profiles generated by FE based
> >>> instrumentation and late instrumentation, namely:
> >>> 1. the number of functions that are instrumented,
> >>> 2. the resulting profile file size,
> >>> 3. the sum of raw count values mentioned earlier -- we use it to
> >>> measure the efficiency of the instrumentation.
> >>> The next table shows the ratios of each metric for late instrumentation
> >>> over the C++ programs, with FE based instrumentation as the base: column
> >>> (1) shows the ratios of instrumented functions; column (2) shows the
> >>> ratios of the profile file size; column (3) shows the ratios of the sum
> >>> of raw count values.
> >>>
> >>> (1) (2) (3)
> >>> 471.omnetpp 85.36% 110.26% 46.52%
> >>> 473.astar 64.86% 72.72% 63.13%
> >>> 483.xalancbmk 51.83% 56.11% 35.77%
> >>> 444.namd 75.36% 82.82% 85.77%
> >>> 447.dealII 43.42% 46.46% 26.75%
> >>> 450.soplex 71.80% 87.54% 51.19%
> >>> 453.povray 78.68% 83.57% 64.37%
> >>> 252.eon 72.06% 91.22% 30.02%
> >>> ----------------------------------------
> >>> Geomean 66.50% 76.36% 47.01%
> >>>
> >>>
> >>> With FE based instrumentation, profile count variables generated for
> >>> dead functions are not removed (such as __llvm_prf_names,
> >>> __llvm_prf_data, and __llvm_prf_cnts) from the data/text segments, nor
> >>> from the resulting profile. There is a recent patch that removes this
> >>> unused data for COMDAT functions, but that patch does not touch regular
> >>> functions. This is the main reason for the larger number of instrumented
> >>> functions and the larger profile file size with FE based
> >>> instrumentation. The reduction in the sum of raw count values is mainly
> >>> due to the elimination of redundant profile updates enabled by the
> >>> pre-inlining.
> >>>
> >>> For C programs, we observe a similar improvement in the profile size
> >>> (geomean ratio of 73.75%) and smaller improvements in the number of
> >>> instrumented functions (geomean ratio of 87.49%) and the sum of raw
> >>> count values (geomean of 82.76%).
> >>>
> >>>
> >>> 3.4 LLVM instrumentation runtime performance for Google internal C/C++
> >>> benchmarks
> >>>
> >>> We also used Google internal benchmarks (mostly typical C++
> >>> applications) to measure the relative performance between FE based
> >>> instrumentation and late instrumentation. The following table shows the
> >>> speedup of late instrumentation vs. FE based instrumentation. Note that
> >>> C++_benchmark01 is a very large multi-threaded C++ program; late
> >>> instrumentation sees a 4x speedup on it. Speedups larger than 3x are
> >>> also seen in many other programs.
> >>>
> >>> C++_benchmark01 416.98%
> >>> C++_benchmark02 6.29%
> >>> C++_benchmark03 22.39%
> >>> C++_benchmark04 28.05%
> >>> C++_benchmark05 2.00%
> >>> C++_benchmark06 675.89%
> >>> C++_benchmark07 359.19%
> >>> C++_benchmark08 395.03%
> >>> C_benchmark09 15.11%
> >>> C_benchmark10 5.47%
> >>> C++_benchmark11 5.73%
> >>> C++_benchmark12 2.31%
> >>> C++_benchmark13 87.73%
> >>> C++_benchmark14 7.22%
> >>> C_benchmark15 -0.51%
> >>> C++_benchmark16 59.15%
> >>> C++_benchmark17 358.82%
> >>> C++_benchmark18 861.36%
> >>> C++_benchmark19 29.62%
> >>> C++_benchmark20 11.82%
> >>> C_benchmark21 0.53%
> >>> C++_benchmark22 43.10%
> >>> ---------------------------
> >>> Geomean 83.03%
> >>>
> >>>
> >>> 3.5 Statistics of LLVM profiles for Google internal benchmarks
> >>>
> >>> The following table shows the profile statistics for the Google
> >>> internal benchmarks, using the same three metrics as in section 3.3:
> >>> (1) (2) (3)
> >>> C++_benchmark01 36.84% 40.29% 2.32%
> >>> C++_benchmark02 39.20% 40.49% 42.39%
> >>> C++_benchmark03 39.37% 40.65% 23.24%
> >>> C++_benchmark04 39.13% 40.68% 17.70%
> >>> C++_benchmark05 36.58% 38.27% 51.08%
> >>> C++_benchmark06 29.50% 27.87% 2.87%
> >>> C++_benchmark07 29.50% 27.87% 1.73%
> >>> C++_benchmark08 29.50% 27.87% 4.17%
> >>> C_benchmark09 53.95% 68.00% 11.18%
> >>> C_benchmark10 53.95% 68.00% 31.74%
> >>> C++_benchmark11 36.40% 37.07% 46.12%
> >>> C++_benchmark12 38.44% 41.90% 73.59%
> >>> C++_benchmark13 39.28% 42.72% 29.56%
> >>> C++_benchmark14 38.59% 42.20% 13.42%
> >>> C_benchmark15 57.45% 48.50% 66.99%
> >>> C++_benchmark16 36.86% 42.18% 16.53%
> >>> C++_benchmark17 37.82% 39.77% 13.68%
> >>> C++_benchmark18 37.82% 39.77% 7.96%
> >>> C++_benchmark19 37.52% 40.46% 1.85%
> >>> C++_benchmark20 32.37% 30.44% 19.69%
> >>> C_benchmark21 37.63% 40.42% 88.81%
> >>> C++_benchmark22 36.28% 36.92% 21.62%
> >>> --------------------------------------------------
> >>> Geomean 38.22% 39.96% 15.58%
> >>>
> >>>
> >>> 4. Implementation Details:
> >>>
> >>> We need to add new option(s) for the alternative PGO instrumentation
> >>> pass in the middle end. They can take one of the following forms:
> >>>
> >>> (1) Completely new options on par with the current PGO options:
> >>> -fprofile-late-instr-generate[=<profile_file>]? for PGO instrumentation,
> >>> and -fprofile-late-instr-use[=<profile_file>]? for PGO use.
> >>> (2) Or, late instrumentation can be turned on with an additional
> >>> option, -fprofile-instr-late, alongside the current PGO options, i.e.
> >>> -fprofile-instr-late -fprofile-instr-generate[=<profile_file>]? for PGO
> >>> instrumentation, and -fprofile-instr-late
> >>> -fprofile-instr-use[=<profile_file>]? for PGO use (see the example after
> >>> this list).
> >>> (3) Alternatively to (2), only keep the -fprofile-instr-late option for
> >>> PGO instrumentation, and add a magic tag to the profile so that FE based
> >>> and late instrumented profiles can be automatically detected by the
> >>> profile loader in the PGO use compilation. This requires a slight
> >>> profile format change.
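> >>>
> >>> For illustration, a full cycle with form (2) might look like the
> >>> following (hypothetical invocations: -fprofile-instr-late is only
> >>> proposed here; the other flags and llvm-profdata already exist, and
> >>> prog.cc / training-input are placeholders):
> >>>
> >>> # Build with late instrumentation:
> >>> clang++ -O2 -fprofile-instr-late \
> >>>   -fprofile-instr-generate=prog.profraw prog.cc -o prog
> >>> # Run a training workload, then merge and consume the profile:
> >>> ./prog < training-input
> >>> llvm-profdata merge -o prog.profdata prog.profraw
> >>> clang++ -O2 -fprofile-instr-late \
> >>>   -fprofile-instr-use=prog.profdata prog.cc -o prog-opt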
> >>>
> >>> In our prototype implementation, two new passes are added at the
> >>> beginning of PassManagerBuilder::populateModulePassManager(), namely
> >>> PreProfileInlinerPass and PGOLateInstrumentationPass.
> >>>
> >>>
> >>> 4.1 Pre-inline pass:
> >>>
> >>> It is controlled by the back-end options "-preinline" and
> >>> "-disable-preinline". If the user specifies any of
> >>> "-fprofile-late-instr-{generate|use}", the option "-mllvm -preinline"
> >>> will be automatically inserted by the driver. To disable the pre-inliner
> >>> when late instrumentation is enabled, use "-mllvm -disable-preinline".
> >>>
> >>> For now, only minimal tuning has been done for the pre-inliner; it
> >>> simply adjusts the inline threshold: if -Oz is specified, the threshold
> >>> is set to 25; otherwise, it is 75.
> >>>
> >>> The following clean-up passes are added to the PassManager, right after
> >>> the PreProfileInline pass (see the sketch after this list):
> >>> createEarlyCSEPass()
> >>> createJumpThreadingPass()
> >>> createCorrelatedValuePropagationPass()
> >>> createCFGSimplificationPass()
> >>> createInstructionCombiningPass()
> >>> createGVNPass(DisableGVNLoadPRE)
> >>> createPeepholePass()
> >>> Some of them might not be necessary.
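> >>>
> >>> A minimal sketch of how the prototype might schedule this (assuming the
> >>> legacy pass manager; the two PGO pass creators are hypothetical names
> >>> from this proposal, not existing LLVM APIs, and header locations for the
> >>> clean-up pass creators may differ across revisions; the peephole pass is
> >>> omitted):
> >>>
> >>> #include "llvm/IR/LegacyPassManager.h"
> >>> #include "llvm/Transforms/Scalar.h"
> >>> using namespace llvm;
> >>>
> >>> // Hypothetical creators provided by the prototype patches:
> >>> Pass *createPreProfileInlinerPass();
> >>> ModulePass *createPGOLateInstrumentationPass();
> >>>
> >>> static void addPGOLatePasses(legacy::PassManagerBase &MPM,
> >>>                              bool DisableGVNLoadPRE) {
> >>>   MPM.add(createPreProfileInlinerPass());  // inline trivial callsites
> >>>   // Clean-up passes scheduled right after pre-inlining:
> >>>   MPM.add(createEarlyCSEPass());
> >>>   MPM.add(createJumpThreadingPass());
> >>>   MPM.add(createCorrelatedValuePropagationPass());
> >>>   MPM.add(createCFGSimplificationPass());
> >>>   MPM.add(createInstructionCombiningPass());
> >>>   MPM.add(createGVNPass(DisableGVNLoadPRE));
> >>>   // Instrument on the simplified CFG:
> >>>   MPM.add(createPGOLateInstrumentationPass());
> >>> }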
> >>>
> >>> 4.2 Late Instrumentation Pass:
> >>> The late instrumentation runs right after the pre-inline pass and its
> >>> cleanup passes. It is controlled by the opt options "-pgo-late-instr-gen"
> >>> and "-pgo-late-instr-use". For the "-pgo-late-instr-use" option, the
> >>> driver will provide the profile name.
> >>> For "-pgo-late-instr-gen", a pass that calls createInstrProfilingPass()
> >>> is also added to the PassManager to lower the instrumentation intrinsics.
> >>>
> >>> PGOLateInstrumentation is a module pass that applies the instrumentation
> >>> to each function via the class PGOLateInstrumentationFunc. For each
> >>> function, it performs the following steps:
> >>> 1. First collect all the CFG edges and assign an estimated weight to
> >>> each edge. Critical edges and back-edges are assigned high weights. One
> >>> fake node and a few fake edges (from the fake node to the entry node,
> >>> and from all the exit nodes to the fake node) are also added to the
> >>> worklist.
> >>> 2. Construct the MST. Edges with higher weights are put into the MST
> >>> first, unless doing so would form a cycle.
> >>> 3. Traverse the CFG and compute the CFG hash using a CRC32 of the
> >>> index of each BB (see the sketch below).
> >>> The above three steps are the same for profile-generate and profile-use
> >>> compilation.
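> >>>
> >>> A sketch of the hash in step 3 (illustrative: a plain bitwise CRC32
> >>> over each BB index; the real implementation is free to choose any
> >>> checksum routine):
> >>>
> >>> #include <cstdint>
> >>> #include <vector>
> >>>
> >>> // CRC32 (reflected polynomial 0xEDB88320) over the 4 bytes of each
> >>> // BB index, visited in traversal order.
> >>> uint32_t computeCFGHash(const std::vector<unsigned> &BBIndices) {
> >>>   uint32_t CRC = 0xFFFFFFFFu;
> >>>   for (unsigned Idx : BBIndices)
> >>>     for (int Byte = 0; Byte < 4; ++Byte) {
> >>>       CRC ^= (Idx >> (8 * Byte)) & 0xFFu;
> >>>       for (int Bit = 0; Bit < 8; ++Bit)
> >>>         CRC = (CRC >> 1) ^ (0xEDB88320u & (0u - (CRC & 1u)));
> >>>     }
> >>>   return ~CRC;
> >>> }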
> >>>
> >>> Next, for the profile-generation compilation, all the edges that are
> >>> not in the MST are instrumented. If an edge is a critical edge, it is
> >>> split first. The actual instrumentation generates an
> >>> Intrinsic::instrprof_increment() call in the instrumented BB. This
> >>> intrinsic will be lowered by the pass created by
> >>> createInstrProfilingPass().
> >>>
> >>> For the profile-use compilation, first read the counters and the CFG
> >>> hash from the profile file. If the CFG hash matches, populate the
> >>> counters for all the edges in reverse topological order of the MST, as
> >>> sketched below. Once all the edge counts are available, set the branch
> >>> weight metadata for the IR instructions with multiple branch targets.
> >>> Also apply the cold/hot function attributes based on the function level
> >>> counts.
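> >>>
> >>> A minimal sketch of the propagation (illustrative only; it reuses the
> >>> Edge struct and headers from the sketch in section 2, assumes the counts
> >>> of the instrumented non-tree edges were already read from the profile,
> >>> and uses a leaves-inward fixed-point instead of an explicit reverse
> >>> topological traversal):
> >>>
> >>> // With the fake exit->entry edge included, inflow equals outflow at
> >>> // every BB, so a BB with exactly one unknown incident edge determines
> >>> // that edge's count; repeating this resolves the whole spanning tree.
> >>> void populateCounts(unsigned NumBBs, const std::vector<Edge> &Edges,
> >>>                     std::vector<bool> &Known,
> >>>                     std::vector<uint64_t> &Count) {
> >>>   for (bool Changed = true; Changed;) {
> >>>     Changed = false;
> >>>     for (unsigned BB = 0; BB < NumBBs; ++BB) {
> >>>       int Unknown = -1;
> >>>       int64_t Balance = 0;  // known inflow minus known outflow
> >>>       for (unsigned I = 0; I < Edges.size(); ++I) {
> >>>         if (Edges[I].Src == Edges[I].Dst) continue;  // self-loops cancel
> >>>         if (Edges[I].Src != BB && Edges[I].Dst != BB) continue;
> >>>         if (!Known[I]) {
> >>>           if (Unknown != -1) { Unknown = -2; break; } // >1 unknown: skip
> >>>           Unknown = (int)I;
> >>>         } else {
> >>>           Balance += Edges[I].Dst == BB ? (int64_t)Count[I]
> >>>                                         : -(int64_t)Count[I];
> >>>         }
> >>>       }
> >>>       if (Unknown >= 0) {  // exactly one unknown incident edge: solve it
> >>>         Count[Unknown] = (uint64_t)(Balance < 0 ? -Balance : Balance);
> >>>         Known[Unknown] = true;
> >>>         Changed = true;
> >>>       }
> >>>     }
> >>>   }
> >>> }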
> >>>
> >>>
> >>> 4.3 Profile Format:
> >>>
> >>> The late instrumentation profile is mostly the same as the one from
> >>> front-end instrumentation. The differences are:
> >>> * Function checksums are different.
> >>> * Function entry counts are no longer available.
> >>> For the llvm-profdata utility, the option -lateinstr needs to be used
> >>> to differentiate FE based and late instrumentation profiles, unless a
> >>> magic tag is added to the profile.
> >>>
> >>>
> >>> 5. References:
> >>> [1] Donald E. Knuth and Francis R. Stevenson. "Optimal measurement
> >>> points for program frequency counts." BIT Numerical Mathematics,
> >>> Volume 13, Issue 3 (1973), pp. 313-322.
> >>>
> >>
> >
>