[llvm-dev] RFC: PGO Late instrumentation for LLVM
Sean Silva via llvm-dev
llvm-dev at lists.llvm.org
Sun Aug 9 18:23:20 PDT 2015
On Sat, Aug 8, 2015 at 6:31 AM, Xinliang David Li <davidxl at google.com>
wrote:
> On Fri, Aug 7, 2015 at 10:56 PM, Sean Silva <chisophugis at gmail.com> wrote:
> > Accidentally sent to uiuc server.
> >
> >
> > On Fri, Aug 7, 2015 at 10:49 PM, Sean Silva <chisophugis at gmail.com>
> wrote:
> >>
> >> Can you compare your results with another approach: simply do not
> >> instrument the top 1% hottest functions (by function entry count)? If this
> >> simple approach provides most of the benefits (my measurements on one
> >> codebase I tested show that it would eliminate over 97% of total function
> >> counts), we may be able to use a simpler approach.
>
> For a static compiler, this is not possible. It also seems to defeat the
> purpose of PGO -- the hottest functions are those which need profile
> guidance the most.
>
In the program I looked at, the top 1% were just trivial getters,
constructors, and the like. We should already be "getting these right".
Stuff like:
class Foo {
  ...
  Foo(int bar) : m_bar(bar) {}
  int getBar() { return m_bar; }
  ...
};
Are the results different for your codebases? Have you tried something like
simply not instrumenting the hottest 1% or 0.5% of functions? (Maybe
restrict the instrumentation skipping to functions of just a single BB with
fewer than, say, 10 instructions.)
Rong's approach is quite sophisticated; I'm just interested in getting a
sanity check against a "naive" approach to see how much the sophisticated
approach is buying us.
>
> >>
> >> The biggest thing I notice about this proposal is that although the focus
> >> of the proposal is on reducing profiling overhead, it requires using a
> >> different pass pipeline during *both* the "instr-generate" compilation and
> >> the "instr-use" compilation. Therefore it is more than simply a reduction
> >> in profiling overhead -- it can also change performance even ignoring
> >> adding the profiling information in the IR. I think it would be
> >> interesting to test the modified pass pipeline even in the absence of
> >> profile information to better understand its effects in isolation (for
> >> example, maybe it would be good to add these passes during -O3 regardless
> >> of whether we are doing PGO).
> >>
>
> The pipeline change is very PGO specific. I expect it to have very little
> impact on regular compilations:
> 1) LLVM's bottom-up inliner is already iterative.
> 2) The performance impact (on the instrumented build) can be as large as
> 4x -- which is unlikely for any non-PGO pipeline change.
>
With respect to adding extra passes, I'm actually more concerned about the
non-instrumented build, for which Rong did not show any data. For example,
will users find their program is X% faster with ("ME") PGO than with no PGO,
when really (X/2)% of that is due simply to the extra passes and not to any
profile guidance? If so, we should prefer to use the extra passes during
regular -O3 builds. Conversely, if their program is X% faster with ("ME")
PGO but the extra passes are making the program (X/2)% slower, then users
could be seeing a (3X/2)% speedup instead. My only concern is having two
variables change simultaneously; I think that instrumenting after some
amount of cleanup has been done makes a lot of sense.
Could Rong's proposal be made to work within the existing pipeline, but
doing the instrumentation after a subset of the existing pass pipeline has
been run?
-- Sean Silva
>
> LLVM already supports running SCC passes iteratively, so an experiment
> like this will be easy to do -- the data can be collected.
>
> thanks,
>
> David
>
> >> -- Sean Silva
> >>
> >> On Fri, Aug 7, 2015 at 4:54 PM, Rong Xu <xur at google.com> wrote:
> >>>
> >>> Instrumentation based Profile Guided Optimization (PGO) is a compiler
> >>> technique that leverages important program runtime information, such as
> >>> precise edge counts and frequent value information, to make frequently
> >>> executed code run faster. It's proven to be one of the most effective
> >>> ways to improve program performance.
> >>>
> >>> An important design point of PGO is to decide where to place the
> >>> instrumentation. In current LLVM PGO, the instrumentation is done in the
> >>> Clang front-end (referred to below as FE based instrumentation). Doing
> >>> this early in the front-end gives better coverage information, as there
> >>> is more precise line number information. The resulting profile also has
> >>> relatively high tolerance to compiler changes: compiler changes after
> >>> the instrumentation point will not lead to mismatched IR that
> >>> invalidates the profile.
> >>>
> >>> On the other hand, doing this too early gives the compiler fewer
> >>> opportunities to optimize the code before instrumentation. This has a
> >>> significant impact on instrumentation runtime performance. In addition,
> >>> it tends to produce a larger binary and a larger profile data file. Our
> >>> internal C++ benchmarks show that FE based instrumentation degrades
> >>> performance (compared to the non-instrumented version) by 58.3%, and in
> >>> some extreme cases, the application speed/throughput decreases by 95%
> >>> (21x runtime slowdown).
> >>>
> >>> An instrumented binary that runs too slowly is undesirable in PGO for
> >>> the following reasons:
> >>> * It slows down an already lengthy build. In the worst case, the
> >>> instrumented binary is so slow that it fails to run a representative
> >>> workload, because slow execution can lead to more time-outs in many
> >>> server programs. Other typical issues include: text size too big,
> >>> failure to link instrumented binaries, and memory usage exceeding system
> >>> limits.
> >>> * A slow runtime affects the program behavior. Real applications
> >>> sometimes monitor the program runtime and take different execution paths
> >>> when the program runs too slowly. This would defeat the underlying
> >>> assumption of PGO and make it less effective.
> >>>
> >>> This work proposes an option to turn on a new, middle-end based
> >>> instrumentation that aims to speed up the instrumentation runtime. The
> >>> new instrumentation is referred to as ME based instrumentation in this
> >>> document. Our experimental results show that ME instrumentation can
> >>> speed up the instrumentation runtime by 80% on average for typical C++
> >>> programs. Here are the two main design objectives:
> >>> * Co-existence with the FE instrumenter: We do not propose to replace
> >>> FE based instrumentation, because FE based instrumentation has its own
> >>> advantages and applications. Users can choose which phase does the
> >>> instrumentation via command line options.
> >>> * PGO runtime support sharing: The ME instrumenter will completely
> >>> re-use the existing PGO runtime support.
> >>>
> >>> 1. FE Based Instrumentation Runtime Overhead Analysis
> >>>
> >>> Instrumented binaries are expected to run slower due to the inserted
> >>> instrumentation code. With FE based instrumentation, the overhead is
> >>> especially high, and the runtime slowdown can be unacceptable in many
> >>> cases. Further analysis shows that there are 3 important factors
> >>> contributing to the FE instrumentation slowdown:
> >>> * [Main] Redundant counter updates of inlined functions. C++ programs
> >>> can introduce large abstraction penalties by using lots of small inline
> >>> functions (assignment operators, getters, setters, ctors/dtors, etc).
> >>> The overhead of instrumenting those small functions can be very large,
> >>> making training runs too slow and in some cases unusable;
> >>> * Non-optimal placement of the count updates;
> >>> * A third factor is related to value profiling (to be turned on in the
> >>> future). Small and hot callee functions taking function pointers
> >>> (callbacks) can incur overhead due to indirect call target profiling.
> >>>
> >>>
> >>> 1.1 Redundant Counter Update
> >>>
> >>> Checking the assembly of an instrumented binary generated by the
> >>> current LLVM implementation, we can find many sequences of consecutive
> >>> 'incq' instructions updating different counters in the same basic block.
> >>> Here is an example extracted from a real binary:
> >>> ...
> >>> incq 0xa91d80(%rip) # 14df4b8
> >>> <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8>
> >>> incq 0xa79011(%rip) # 14c6750
> >>> <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm>
> >>> incq 0xa79442(%rip) # 14c6b88
> >>> <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv>
> >>> incq 0x9c288b(%rip) # 140ffd8
> >>> <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl>
> >>> ...
> >>>
> >>> From the profile-use point of view, many of these counter updates are
> >>> redundant. Consider the following example:
> >>> void bar() {
> >>>   sum++;
> >>> }
> >>> void foo() {
> >>>   bar();
> >>> }
> >>>
> >>> FE based instrumentation needs to insert a counter update for the only
> >>> BB of bar():
> >>> bar: # @bar
> >>> # BB#0: # %entry
> >>> incq .L__llvm_profile_counters_bar(%rip)
> >>> incl sum(%rip)
> >>> retq
> >>>
> >>> It also needs to insert an update for the BB in function foo(). After
> >>> inlining bar() into foo(), the code is:
> >>> foo: # @foo
> >>> # BB#0: # %entry
> >>> incq .L__llvm_profile_counters_foo(%rip)
> >>> incq .L__llvm_profile_counters_bar(%rip)
> >>> incl sum(%rip)
> >>> retq
> >>>
> >>> If bar() is always inlined, the .L__llvm_profile_counters_bar(%rip)
> >>> update is redundant -- the counter won't help downstream optimizations.
> >>> On the other hand, if bar() is a large function that may not be suitable
> >>> for inlining at all callsites, this counter update is necessary in order
> >>> to produce more accurate profile data for the out-of-line instance of
> >>> the callee.
> >>>
> >>> If foo() is a hot function, the overhead of updating two counters can be
> >>> significant. This is especially bad for C++ programs, where there are
> >>> tons of small inline functions.
> >>>
> >>> There is another missed opportunity in FE based instrumentation. The
> >>> small functions' control flow can usually be simplified when they are
> >>> inlined into caller contexts. Once the control flow is simplified, many
> >>> counter updates can be eliminated. This is only possible for a
> >>> middle-end based late instrumenter. Defining a custom clean-up pass to
> >>> remove redundant counter updates is unrealistic: it cannot be done in a
> >>> sane way without destroying the profile integrity of either the
> >>> out-of-line or the inline instances of the callee.
> >>>
> >>> A much simpler and cleaner solution is to run a pre-inline pass that
> >>> inlines all the trivial inlines before instrumentation. In addition to
> >>> removing the unnecessary count updates for the inline instances, another
> >>> advantage of pre-inlining is that it provides context sensitive profiles
> >>> for these small inlined functions. This context sensitive profile can
> >>> further improve the PGO based optimizations. Here is a contrived example:
> >>> void bar(int n) {
> >>>   if (n & 1)
> >>>     do_sth1();
> >>>   else
> >>>     do_sth2();
> >>> }
> >>>
> >>> void caller() {
> >>>   int s = 1;
> >>>   for (; s < 100; s += 2)
> >>>     bar(s);
> >>>
> >>>   for (s = 102; s < 200; s += 2)
> >>>     bar(s);
> >>> }
> >>>
> >>> The direction of the branch inside bar will be completely opposite at
> >>> the two different callsites in ‘caller’. Without pre-inlining, the
> >>> branch probability will be 50-50, which is useless for later
> >>> optimizations. With pre-inlining, the profile will have the perfect
> >>> branch count for each callsite. The positive performance impact of
> >>> context sensitive profiling enabled by pre-inlining has been observed in
> >>> real world large C++ programs. Supporting full context sensitive
> >>> profiling is another way to solve this, but it would introduce large
> >>> additional runtime/memory overhead.
> >>>
> >>>
> >>> 1.2 Non-optimal placement of count update
> >>>
> >>> Another, much smaller slowdown factor is the placement of the counter
> >>> updates. Current front-end based instrumentation applies the
> >>> instrumentation to each front-end lexical construct. It also minimizes
> >>> the number of static instrumentation points, and it always instruments
> >>> the entry count of the CFG. This may result in higher dynamic
> >>> instruction counts. For example:
> >>>        BB0
> >>>         |  100
> >>>        BB1
> >>>   90  /   \  10
> >>>     BB2   BB3
> >>>   90  \   /  10
> >>>        BB4
> >>> In the above example, FE based instrumentation will always insert a
> >>> count update in BB0. The dynamic instrumentation count will be either
> >>> 110 (instrumenting bb0->bb1 and bb1->bb3) or 190 (instrumenting bb0->bb1
> >>> and bb1->bb2). A better instrumentation is to instrument bb1->bb2 and
> >>> bb1->bb3, where the dynamic instrumentation count is 100.
> >>>
> >>> Our experiments show that optimal placement based on edge hotness can
> >>> improve instrumented code performance by about 10%. While it's hard to
> >>> find the optimal placement of count updates, compiler heuristics can be
> >>> used to get a better placement. These heuristics can be based on static
> >>> profile prediction or user annotations (like __builtin_expect) to
> >>> estimate the relative edge hotness and put instrumentation on the less
> >>> hot edges. The initial late instrumentation has not fully implemented
> >>> this placement strategy yet. With that implemented, we expect even
> >>> better results than what is reported here. For real world programs,
> >>> another major source of slowdown is data races and false sharing of the
> >>> counter updates in highly threaded programs. Pre-inlining can alleviate
> >>> this problem, as the counters in the inline instances are no longer
> >>> shared. But a complete solution to the data race issue is orthogonal to
> >>> the problem we are trying to solve here.
> >>>
> >>>
> >>> 2. High Level Design
> >>>
> >>> We propose to perform a pre-profile inline pass before the PGO
> >>> instrumentation pass. Since the instrumentation pass runs after
> >>> inlining, it has to be done in the middle-end.
> >>>
> >>> (1) The pre-inline pass
> >>> We will invoke a pre-inline pass before the instrumentation. When PGO is
> >>> on, inlining is split into two passes:
> >>> * A pre-inline pass that is scheduled before the profile
> >>> instrumentation/annotation
> >>> * A post-inline pass, which is the regular inline pass after
> >>> instrumentation/annotation
> >>> By design, all callsites that are beneficial to inline without profile
> >>> data should be inlined in the pre-inline pass. This includes all
> >>> callsites where inlining shrinks code size. All the remaining callsites
> >>> will be left to the regular inline pass, when profile data is available.
> >>>
> >>> After pre-inlining, a CFG based profile instrumentation/annotation will
> >>> be done. A minimum weight spanning tree (MST) of the CFG is first
> >>> computed, then only the edges not in the MST are instrumented. The
> >>> counter update instructions are placed in the basic blocks.
> >>>
> >>> (2) Minimum Spanning Tree (MST) based instrumentation
> >>> A naive way of instrumenting is to insert a count update for every edge
> >>> in the CFG, which results in many redundant updates that make the
> >>> runtime very slow. Knuth [1] proposed a minimum spanning tree based
> >>> method: given a CFG, first compute a spanning tree; all edges that are
> >>> not in the MST are instrumented. In the profile-use compilation, the
> >>> counters are propagated (from the leaves of the spanning tree) to all
> >>> the edges. Knuth proved that this method inserts the minimum number of
> >>> instrumentation points. The MST based method only guarantees that the
> >>> number of static instrumentation points is minimized, not the dynamic
> >>> instrumentation count. To reduce the dynamic instrumentation count,
> >>> edges with potentially high counts are put into the MST first, so that
> >>> they have less chance of being instrumented, as sketched below.
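> >>>
> >>> As a minimal sketch of this selection step (illustrative only, not the
> >>> prototype code: the Edge struct and function name below are made up),
> >>> here is a Kruskal-style construction with a union-find that considers
> >>> hot edges first and instruments exactly the edges left out of the tree:
> >>>
> >>> #include <algorithm>
> >>> #include <cstdint>
> >>> #include <numeric>
> >>> #include <vector>
> >>>
> >>> struct Edge { unsigned Src, Dst; uint64_t Weight; };
> >>>
> >>> // Returns the indices of the edges to instrument: those NOT taken into
> >>> // the spanning tree. Hot edges are considered first, so they tend to
> >>> // join the tree and escape instrumentation.
> >>> std::vector<unsigned> selectInstrumentedEdges(unsigned NumBBs,
> >>>                                               const std::vector<Edge> &Edges) {
> >>>   std::vector<unsigned> Order(Edges.size()), Parent(NumBBs);
> >>>   std::iota(Order.begin(), Order.end(), 0u);
> >>>   std::iota(Parent.begin(), Parent.end(), 0u);
> >>>   std::sort(Order.begin(), Order.end(), [&](unsigned A, unsigned B) {
> >>>     return Edges[A].Weight > Edges[B].Weight;   // hottest first
> >>>   });
> >>>   auto Find = [&](unsigned X) {                 // union-find root lookup
> >>>     while (Parent[X] != X) X = Parent[X] = Parent[Parent[X]];
> >>>     return X;
> >>>   };
> >>>   std::vector<unsigned> ToInstrument;
> >>>   for (unsigned I : Order) {
> >>>     unsigned A = Find(Edges[I].Src), B = Find(Edges[I].Dst);
> >>>     if (A == B)
> >>>       ToInstrument.push_back(I);  // would close a cycle: gets a counter
> >>>     else
> >>>       Parent[A] = B;              // joins the spanning tree: no counter
> >>>   }
> >>>   return ToInstrument;
> >>> }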
> >>>
> >>>
> >>> 3. Experimental Results
> >>>
> >>> 3.1 Measurement of the efficiency of instrumentation
> >>> Other than the runtime of the instrumented binaries, a more direct
> >>> measurement of the instrumentation overhead is the sum of the raw
> >>> profile count values. Note that regardless of what kind of
> >>> instrumentation is used, the raw profile counts must be able to
> >>> reconstruct all the edge count values for the whole program. All raw
> >>> profile values are obtained by incrementing a counter variable by one,
> >>> so the sum of the raw profile count values is roughly the dynamic
> >>> instruction count of the instrumentation code. The lower the value, the
> >>> more efficient the instrumentation.
> >>>
> >>>
> >>> 3.2 LLVM instrumentation runtime for SPEC2006 C/C++ programs and
> >>> SPEC2K eon
> >>> The performance speedup is computed as (FE_instrumentation_runtime /
> >>> ME_instrumentation_runtime - 1); for example, a speedup of 100% means
> >>> the FE instrumented binary takes twice as long to run as the ME
> >>> instrumented one.
> >>>
> >>> We ran the experiments on all C/C++ programs in SPEC2006 and on 252.eon
> >>> from SPEC2000. For C programs, except for one outlier, 456.hmmer, there
> >>> are small ups and downs across different programs. Late instrumentation
> >>> improves hmmer a lot, but that is probably due to unrelated loop
> >>> optimizations (90% of its runtime is spent in one loop nest).
> >>>
> >>> For C++ programs, the performance impact of late instrumentation is
> >>> very large, as expected. The following table shows the results. For
> >>> some C++ programs, the speedup is huge. For example, in 483.xalancbmk,
> >>> late instrumentation speeds up performance by ~60%. Among all the SPEC
> >>> C++ programs, only 444.namd is an outlier -- it uses a lot of macros and
> >>> is a very C-like program.
> >>>
> >>> Program Speedup
> >>> 471.omnetpp 16.03%
> >>> 473.astar 5.00%
> >>> 483.xalancbmk 58.57%
> >>> 444.namd -0.90%
> >>> 447.dealII 60.47%
> >>> 450.soplex 8.20%
> >>> 453.povray 11.34%
> >>> 252.eon 35.33%
> >>> -------------------------
> >>> Geomean 21.01%
> >>>
> >>> 3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs
> >>> We also collected some statistics on the profiles generated by FE based
> >>> instrumentation and late instrumentation, namely:
> >>> 1. the number of functions that are instrumented,
> >>> 2. the resulting profile file size,
> >>> 3. the sum of raw count values mentioned earlier -- we use it to
> >>> measure the efficiency of the instrumentation.
> >>> The next table shows the ratios of each metric for late instrumentation
> >>> over the C++ programs, with FE based instrumentation as the base: column
> >>> (1) shows the ratios of instrumented functions; column (2) shows the
> >>> ratios of the profile file size; column (3) shows the ratios of the sum
> >>> of raw count values.
> >>>
> >>> (1) (2) (3)
> >>> 471.omnetpp 85.36% 110.26% 46.52%
> >>> 473.astar 64.86% 72.72% 63.13%
> >>> 483.xalancbmk 51.83% 56.11% 35.77%
> >>> 444.namd 75.36% 82.82% 85.77%
> >>> 447.dealII 43.42% 46.46% 26.75%
> >>> 450.soplex 71.80% 87.54% 51.19%
> >>> 453.povray 78.68% 83.57% 64.37%
> >>> 252.eon 72.06% 91.22% 30.02%
> >>> ----------------------------------------
> >>> Geomean 66.50% 76.36% 47.01%
> >>>
> >>>
> >>> With FE based instrumentation, profile count variables generated for
> >>> dead functions are not removed (such as __llvm_prf_names,
> >>> __llvm_prf_data, and __llvm_prf_cnts) from the data/text segments, nor
> >>> from the resulting profile. There is a recent patch that removes this
> >>> unused data for COMDAT functions, but that patch does not touch regular
> >>> functions. This is the main reason for the larger number of instrumented
> >>> functions and the larger profile file size with FE based
> >>> instrumentation. The reduction in the sum of raw count values is mainly
> >>> due to the elimination of redundant profile updates enabled by the
> >>> pre-inlining.
> >>>
> >>> For C programs, we observe a similar improvement in the profile size
> >>> (geomean ratio of 73.75%) and smaller improvements in the number of
> >>> instrumented functions (geomean ratio of 87.49%) and the sum of raw
> >>> count values (geomean of 82.76%).
> >>>
> >>>
> >>> 3.4 LLVM instrumentation runtime performance for Google internal C/C++
> >>> benchmarks
> >>>
> >>> We also used Google internal benchmarks (mostly typical C++
> >>> applications) to measure the relative performance between FE based
> >>> instrumentation and late instrumentation. The following table shows the
> >>> speedup of late instrumentation vs. FE based instrumentation. Note that
> >>> C++_benchmark01 is a very large multi-threaded C++ program; late
> >>> instrumentation sees a 4x speedup on it. Speedups larger than 3x are
> >>> also seen in many other programs.
> >>>
> >>> C++_benchmark01 416.98%
> >>> C++_benchmark02 6.29%
> >>> C++_benchmark03 22.39%
> >>> C++_benchmark04 28.05%
> >>> C++_benchmark05 2.00%
> >>> C++_benchmark06 675.89%
> >>> C++_benchmark07 359.19%
> >>> C++_benchmark08 395.03%
> >>> C_benchmark09 15.11%
> >>> C_benchmark10 5.47%
> >>> C++_benchmark11 5.73%
> >>> C++_benchmark12 2.31%
> >>> C++_benchmark13 87.73%
> >>> C++_benchmark14 7.22%
> >>> C_benchmark15 -0.51%
> >>> C++_benchmark16 59.15%
> >>> C++_benchmark17 358.82%
> >>> C++_benchmark18 861.36%
> >>> C++_benchmark19 29.62%
> >>> C++_benchmark20 11.82%
> >>> C_benchmark21 0.53%
> >>> C++_benchmark22 43.10%
> >>> ---------------------------
> >>> Geomean 83.03%
> >>>
> >>>
> >>> 3.5 Statistics of LLVM profiles for Google internal benchmarks
> >>>
> >>> The following table shows the profile statistics for the Google
> >>> internal benchmarks, using the same three metrics as in section 3.3:
> >>> (1) (2) (3)
> >>> C++_benchmark01 36.84% 40.29% 2.32%
> >>> C++_benchmark02 39.20% 40.49% 42.39%
> >>> C++_benchmark03 39.37% 40.65% 23.24%
> >>> C++_benchmark04 39.13% 40.68% 17.70%
> >>> C++_benchmark05 36.58% 38.27% 51.08%
> >>> C++_benchmark06 29.50% 27.87% 2.87%
> >>> C++_benchmark07 29.50% 27.87% 1.73%
> >>> C++_benchmark08 29.50% 27.87% 4.17%
> >>> C_benchmark09 53.95% 68.00% 11.18%
> >>> C_benchmark10 53.95% 68.00% 31.74%
> >>> C++_benchmark11 36.40% 37.07% 46.12%
> >>> C++_benchmark12 38.44% 41.90% 73.59%
> >>> C++_benchmark13 39.28% 42.72% 29.56%
> >>> C++_benchmark14 38.59% 42.20% 13.42%
> >>> C_benchmark15 57.45% 48.50% 66.99%
> >>> C++_benchmark16 36.86% 42.18% 16.53%
> >>> C++_benchmark17 37.82% 39.77% 13.68%
> >>> C++_benchmark18 37.82% 39.77% 7.96%
> >>> C++_benchmark19 37.52% 40.46% 1.85%
> >>> C++_benchmark20 32.37% 30.44% 19.69%
> >>> C_benchmark21 37.63% 40.42% 88.81%
> >>> C++_benchmark22 36.28% 36.92% 21.62%
> >>> --------------------------------------------------
> >>> Geomean 38.22% 39.96% 15.58%
> >>>
> >>>
> >>> 4. Implementation Details:
> >>>
> >>> We need to add new option(s) for the alternative PGO instrumentation
> >>> pass in the middle end. They can take one of the following forms:
> >>>
> >>> (1) Completely new options on par with the current PGO options:
> >>> -fprofile-late-instr-generate[=<profile_file>]? for PGO instrumentation,
> >>> and -fprofile-late-instr-use[=<profile_file>]? for PGO use.
> >>> (2) Or, late instrumentation can be turned on with an additional
> >>> option, -fprofile-instr-late, alongside the current PGO options, i.e.
> >>> -fprofile-instr-late -fprofile-instr-generate[=<profile_file>]? for PGO
> >>> instrumentation, and -fprofile-instr-late
> >>> -fprofile-instr-use[=<profile_file>]? for PGO use (see the example after
> >>> this list).
> >>> (3) Alternatively to (2), only keep the -fprofile-instr-late option for
> >>> PGO instrumentation, and add a magic tag to the profile so that FE based
> >>> and late instrumented profiles can be automatically detected by the
> >>> profile loader in the PGO use compilation. This requires a slight
> >>> profile format change.
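> >>>
> >>> For illustration, a full cycle with form (2) might look like the
> >>> following (hypothetical invocations: -fprofile-instr-late is only
> >>> proposed here; the other flags and llvm-profdata already exist, and
> >>> prog.cc / training-input are placeholders):
> >>>
> >>> # Build with late instrumentation:
> >>> clang++ -O2 -fprofile-instr-late \
> >>>   -fprofile-instr-generate=prog.profraw prog.cc -o prog
> >>> # Run a training workload, then merge and consume the profile:
> >>> ./prog < training-input
> >>> llvm-profdata merge -o prog.profdata prog.profraw
> >>> clang++ -O2 -fprofile-instr-late \
> >>>   -fprofile-instr-use=prog.profdata prog.cc -o prog-opt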
> >>>
> >>> In our prototype implementation, two new passes are added at the
> >>> beginning of PassManagerBuilder::populateModulePassManager(), namely
> >>> PreProfileInlinerPass and PGOLateInstrumentationPass.
> >>>
> >>>
> >>> 4.1 Pre-inline pass:
> >>>
> >>> It is controlled by the back-end options "-preinline" and
> >>> "-disable-preinline". If the user specifies any of
> >>> "-fprofile-late-instr-{generate|use}", the option "-mllvm -preinline"
> >>> will be automatically inserted by the driver. To disable the pre-inliner
> >>> when late instrumentation is enabled, use "-mllvm -disable-preinline".
> >>>
> >>> For now, only minimal tuning has been done for the pre-inliner; it
> >>> simply adjusts the inline threshold: if -Oz is specified, the threshold
> >>> is set to 25; otherwise, it is 75.
> >>>
> >>> The following clean-up passes are added to the PassManager, right after
> >>> the PreProfileInline pass (see the sketch after this list):
> >>> createEarlyCSEPass()
> >>> createJumpThreadingPass()
> >>> createCorrelatedValuePropagationPass()
> >>> createCFGSimplificationPass()
> >>> createInstructionCombiningPass()
> >>> createGVNPass(DisableGVNLoadPRE)
> >>> createPeepholePass()
> >>> Some of them might not be necessary.
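> >>>
> >>> A minimal sketch of how the prototype might schedule this (assuming the
> >>> legacy pass manager; the two PGO pass creators are hypothetical names
> >>> from this proposal, not existing LLVM APIs, and header locations for the
> >>> clean-up pass creators may differ across revisions; the peephole pass is
> >>> omitted):
> >>>
> >>> #include "llvm/IR/LegacyPassManager.h"
> >>> #include "llvm/Transforms/Scalar.h"
> >>> using namespace llvm;
> >>>
> >>> // Hypothetical creators provided by the prototype patches:
> >>> Pass *createPreProfileInlinerPass();
> >>> ModulePass *createPGOLateInstrumentationPass();
> >>>
> >>> static void addPGOLatePasses(legacy::PassManagerBase &MPM,
> >>>                              bool DisableGVNLoadPRE) {
> >>>   MPM.add(createPreProfileInlinerPass());  // inline trivial callsites
> >>>   // Clean-up passes scheduled right after pre-inlining:
> >>>   MPM.add(createEarlyCSEPass());
> >>>   MPM.add(createJumpThreadingPass());
> >>>   MPM.add(createCorrelatedValuePropagationPass());
> >>>   MPM.add(createCFGSimplificationPass());
> >>>   MPM.add(createInstructionCombiningPass());
> >>>   MPM.add(createGVNPass(DisableGVNLoadPRE));
> >>>   // Instrument on the simplified CFG:
> >>>   MPM.add(createPGOLateInstrumentationPass());
> >>> }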
> >>>
> >>> 4.2 Late Instrumentation Pass:
> >>> The late instrumentation runs right after the pre-inline pass and its
> >>> cleanup passes. It is controlled by the opt options "-pgo-late-instr-gen"
> >>> and "-pgo-late-instr-use". For the "-pgo-late-instr-use" option, the
> >>> driver will provide the profile name.
> >>> For "-pgo-late-instr-gen", a pass that calls createInstrProfilingPass()
> >>> is also added to the PassManager to lower the instrumentation intrinsics.
> >>>
> >>> PGOLateInstrumentation is a module pass that applies the instrumentation
> >>> to each function via the class PGOLateInstrumentationFunc. For each
> >>> function, it performs the following steps:
> >>> 1. First collect all the CFG edges and assign an estimated weight to
> >>> each edge. Critical edges and back-edges are assigned high weights. One
> >>> fake node and a few fake edges (from the fake node to the entry node,
> >>> and from all the exit nodes to the fake node) are also added to the
> >>> worklist.
> >>> 2. Construct the MST. Edges with higher weights are put into the MST
> >>> first, unless doing so would form a cycle.
> >>> 3. Traverse the CFG and compute the CFG hash using a CRC32 of the
> >>> index of each BB (see the sketch below).
> >>> The above three steps are the same for profile-generate and profile-use
> >>> compilation.
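> >>>
> >>> A sketch of the hash in step 3 (illustrative: a plain bitwise CRC32
> >>> over each BB index; the real implementation is free to choose any
> >>> checksum routine):
> >>>
> >>> #include <cstdint>
> >>> #include <vector>
> >>>
> >>> // CRC32 (reflected polynomial 0xEDB88320) over the 4 bytes of each
> >>> // BB index, visited in traversal order.
> >>> uint32_t computeCFGHash(const std::vector<unsigned> &BBIndices) {
> >>>   uint32_t CRC = 0xFFFFFFFFu;
> >>>   for (unsigned Idx : BBIndices)
> >>>     for (int Byte = 0; Byte < 4; ++Byte) {
> >>>       CRC ^= (Idx >> (8 * Byte)) & 0xFFu;
> >>>       for (int Bit = 0; Bit < 8; ++Bit)
> >>>         CRC = (CRC >> 1) ^ (0xEDB88320u & (0u - (CRC & 1u)));
> >>>     }
> >>>   return ~CRC;
> >>> }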
> >>>
> >>> Next, for the profile-generation compilation, all the edges that are
> >>> not in the MST are instrumented. If an edge is a critical edge, it is
> >>> split first. The actual instrumentation generates an
> >>> Intrinsic::instrprof_increment() call in the instrumented BB. This
> >>> intrinsic will be lowered by the pass created by
> >>> createInstrProfilingPass().
> >>>
> >>> For the profile-use compilation, first read the counters and the CFG
> >>> hash from the profile file. If the CFG hash matches, populate the
> >>> counters for all the edges in reverse topological order of the MST, as
> >>> sketched below. Once all the edge counts are available, set the branch
> >>> weight metadata for the IR instructions with multiple branch targets.
> >>> Also apply the cold/hot function attributes based on the function level
> >>> counts.
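> >>>
> >>> A minimal sketch of the propagation (illustrative only; it reuses the
> >>> Edge struct and headers from the sketch in section 2, assumes the counts
> >>> of the instrumented non-tree edges were already read from the profile,
> >>> and uses a leaves-inward fixed-point instead of an explicit reverse
> >>> topological traversal):
> >>>
> >>> // With the fake exit->entry edge included, inflow equals outflow at
> >>> // every BB, so a BB with exactly one unknown incident edge determines
> >>> // that edge's count; repeating this resolves the whole spanning tree.
> >>> void populateCounts(unsigned NumBBs, const std::vector<Edge> &Edges,
> >>>                     std::vector<bool> &Known,
> >>>                     std::vector<uint64_t> &Count) {
> >>>   for (bool Changed = true; Changed;) {
> >>>     Changed = false;
> >>>     for (unsigned BB = 0; BB < NumBBs; ++BB) {
> >>>       int Unknown = -1;
> >>>       int64_t Balance = 0;  // known inflow minus known outflow
> >>>       for (unsigned I = 0; I < Edges.size(); ++I) {
> >>>         if (Edges[I].Src == Edges[I].Dst) continue;  // self-loops cancel
> >>>         if (Edges[I].Src != BB && Edges[I].Dst != BB) continue;
> >>>         if (!Known[I]) {
> >>>           if (Unknown != -1) { Unknown = -2; break; } // >1 unknown: skip
> >>>           Unknown = (int)I;
> >>>         } else {
> >>>           Balance += Edges[I].Dst == BB ? (int64_t)Count[I]
> >>>                                         : -(int64_t)Count[I];
> >>>         }
> >>>       }
> >>>       if (Unknown >= 0) {  // exactly one unknown incident edge: solve it
> >>>         Count[Unknown] = (uint64_t)(Balance < 0 ? -Balance : Balance);
> >>>         Known[Unknown] = true;
> >>>         Changed = true;
> >>>       }
> >>>     }
> >>>   }
> >>> }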
> >>>
> >>>
> >>> 4.3 Profile Format:
> >>>
> >>> The late instrumentation profile is mostly the same as the one from
> >>> front-end instrumentation. The differences are:
> >>> * Function checksums are different.
> >>> * Function entry counts are no longer available.
> >>> For the llvm-profdata utility, the option -lateinstr needs to be used
> >>> to differentiate FE based and late instrumentation profiles, unless a
> >>> magic tag is added to the profile.
> >>>
> >>>
> >>> 5. References:
> >>> [1] Donald E. Knuth and Francis R. Stevenson. "Optimal measurement
> >>> points for program frequency counts." BIT Numerical Mathematics,
> >>> Volume 13, Issue 3 (1973), pp. 313-322.
> >>>
> >>
> >
>