[llvm-dev] RFC: PGO Late instrumentation for LLVM

Rong Xu via llvm-dev llvm-dev at lists.llvm.org
Fri Aug 7 17:03:37 PDT 2015

Instrumentation based Profile Guided Optimization (PGO) is a compiler
technique that leverages important program runtime information, such as
precise edge counts and frequent value information, to make frequently
executed code run faster. It's proven to be one of the most effective ways
to improve program performance.

An important design point of PGO is to decide where to place the
instrumentation. In current LLVM PGO, the instrumentation is done in the
Clang Front-End (will be referred as FE based instrumentation). Doing this
early in the front-end gives better coverage information as the there are
more precise line number information. The resulted profile also has
relatively high tolerance to compiler changes. All the compiler change
after the instrumentation point will not lead to mismatched IR that
invalidates the profile.

On the other hand, doing this too early gives the compiler fewer
opportunities to optimize the code before instrumentation. It has
significant impact on instrumentation runtime performance. In addition, it
tends to produce larger binary and  profile data file.  Our internal C++
benchmarks shows that FE based instrumentation degrades the performance
(compared to non-instrumented version) by 58.3%, and in some extreme cases,
the application speed/throughput decreases by 95% (21x runtime slowdown).

Running the instrumented binary too slow is not desirable in PGO for the
following reasons:
   * This slows down already lengthy build time. In the worst case, the
instrumented binary is so slow that it fails to run a representative
workload, because slow execution can leads to more time-outs in many server
programs. Other typical issues include: text size too big, failure to link
instrumented binaries, memory usage exceeding system limits.
   * Slow runtime affects the program behavior. Real applications sometimes
monitor the program runtime and have different execution path when the
program run too slow. This would defeat the underlying assumption of PGO
and make it less effective.

This work proposes an option to turn on middle-end based instrumentation
(new) that aims to speed up instrumentation runtime. The new
instrumentation is referred to as ME based instrumentation in this
document. Our experimental results show that ME instrumentation can speed
up the instrumentation by 80% on average for typical C++ programs. Here are
the two main design objectives:
   * Co-existing with FE Instrumenter: We do not propose to replace the FE
based instrumentation because FE based instrumentation has its advantages
and applications. User can choose which phase to do instrumentation via
command line options.
   * PGO Runtime Support Sharing: The ME instrumenter will completely
re-use the existing PGO’s runtime support.

1. FE Based Instrumentation Runtime Overhead Analysis

Instrumented binaries are expected to run slower due to instrumentation
code inserted. With FE based instrumentation, the overhead is especially
high and runtime slowdown can be unacceptable in many cases.  Further
analysis shows that there are 3  important factors  contributing to FE
instrumentation slowdown :
   * [Main] Redundant counter updates of inlined functions. C++ programs
can introduce large abstraction penalties by using lots of small inline
functions (assignment operators, getters, setters, ctors/dtors etc).
Overhead of instrumenting those small functions can be very large, making
training runs too slow and in some cases to usable;
   * Non-optimal placement of the count updates;
   * A third factor is related value profiling (to be turned on in the
future). Small and hot callee functions taking function pointer (callbacks)
can incur  overhead due to indirect call target profiling.

1.1 Redundant Counter Update

If checking the assembly of the instrumented binary generated by current
LLVM implementation, we can find many sequence of consecutive 'incq'
instructions that updating difference counters in the same basic block. As
an example that extracted from real binary:
 incq   0xa91d80(%rip)        # 14df4b8
 incq   0xa79011(%rip)        # 14c6750
 incq   0xa79442(%rip)        # 14c6b88
 incq   0x9c288b(%rip)        # 140ffd8

>From profile use point of view, many of these counter update are redundant.
Considering the following example:
void bar(){
void foo() {

FE based instrumentation needs to insert counter update for the only BB of
the bar().
bar:                                    # @bar
# BB#0:                                 # %entry
       incq    .L__llvm_profile_counters_bar(%rip)
       incl    sum(%rip)

It also need to insert the update the BB in function foo().  After inlining
bar to foo(), the code is:
foo:                                    # @foo
# BB#0:                                 # %entry
       incq    .L__llvm_profile_counters_foo(%rip)
       incq    .L__llvm_profile_counters_bar(%rip)
       incl    sum(%rip)

If bar() should be always inlined, .L__llvm_profile_counters_bar(%rip) is
redundant -- the counter won't help downstream optimizations. On the other
hand, if bar() is a large function and may not be suitable to be inlined
for all callsites, this counter updated is necessary in order to produce
more accurate profile data for the out-of-line instance of the callee.

If foo() is a hot function, the overhead of updating two counters can be
significant. This is especially bad for C++ program where there are tons of
small inline functions.

There is another missing opportunity in FE based instrumentation. The small
functions’ control flow can usually be simplified when they are inlined
into caller contexts. Once the control flow is simplified, many counter
updates can therefore be eliminated. This is only possible for a middle end
based late instrumenter. Defining a custom clean-up pass to remove
redundant counter update is unrealistic and cannot be done in a sane way
without destroying the profile integrity of neither the out-of-line nor
inline instances of the callee.

A much simpler and cleaner solution is to do a pre-inline pass to inline
all the trivial inlines before instrumentation.  In addition to removing
the unnecessary count updates for the inline instances,  another advantage
of pre-inline is to  provide context sensitive profile for these small
inlined functions. This context senstive profile can further improve the
PGO based optimizations. Here is a contrived example:
void bar (int n) {
  if (n&1)

void caller() {
  int s = 1;
  for (; s<100; s+=2)

  for (s = 102; s< 200; s+=2)

The direction of the branch inside bar will be totally opposite between two
different callsites in ‘caller’. Without pre-inlining, the branch
probability will be 50-50 which will be useless for later optimizations.
With pre-inlining, the profile will have the perfect branch count for each
callsite. The positive performance impact of context sensitive profiling
due to pre-inlining has been observed in real world large C++ programs.
Supporting context sensitive profiling is another way to solve this, but it
will introduce large additional runtime/memory overhead.

1.2 Non-optimal placement of count update

Another much smaller showdown factor is the placement of the counter
updates. Current front-end based instrumentation applies the
instrumentation to each front-end lexical construct. It also minimizes the
number of static instrumentations. Note that it always instruments the
entry count of the CFG. This may result in higher dynamic instruction
counts. For example,
     | 100
90 /   \ 10
  BB2  BB3
90 \   / 10
Like the the above example, FE based instrumentation will always insert
count update in BB0.  The dynamic instrumentation count will be either 110
(Instrument bb0->bb1 and bb1->bb2) or 190 (bb0->bb1 and bb1->bb3). A better
instrumentation is to instrument (bb1->bb2 and bb1->bb3) where the dynamic
instrumentation count is 100.

Our experimental shows that the optimal placement based on edge hotness can
improve instrumented code performance by about 10%.  While it’s hard to
find the optimal placement of count update,  compiler heuristics can be
used the get the better placement. These heuristics  can be based on static
profile prediction or user annotations (like __buildin_expect) to  estimate
the relative edge hotness and put instrumentations on the less hot edges.
The initial late instrumentation has not fully implemented this placement
strategy yet.  With that implemented, we expect even better results than
what is reported here. For real world programs, another major source of the
slowdown is the data racing and false sharing of the counter update for
highly threaded programs. Pre-inlining can alleviate this problem as the
counters in the inline instances are not longer shared. But the complete
solution to the data racing issue is orthogonal to the problem we try to
solve here.

2. High Level Design

We propose to perform a pre-profile inline pass before the PGO
instrumentation pass. Since the instrumentation pass is after inine, it has
to be done in the middle-end.

(1) The pre-inline pass
We will invoke a pre-inline pass before the instrumentation. When PGO is
on, the inlining will be split into two passes:
   * A pre-inline pass that is scheduled before the profile
   * A post-inline pass which is the regular inline pass after
By design, all beneficial callsites without requiring profile data should
be inlined in the pre-inline pass. It includes all callsites that will
shrink code size after inlining. All the remaining callsites will be left
to the regular inline pass when profile data is available.

After pre-inline, a CFG based profile instrumentation/annotation will be
done. A minimum weight spanning tree (MST) in CFG is first computed, then
only the edges not in the MST will be instrumented. The counter update
instructions are placed in the basic blocks.

(2) Minimum Spanning Tree (MST) based instrumentation
A native way of instrumentation is to insert a count update for every edge
in CFG which will result in  too many redundant updates that makes the
runtime very slow. Knuth [1] proposed a minimum spanning tree based method:
given a CFG, first compute a spanning tree. All edges that not in the MST
will be instrumented. In the profile use compilation, the counters are
populated (from the leaf of the spanning tree) to all the edges. Knuth
proved this method inserts the minimum number of instrumentations. MST
based method only guarantees the number static instrumentation are
minimized, not the dynamic instance of instrumentation. To reduce the
number of dynamic instrumentation, edges of potentially high counts will be
put into MST first so that they will have less chance to be instrumented.

3. Experimental Results

3.1 Measurement of the efficiency of instrumentation
Other than the runtime of the instrumented binaries, a more direct
measurement of the instrumentation overhead is the the sum of the raw
profile count values. Note that regardless what kind of instrumentations
are used, the raw profile count should be able to reconstruct all the edge
count values for the whole program. All raw profile value are obtained via
incrementing the counter variable value by one. The sum of the raw profile
count value is roughly the dynamic instruction count of the instrumented
code. The lower of the value, the more efficient of the instrumentation.

3.2 LLVM instrumentations runtime for SPEC2006 C/C++ programs and SPEC2K eon
The performance speedup is computed by (FE_instrumentation_runtime /
ME_instrumentation_runtime - 1)

We run the experiments on all C/C++ programs in SPEC2006 and 252.eon from
SPEC2000. For C programs, except for one outlier 456.hmmer, there are small
ups and downs across different programs. Late instrumentation improves
hmmer a lot, but it is probably due to unrelated loop optimizations (90% of
runtime spent in one loop nest).

For C++ programs, the performance impact of late instrumentation is very
large, which is as expected. The following table shows the result.  For
some C++ program, the time speedup is huge. For example, in  483.xalancbmk,
late instrumentation speeds up performance by ~60%.  Among all the SPEC C++
programs, only 444.namd is an outlier -- it uses a lot of macros and is a
very C like program.

Program           Speedup
471.omnetpp       16.03%
473.astar          5.00%
483.xalancbmk     58.57%
444.namd          -0.90%
447.dealII        60.47%
450.soplex         8.20%
453.povray        11.34%
252.eon           35.33%
Geomean           21.01%

3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs
We also collect some statistic of the profiles generated by FE based
instrumentation and late instrumentation, namely, the following information:
   1. the number of functions that being instrumented,
   2. the result profile file size,
   3. the sum of raw count values that was mentioned earlier -- we used it
to measure the efficiency of the instrumentation.
Next table shows the ratios of the each metrics by late instrumentation for
the C++ programs, with FE based instrumentation as the base : column (1)
shows the ratios of instrumented functions; column (2) shows the ratios of
the profile file size; column (3) shows the ratios of the sum of raw count

                (1)       (2)       (3)
471.omnetpp    85.36%   110.26%    46.52%
473.astar      64.86%    72.72%    63.13%
483.xalancbmk  51.83%    56.11%    35.77%
444.namd       75.36%    82.82%    85.77%
447.dealII     43.42%    46.46%    26.75%
450.soplex     71.80%    87.54%    51.19%
453.povray     78.68%    83.57%    64.37%
252.eon        72.06%    91.22%    30.02%
Geomean        66.50%    76.36%    47.01%

For FE based instrumentation, profile count variables generated for the
dead functions will not be removed (like __llvm_prf_names, __llvm_prf_data,
 and __llvm_prf_cnts) from the data/text segment, nor in the result
profile. There is a recent patch that removes these unused data for COMDAT
functions, but that patch won’t touch regular functions. This is the main
reason for the larger number of instrumented functions and larger profile
file size for the FE based instrumentation. The reduction of the sum of raw
count values is mainly due to the elimination of redundant profile updates
enabled by the pre-inlining.

For C programs, we observe similar improvement in the profile size (geomean
ratio of 73.75%) and smaller improvement in the number of instrumented
functions (geomean ratio of 87.49%) and the sum of raw count values
(geomean of 82.76%).

3.4 LLVM instrumentations runtime performance for Google internal C/C++

We also use Google internal benchmarks (mostly typical C++ applications) to
measure the relative performance b/w FE based instrumentation and late
instrumentation.  The following table shows the speedup of late
instrumentation vs FE based instrumentation. Note that C++benchmark01 is a
very large multi-threaded C++ program. Late instrumentation sees 4x
speedup. Larger than 3x speedups are also seen in many other programs.

C++_bencharmk01    416.98%
C++_bencharmk02      6.29%
C++_bencharmk03     22.39%
C++_bencharmk04     28.05%
C++_bencharmk05      2.00%
C++_bencharmk06    675.89%
C++_bencharmk07    359.19%
C++_bencharmk08    395.03%
C_bencharmk09       15.11%
C_bencharmk10        5.47%
C++_bencharmk11      5.73%
C++_bencharmk12      2.31%
C++_bencharmk13     87.73%
C++_bencharmk14      7.22%
C_bencharmk15       -0.51%
C++_bencharmk16     59.15%
C++_bencharmk17    358.82%
C++_bencharmk18    861.36%
C++_bencharmk19     29.62%
C++_bencharmk20     11.82%
C_bencharmk21        0.53%
C++_bencharmk22     43.10%
Geomean             83.03%

3.5 Statistics of LLVM profiles for Google internal benchmarks

The following shows the profile statics using Google internal benchmarks.
                         (1)       (2)       (3)
C++_bencharmk01         36.84%    40.29%     2.32%
C++_bencharmk02         39.20%    40.49%    42.39%
C++_bencharmk03         39.37%    40.65%    23.24%
C++_bencharmk04         39.13%    40.68%    17.70%
C++_bencharmk05         36.58%    38.27%    51.08%
C++_bencharmk06         29.50%    27.87%     2.87%
C++_bencharmk07         29.50%    27.87%     1.73%
C++_bencharmk08         29.50%    27.87%     4.17%
C_bencharmk09           53.95%    68.00%    11.18%
C_bencharmk10           53.95%    68.00%    31.74%
C++_bencharmk11         36.40%    37.07%    46.12%
C++_bencharmk12         38.44%    41.90%    73.59%
C++_bencharmk13         39.28%    42.72%    29.56%
C++_bencharmk14         38.59%    42.20%    13.42%
C_bencharmk15           57.45%    48.50%    66.99%
C++_bencharmk16         36.86%    42.18%    16.53%
C++_bencharmk17         37.82%    39.77%    13.68%
C++_bencharmk18         37.82%    39.77%     7.96%
C++_bencharmk19         37.52%    40.46%     1.85%
C++_bencharmk20         32.37%    30.44%    19.69%
C_bencharmk21           37.63%    40.42%    88.81%
C++_bencharmk22         36.28%    36.92%    21.62%
Geomean                 38.22%    39.96%    15.58%

4. Implementation Details:

We need to add new option(s) for the alternative PGO instrumentation pass
in the middle end. It can in one of the following forms:

   (1) Complete new options that are on par with current PGO options:
-fprofile-late-instr-generate[=<profile_file>]? for PGO Instrumentation,
and -fprofile-late-instr-use[=<profile_file>]? for PGO USE.
   (2) Or, late instrumentation can be turned on with an additional option
-fprofile-instr-late with current PGO options. I. e. -fprofile-instr-late
-fprofile-instr-generate[=<profile_file>]? for PGO instrumentation, and
-fprofile-instr-late -fprofile-instr-use[=<profile_file>]? for PGO use.
   (3) Alternatively to (2), only keep -fprofile-instr-late option in PGO
instrumentation. Adding a magic tag in profile so that FE based profile and
late instrumented profile can be automatically detected by profile loader
In PGO use compilation. This requires a slight profile format change.

In our prototype implementation, two new passes are added in the beginning
of PassManagerBuilder::populateModulePassManager(), namely
PreProfileInlinerPass and PGOLateInstrumentationPass.

4.1 Pre-inline pass:

It is controlled by back-end option "-preinline" and "-disable-preinline".
If the user specifies any llvm option of
"-fprofile-late-instr-{generate|use}, option "-mllvm -preinline" will be
automatically inserted in the driver.. To disable the pre-inliner when late
instrumentation is enabled, use option "-mllvm -disable-preinline".

For now, only minimum tuning is done for the pre-inliner, which simply
adjusts the inline threshold: If -Oz is specified, the threshold is set to
25. Otherwise, it is 75.

The following clean up passes are added to PassManager, right after the
PreProfileInline pass:
Some of them might not be necessary.

4.2 Late Instrumentation Pass:
The late instrumentation is right after the pre-inline pass and it's
cleanup passes. It is controlled by opt option "-pgo-late-instr-gen" and
"-pgo-late-instr-use". For "-pgo-late-instr-use" option, the driver will
provide the profile name.
For "-pgo-late-instr-gen", a pass that calls createInstrProfilingPass() is
also added to PassManager to lower the instrumentation intrinsics

PGOLateInstrumeatnion is a module pass that applies the instrumentation to
each function by class PGOLateInstrumentationFunc. For each function,
perform the following steps:
   1. First collect all the CFG edges. Assign an estimated weight to each
edge. Critical edges and back-edges are assigned to high value of weights.
One fake node and a few fake edges (from the fake node to the entry node,
and from all the exit nodes to the fake node) are also added to the
   2. Construct the MST. The edges with the higher weight will be put to
MST first, unless it forms a cycle.
   3. Traverse the CFG and compute the CFG hash using CRC32 of the index of
each BB.
The above three steps are the same for profile-generate and profile-use

In the next step, for profile-generation compilation, all the edges that
not in the MST are instrumented. If this is a critical edge, split the edge
first. The actual instrumentation is to generate
Intrinsic::instrprof_increment() in the instrumented BB. This intrinsic
will be lowed by pass createInstrProfilingPass().

In the next step, for profile-generation compilation, all the edges that
not in the MST are instrumented. If this is a critical edge, split the edge
first. The actual instrumentation is to generate
Intrinsic::instrprof_increment() in the instrumented BB. This intrinsic
will be lowed by pass createInstrProfilingPass().

For -fprofile-use compilation, first read in the counters and the CFG hash
from the profile file. If the CFG hash matches, populate the counters to
all the edges in reverse topological order of the MST. Once having all the
edge counts, set the branch weights metadata for the IR having multiple
branches. Also apply the cold/hot function attributes based on function
level counts.

4.3 Profile Format:

The late instrumentation profile is mostly the same as the one from
front-end instrument-ion. The difference is
   * Function checksums are different.
   * Function entry counts are no longer available.
For llvm-profdata utility, options -lateinstr needs to be used to
differentiate FE based and late instrumentation profiles, unless a magic
tag is added to the profile.

5. References:
[1] Donald E. Knuth, Francis R. Stevenson. Optimal measurement of points
for program frequency counts. BIT Numerical Mathematics 1973, Volume 13,
Issue 3, pp 313-322
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150807/f7e69f80/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PGOLateinstrumentationforLLVM.pdf
Type: application/pdf
Size: 273556 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150807/f7e69f80/attachment-0001.pdf>

More information about the llvm-dev mailing list