<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Aug 9, 2015 at 6:23 PM, Sean Silva <span dir="ltr"><<a href="mailto:chisophugis@gmail.com" target="_blank">chisophugis@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span class="">On Sat, Aug 8, 2015 at 6:31 AM, Xinliang David Li <span dir="ltr"><<a href="mailto:davidxl@google.com" target="_blank">davidxl@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span>On Fri, Aug 7, 2015 at 10:56 PM, Sean Silva <<a href="mailto:chisophugis@gmail.com" target="_blank">chisophugis@gmail.com</a>> wrote:<br>

> Accidentally sent to uiuc server.<br>

><br>

><br>

> On Fri, Aug 7, 2015 at 10:49 PM, Sean Silva <<a href="mailto:chisophugis@gmail.com" target="_blank">chisophugis@gmail.com</a>> wrote:<br>

>><br>

>> Can you compare your results with another approach: simply do not<br>

>> instrument the top 1% hottest functions (by function entry count)? If this<br>

>> simple approach provides most of the benefits (my measurements on one<br>

>> codebase I tested show that it would eliminate over 97% of total function<br>

</span>>> counts), we may be able to use a simpler approach.AFor large C++ programs, the<br>

<br>

For static compiler, this is not possible. It also seem to defeat the<br>

purpose of PGO -- hottest functions are those which need profile<br>

guidance the most.<br></blockquote><div><br></div></span><div>In the program I looked at the top 1% were just trivial getters and constructors or similar. We should already be "getting these right".</div><div><br></div><div>Stuff like:</div><div><br></div><div>class Foo {</div><div>....</div><div>Foo(int bar) : m_bar(bar) {}</div><div>int getBar() { return m_bar; }</div><div>...</div><div>};</div><div><br></div><div>Are the results different for your codebases? Have you tried something like simply not instrumenting the hottest 1% or 0.5% of functions? (maybe restrict the instrumentation skipping to functions of just a single BB with less than, say, 10 instructions).</div><div><br></div><div>Rong's approach is quite sophisticated; I'm just interested on getting a sanity check against a "naive" approach to see how much the sophisticated approach is buying us.</div></div></div></div></blockquote><div><br></div><div>I've seen plenty of this kind of code patterns in the benchmarks. I did not measure if they are the top 1% of hottest functions, but I'm not surprised to see so.</div><div>Your suggesting of not instrumenting this small single BB function is interesting. I'm pretty sure it will cut a lot of the overhead.</div><div>But will not destroy the integrity of the profile, (for example, it will harder to be used in the coverage test).</div><div><br></div><div>I'm curious why you think my approach is sophisticated. It's pretty simple in high level: do a preinline pass to inline all the trivial inlines and then do the instrumentation. This way the instrumentation binary is faster and result profile is smaller,  and complete.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span><br>

>><br>

>> The biggest thing I notice about this proposal is that although the focus<br>

>> of the proposal is on reducing profiling overhead, it requires using a<br>

>> different pass pipeline during *both* the "instr-generate" compilation and<br>

>> the "instr-use" compilation. Therefore it is more than simply a reduction in<br>

>> profiling overhead -- it can also change performance even ignoring adding<br>

>> the profiling information in the IR. I think it would be interesting to test<br>

>> the modified pass pipeline even in the absence of profile information to<br>

>> better understand its effects in isolation (for example, maybe it would be<br>

>> good to add these passes during -O3 regardless of whether we are doing PGO).<br>

>><br>

<br>

</span>The pipeline change is very PGO specific. I expect it has very little<br>

impact on regular compilations:<br>

1) LLVM's bottom up inliner is already iterative.<br>

2) The performance impact (on instrumented build) can be 4x large --<br>

which is unlikely for any nonPGO pipeline change.<br></blockquote><div><br></div></span><div>With respect to adding extra passes, I'm actually more concerned about the non-instrumented build, for which Rong did not show any data. For example, will users find their program is X% faster than no PGO with ("ME") PGO, but really (X/2)% of that is due simply to the extra passes, and not any profile guidance? We should then prefer to use the extra passes during regular -O3 builds. Conversely, if they find that their program is X% faster with ("ME") PGO, but the extra passes are making the program (X/2)% slower, then the users could be getting (3X/2)% faster instead. I am only concerned about having two variables change simultaneously; I think that instrumenting after some amount of cleanup has been done makes a lot of sense.</div><div><br></div></div></div></div></blockquote><div><br></div><div>I agree with David that LLVM's bottom up, and iterative inliner should make the pre-inline pass have little performance impact in regular build, at least in theory.</div><div>The performance swing caused by the pre-inline is likely due to inliner implementation/heuristic issues, and should be handled in the inliner.</div><div><br></div><div>Here is the performance data to enable pre-inliner for regular compilation (i.e. O2).</div><div><br></div><div>SPEC2006 and SPEC2000 C++ programs (train input)</div><div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">program          O2_+_preinline_time / O2_time - 1</font></div><div><font face="monospace, monospace">471.omnetpp            0.40%</font></div><div><font face="monospace, monospace">473.astar             11.62%</font></div><div><font face="monospace, monospace">483.xalancbmk         -3.58%</font></div><div><font face="monospace, monospace">444.namd               0.00%</font></div><div><font face="monospace, monospace">447.dealII             1.02%</font></div><div><font face="monospace, monospace">450.soplex             0.83%</font></div><div><font face="monospace, monospace">453.povray            -1.81%</font></div><div><font face="monospace, monospace">252.eon                0.33%</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Note this uses the time. 473.astar runs 11.62% longer with preinline pass.</font></div><div><br></div></div><div>For Google Internal benchmarks, we use the performance number</div><div>                                        O2_+_pre_inline speedup vs O2</div><div><div><font face="monospace, monospace">C++benchmark01            -6.88%</font></div><div><font face="monospace, monospace">C++benchmark02            +4.42%</font></div><div><font face="monospace, monospace">C++benchmark03            +0.74%</font></div><div><font face="monospace, monospace">C++benchmark04            +0.56%</font></div><div><font face="monospace, monospace">C++benchmark05            -1.02%</font></div><div><font face="monospace, monospace">C++benchmark06           -14.98%</font></div><div><font face="monospace, monospace">C++benchmark07            +4.21%</font></div><div><font face="monospace, monospace">C++benchmark08            -1.00%</font></div><div><font face="monospace, monospace">C_benchmark09             +0.92%</font></div><div><font face="monospace, monospace">C_benchmark10             -0.43%</font></div><div><font face="monospace, monospace">C++benchmark11           -12.23%</font></div><div><font face="monospace, monospace">C++benchmark12            +0.65%</font></div><div><font face="monospace, monospace">C++benchmark13            -0.96%</font></div><div><font face="monospace, monospace">C++benchmark14           -12.89%</font></div><div><font face="monospace, monospace">C_benchmark15             -0.12%</font></div><div><font face="monospace, monospace">C++benchmark16            -0.07%</font></div><div><font face="monospace, monospace">C++benchmark17            -0.18%</font></div><div><font face="monospace, monospace">C++benchmark18            +0.24%</font></div><div><font face="monospace, monospace">C++benchmark19            -1.86%</font></div><div><font face="monospace, monospace">C++benchmark20            +3.63%</font></div><div><font face="monospace, monospace">C_benchmark21             +0.40%</font></div><div><font face="monospace, monospace">C++benchmark22            -0.08%</font></div><div><font face="monospace, monospace">--------------------------------</font></div><div><font face="monospace, monospace">geometric mean           -1.82%   </font></div></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">For many benchmarks, we get negative speedup (ie.e slowdown).  </font></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div></div><div>Could Rong's proposal be made to work within the existing pipeline, but doing the instrumentation after a subset of the existing pass pipeline has been run?</div><span class="HOEnZb"><font color="#888888"><div><br></div><div>-- Sean Silva</div></font></span><div><div class="h5"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

LLVM already supports running SCC passes iteratively, so experiment<br>

like this will be easy to do -- the data can be collected.<br>

<br>

thanks,<br>

<br>

David<br>

<div><div><br>

>> -- Sean Silva<br>

>><br>

>> On Fri, Aug 7, 2015 at 4:54 PM, Rong Xu <<a href="mailto:xur@google.com" target="_blank">xur@google.com</a>> wrote:<br>

>>><br>

>>> Instrumentation based Profile Guided Optimization (PGO) is a compiler<br>

>>> technique that leverages important program runtime information, such as<br>

>>> precise edge counts and frequent value information, to make frequently<br>

>>> executed code run faster. It's proven to be one of the most effective ways<br>

>>> to improve program performance.<br>

>>><br>

>>> An important design point of PGO is to decide where to place the<br>

>>> instrumentation. In current LLVM PGO, the instrumentation is done in the<br>

>>> Clang Front-End (will be referred as FE based instrumentation). Doing this<br>

>>> early in the front-end gives better coverage information as the there are<br>

>>> more precise line number information. The resulted profile also has<br>

>>> relatively high tolerance to compiler changes. All the compiler change after<br>

>>> the instrumentation point will not lead to mismatched IR that invalidates<br>

>>> the profile.<br>

>>><br>

>>> On the other hand, doing this too early gives the compiler fewer<br>

>>> opportunities to optimize the code before instrumentation. It has<br>

>>> significant impact on instrumentation runtime performance. In addition, it<br>

>>> tends to produce larger binary and  profile data file.  Our internal C++<br>

>>> benchmarks shows that FE based instrumentation degrades the performance<br>

>>> (compared to non-instrumented version) by 58.3%, and in some extreme cases,<br>

>>> the application speed/throughput decreases by 95% (21x runtime slowdown).<br>

>>><br>

>>> Running the instrumented binary too slow is not desirable in PGO for the<br>

>>> following reasons:<br>

>>>    * This slows down already lengthy build time. In the worst case, the<br>

>>> instrumented binary is so slow that it fails to run a representative<br>

>>> workload, because slow execution can leads to more time-outs in many server<br>

>>> programs. Other typical issues include: text size too big, failure to link<br>

>>> instrumented binaries, memory usage exceeding system limits.<br>

>>>    * Slow runtime affects the program behavior. Real applications<br>

>>> sometimes monitor the program runtime and have different execution path when<br>

>>> the program run too slow. This would defeat the underlying assumption of PGO<br>

>>> and make it less effective.<br>

>>><br>

>>> This work proposes an option to turn on middle-end based instrumentation<br>

>>> (new) that aims to speed up instrumentation runtime. The new instrumentation<br>

>>> is referred to as ME based instrumentation in this document. Our<br>

>>> experimental results show that ME instrumentation can speed up the<br>

>>> instrumentation by 80% on average for typical C++ programs. Here are the two<br>

>>> main design objectives:<br>

>>>    * Co-existing with FE Instrumenter: We do not propose to replace the<br>

>>> FE based instrumentation because FE based instrumentation has its advantages<br>

>>> and applications. User can choose which phase to do instrumentation via<br>

>>> command line options.<br>

>>>    * PGO Runtime Support Sharing: The ME instrumenter will completely<br>

>>> re-use the existing PGO’s runtime support.<br>

>>><br>

>>> 1. FE Based Instrumentation Runtime Overhead Analysis<br>

>>><br>

>>> Instrumented binaries are expected to run slower due to instrumentation<br>

>>> code inserted. With FE based instrumentation, the overhead is especially<br>

>>> high and runtime slowdown can be unacceptable in many cases.  Further<br>

>>> analysis shows that there are 3  important factors  contributing to FE<br>

>>> instrumentation slowdown :<br>

>>>    * [Main] Redundant counter updates of inlined functions. C++ programs<br>

>>> can introduce large abstraction penalties by using lots of small inline<br>

>>> functions (assignment operators, getters, setters, ctors/dtors etc).<br>

>>> Overhead of instrumenting those small functions can be very large, making<br>

>>> training runs too slow and in some cases to usable;<br>

>>>    * Non-optimal placement of the count updates;<br>

>>>    * A third factor is related value profiling (to be turned on in the<br>

>>> future). Small and hot callee functions taking function pointer (callbacks)<br>

>>> can incur  overhead due to indirect call target profiling.<br>

>>><br>

>>><br>

>>> 1.1 Redundant Counter Update<br>

>>><br>

>>> If checking the assembly of the instrumented binary generated by current<br>

>>> LLVM implementation, we can find many sequence of consecutive 'incq'<br>

>>> instructions that updating difference counters in the same basic block. As<br>

>>> an example that extracted from real binary:<br>

>>>   ...<br>

>>>  incq   0xa91d80(%rip)        # 14df4b8<br>

>>> <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8><br>

>>>  incq   0xa79011(%rip)        # 14c6750<br>

>>> <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm><br>

>>>  incq   0xa79442(%rip)        # 14c6b88<br>

>>> <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv><br>

>>>  incq   0x9c288b(%rip)        # 140ffd8<br>

>>> <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl><br>

>>>  ...<br>

>>><br>

>>> From profile use point of view, many of these counter update are<br>

>>> redundant. Considering the following example:<br>

>>> void bar(){<br>

>>>  sum++;<br>

>>> }<br>

>>> void foo() {<br>

>>>  bar();<br>

>>> }<br>

>>><br>

>>> FE based instrumentation needs to insert counter update for the only BB<br>

>>> of the bar().<br>

>>> bar:                                    # @bar<br>

>>> # BB#0:                                 # %entry<br>

>>>        incq    .L__llvm_profile_counters_bar(%rip)<br>

>>>        incl    sum(%rip)<br>

>>>        retq<br>

>>><br>

>>> It also need to insert the update the BB in function foo().  After<br>

>>> inlining bar to foo(), the code is:<br>

>>> foo:                                    # @foo<br>

>>> # BB#0:                                 # %entry<br>

>>>        incq    .L__llvm_profile_counters_foo(%rip)<br>

>>>        incq    .L__llvm_profile_counters_bar(%rip)<br>

>>>        incl    sum(%rip)<br>

>>>        retq<br>

>>><br>

>>> If bar() should be always inlined, .L__llvm_profile_counters_bar(%rip) is<br>

>>> redundant -- the counter won't help downstream optimizations. On the other<br>

>>> hand, if bar() is a large function and may not be suitable to be inlined for<br>

>>> all callsites, this counter updated is necessary in order to produce more<br>

>>> accurate profile data for the out-of-line instance of the callee.<br>

>>><br>

>>> If foo() is a hot function, the overhead of updating two counters can be<br>

>>> significant. This is especially bad for C++ program where there are tons of<br>

>>> small inline functions.<br>

>>><br>

>>> There is another missing opportunity in FE based instrumentation. The<br>

>>> small functions’ control flow can usually be simplified when they are<br>

>>> inlined into caller contexts. Once the control flow is simplified, many<br>

>>> counter updates can therefore be eliminated. This is only possible for a<br>

>>> middle end based late instrumenter. Defining a custom clean-up pass to<br>

>>> remove redundant counter update is unrealistic and cannot be done in a sane<br>

>>> way without destroying the profile integrity of neither the out-of-line nor<br>

>>> inline instances of the callee.<br>

>>><br>

>>> A much simpler and cleaner solution is to do a pre-inline pass to inline<br>

>>> all the trivial inlines before instrumentation.  In addition to removing the<br>

>>> unnecessary count updates for the inline instances,  another advantage of<br>

>>> pre-inline is to  provide context sensitive profile for these small inlined<br>

>>> functions. This context senstive profile can further improve the PGO based<br>

>>> optimizations. Here is a contrived example:<br>

>>> void bar (int n) {<br>

>>>   if (n&1)<br>

>>>     do_sth1();<br>

>>>   else<br>

>>>     do_sth2();<br>

>>> }<br>

>>><br>

>>> void caller() {<br>

>>>   int s = 1;<br>

>>>   for (; s<100; s+=2)<br>

>>>     bar(s);<br>

>>><br>

>>>   for (s = 102; s< 200; s+=2)<br>

>>>     bar(s);<br>

>>> }<br>

>>><br>

>>> The direction of the branch inside bar will be totally opposite between<br>

>>> two different callsites in ‘caller’. Without pre-inlining, the branch<br>

>>> probability will be 50-50 which will be useless for later optimizations.<br>

>>> With pre-inlining, the profile will have the perfect branch count for each<br>

>>> callsite. The positive performance impact of context sensitive profiling due<br>

>>> to pre-inlining has been observed in real world large C++ programs.<br>

>>> Supporting context sensitive profiling is another way to solve this, but it<br>

>>> will introduce large additional runtime/memory overhead.<br>

>>><br>

>>><br>

>>> 1.2 Non-optimal placement of count update<br>

>>><br>

>>> Another much smaller showdown factor is the placement of the counter<br>

>>> updates. Current front-end based instrumentation applies the instrumentation<br>

>>> to each front-end lexical construct. It also minimizes the number of static<br>

>>> instrumentations. Note that it always instruments the entry count of the<br>

>>> CFG. This may result in higher dynamic instruction counts. For example,<br>

>>>      BB0<br>

>>>      | 100<br>

>>>     BB1<br>

>>> 90 /   \ 10<br>

>>>   BB2  BB3<br>

>>> 90 \   / 10<br>

>>>     BB4<br>

>>> Like the the above example, FE based instrumentation will always insert<br>

>>> count update in BB0.  The dynamic instrumentation count will be either 110<br>

>>> (Instrument bb0->bb1 and bb1->bb2) or 190 (bb0->bb1 and bb1->bb3). A better<br>

>>> instrumentation is to instrument (bb1->bb2 and bb1->bb3) where the dynamic<br>

>>> instrumentation count is 100.<br>

>>><br>

>>> Our experimental shows that the optimal placement based on edge hotness<br>

>>> can improve instrumented code performance by about 10%.  While it’s hard to<br>

>>> find the optimal placement of count update,  compiler heuristics can be used<br>

>>> the get the better placement. These heuristics  can be based on static<br>

>>> profile prediction or user annotations (like __buildin_expect) to  estimate<br>

>>> the relative edge hotness and put instrumentations on the less hot edges.<br>

>>> The initial late instrumentation has not fully implemented this placement<br>

>>> strategy yet.  With that implemented, we expect even better results than<br>

>>> what is reported here. For real world programs, another major source of the<br>

>>> slowdown is the data racing and false sharing of the counter update for<br>

>>> highly threaded programs. Pre-inlining can alleviate this problem as the<br>

>>> counters in the inline instances are not longer shared. But the complete<br>

>>> solution to the data racing issue is orthogonal to the problem we try to<br>

>>> solve here.<br>

>>><br>

>>><br>

>>> 2. High Level Design<br>

>>><br>

>>> We propose to perform a pre-profile inline pass before the PGO<br>

>>> instrumentation pass. Since the instrumentation pass is after inine, it has<br>

>>> to be done in the middle-end.<br>

>>><br>

>>> (1) The pre-inline pass<br>

>>> We will invoke a pre-inline pass before the instrumentation. When PGO is<br>

>>> on, the inlining will be split into two passes:<br>

>>>    * A pre-inline pass that is scheduled before the profile<br>

>>> instrumentation/annotation<br>

>>>    * A post-inline pass which is the regular inline pass after<br>

>>> instrumentation/annotation<br>

>>> By design, all beneficial callsites without requiring profile data should<br>

>>> be inlined in the pre-inline pass. It includes all callsites that will<br>

>>> shrink code size after inlining. All the remaining callsites will be left to<br>

>>> the regular inline pass when profile data is available.<br>

>>><br>

>>> After pre-inline, a CFG based profile instrumentation/annotation will be<br>

>>> done. A minimum weight spanning tree (MST) in CFG is first computed, then<br>

>>> only the edges not in the MST will be instrumented. The counter update<br>

>>> instructions are placed in the basic blocks.<br>

>>><br>

>>> (2) Minimum Spanning Tree (MST) based instrumentation<br>

>>> A native way of instrumentation is to insert a count update for every<br>

>>> edge in CFG which will result in  too many redundant updates that makes the<br>

>>> runtime very slow. Knuth [1] proposed a minimum spanning tree based method:<br>

>>> given a CFG, first compute a spanning tree. All edges that not in the MST<br>

>>> will be instrumented. In the profile use compilation, the counters are<br>

>>> populated (from the leaf of the spanning tree) to all the edges. Knuth<br>

>>> proved this method inserts the minimum number of instrumentations. MST based<br>

>>> method only guarantees the number static instrumentation are minimized, not<br>

>>> the dynamic instance of instrumentation. To reduce the number of dynamic<br>

>>> instrumentation, edges of potentially high counts will be put into MST first<br>

>>> so that they will have less chance to be instrumented.<br>

>>><br>

>>><br>

>>> 3. Experimental Results<br>

>>><br>

>>> 3.1 Measurement of the efficiency of instrumentation<br>

>>> Other than the runtime of the instrumented binaries, a more direct<br>

>>> measurement of the instrumentation overhead is the the sum of the raw<br>

>>> profile count values. Note that regardless what kind of instrumentations are<br>

>>> used, the raw profile count should be able to reconstruct all the edge count<br>

>>> values for the whole program. All raw profile value are obtained via<br>

>>> incrementing the counter variable value by one. The sum of the raw profile<br>

>>> count value is roughly the dynamic instruction count of the instrumented<br>

>>> code. The lower of the value, the more efficient of the instrumentation.<br>

>>><br>

>>><br>

>>> 3.2 LLVM instrumentations runtime for SPEC2006 C/C++ programs and SPEC2K<br>

>>> eon<br>

>>> The performance speedup is computed by (FE_instrumentation_runtime /<br>

>>> ME_instrumentation_runtime - 1)<br>

>>><br>

>>> We run the experiments on all C/C++ programs in SPEC2006 and 252.eon from<br>

>>> SPEC2000. For C programs, except for one outlier 456.hmmer, there are small<br>

>>> ups and downs across different programs. Late instrumentation improves hmmer<br>

>>> a lot, but it is probably due to unrelated loop optimizations (90% of<br>

>>> runtime spent in one loop nest).<br>

>>><br>

>>> For C++ programs, the performance impact of late instrumentation is very<br>

>>> large, which is as expected. The following table shows the result.  For some<br>

>>> C++ program, the time speedup is huge. For example, in  483.xalancbmk, late<br>

>>> instrumentation speeds up performance by ~60%.  Among all the SPEC C++<br>

>>> programs, only 444.namd is an outlier -- it uses a lot of macros and is a<br>

>>> very C like program.<br>

>>><br>

>>> Program           Speedup<br>

>>> 471.omnetpp       16.03%<br>

>>> 473.astar          5.00%<br>

>>> 483.xalancbmk     58.57%<br>

>>> 444.namd          -0.90%<br>

>>> 447.dealII        60.47%<br>

>>> 450.soplex         8.20%<br>

>>> 453.povray        11.34%<br>

>>> 252.eon           35.33%<br>

>>> -------------------------<br>

>>> Geomean           21.01%<br>

>>><br>

>>> 3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs<br>

>>> We also collect some statistic of the profiles generated by FE based<br>

>>> instrumentation and late instrumentation, namely, the following information:<br>

>>>    1. the number of functions that being instrumented,<br>

>>>    2. the result profile file size,<br>

>>>    3. the sum of raw count values that was mentioned earlier -- we used<br>

>>> it to measure the efficiency of the instrumentation.<br>

>>> Next table shows the ratios of the each metrics by late instrumentation<br>

>>> for the C++ programs, with FE based instrumentation as the base : column (1)<br>

>>> shows the ratios of instrumented functions; column (2) shows the ratios of<br>

>>> the profile file size; column (3) shows the ratios of the sum of raw count<br>

>>> values.<br>

>>><br>

>>>                 (1)       (2)       (3)<br>

>>> 471.omnetpp    85.36%   110.26%    46.52%<br>

>>> 473.astar      64.86%    72.72%    63.13%<br>

>>> 483.xalancbmk  51.83%    56.11%    35.77%<br>

>>> 444.namd       75.36%    82.82%    85.77%<br>

>>> 447.dealII     43.42%    46.46%    26.75%<br>

>>> 450.soplex     71.80%    87.54%    51.19%<br>

>>> 453.povray     78.68%    83.57%    64.37%<br>

>>> 252.eon        72.06%    91.22%    30.02%<br>

>>> ----------------------------------------<br>

>>> Geomean        66.50%    76.36%    47.01%<br>

>>><br>

>>><br>

>>> For FE based instrumentation, profile count variables generated for the<br>

>>> dead functions will not be removed (like __llvm_prf_names, __llvm_prf_data,<br>

>>> and __llvm_prf_cnts) from the data/text segment, nor in the result profile.<br>

>>> There is a recent patch that removes these unused data for COMDAT functions,<br>

>>> but that patch won’t touch regular functions. This is the main reason for<br>

>>> the larger number of instrumented functions and larger profile file size for<br>

>>> the FE based instrumentation. The reduction of the sum of raw count values<br>

>>> is mainly due to the elimination of redundant profile updates enabled by the<br>

>>> pre-inlining.<br>

>>><br>

>>> For C programs, we observe similar improvement in the profile size<br>

>>> (geomean ratio of 73.75%) and smaller improvement in the number of<br>

>>> instrumented functions (geomean ratio of 87.49%) and the sum of raw count<br>

>>> values (geomean of 82.76%).<br>

>>><br>

>>><br>

>>> 3.4 LLVM instrumentations runtime performance for Google internal C/C++<br>

>>> benchmarks<br>

>>><br>

>>> We also use Google internal benchmarks (mostly typical C++ applications)<br>

>>> to measure the relative performance b/w FE based instrumentation and late<br>

>>> instrumentation.  The following table shows the speedup of late<br>

>>> instrumentation vs FE based instrumentation. Note that C++benchmark01 is a<br>

>>> very large multi-threaded C++ program. Late instrumentation sees 4x speedup.<br>

>>> Larger than 3x speedups are also seen in many other programs.<br>

>>><br>

>>> C++_bencharmk01    416.98%<br>

>>> C++_bencharmk02      6.29%<br>

>>> C++_bencharmk03     22.39%<br>

>>> C++_bencharmk04     28.05%<br>

>>> C++_bencharmk05      2.00%<br>

>>> C++_bencharmk06    675.89%<br>

>>> C++_bencharmk07    359.19%<br>

>>> C++_bencharmk08    395.03%<br>

>>> C_bencharmk09       15.11%<br>

>>> C_bencharmk10        5.47%<br>

>>> C++_bencharmk11      5.73%<br>

>>> C++_bencharmk12      2.31%<br>

>>> C++_bencharmk13     87.73%<br>

>>> C++_bencharmk14      7.22%<br>

>>> C_bencharmk15       -0.51%<br>

>>> C++_bencharmk16     59.15%<br>

>>> C++_bencharmk17    358.82%<br>

>>> C++_bencharmk18    861.36%<br>

>>> C++_bencharmk19     29.62%<br>

>>> C++_bencharmk20     11.82%<br>

>>> C_bencharmk21        0.53%<br>

>>> C++_bencharmk22     43.10%<br>

>>> ---------------------------<br>

>>> Geomean             83.03%<br>

>>><br>

>>><br>

>>> 3.5 Statistics of LLVM profiles for Google internal benchmarks<br>

>>><br>

>>> The following shows the profile statics using Google internal benchmarks.<br>

>>>                          (1)       (2)       (3)<br>

>>> C++_bencharmk01         36.84%    40.29%     2.32%<br>

>>> C++_bencharmk02         39.20%    40.49%    42.39%<br>

>>> C++_bencharmk03         39.37%    40.65%    23.24%<br>

>>> C++_bencharmk04         39.13%    40.68%    17.70%<br>

>>> C++_bencharmk05         36.58%    38.27%    51.08%<br>

>>> C++_bencharmk06         29.50%    27.87%     2.87%<br>

>>> C++_bencharmk07         29.50%    27.87%     1.73%<br>

>>> C++_bencharmk08         29.50%    27.87%     4.17%<br>

>>> C_bencharmk09           53.95%    68.00%    11.18%<br>

>>> C_bencharmk10           53.95%    68.00%    31.74%<br>

>>> C++_bencharmk11         36.40%    37.07%    46.12%<br>

>>> C++_bencharmk12         38.44%    41.90%    73.59%<br>

>>> C++_bencharmk13         39.28%    42.72%    29.56%<br>

>>> C++_bencharmk14         38.59%    42.20%    13.42%<br>

>>> C_bencharmk15           57.45%    48.50%    66.99%<br>

>>> C++_bencharmk16         36.86%    42.18%    16.53%<br>

>>> C++_bencharmk17         37.82%    39.77%    13.68%<br>

>>> C++_bencharmk18         37.82%    39.77%     7.96%<br>

>>> C++_bencharmk19         37.52%    40.46%     1.85%<br>

>>> C++_bencharmk20         32.37%    30.44%    19.69%<br>

>>> C_bencharmk21           37.63%    40.42%    88.81%<br>

>>> C++_bencharmk22         36.28%    36.92%    21.62%<br>

>>> --------------------------------------------------<br>

>>> Geomean                 38.22%    39.96%    15.58%<br>

>>><br>

>>><br>

>>> 4. Implementation Details:<br>

>>><br>

>>> We need to add new option(s) for the alternative PGO instrumentation pass<br>

>>> in the middle end. It can in one of the following forms:<br>

>>><br>

>>>    (1) Complete new options that are on par with current PGO options:<br>

>>> -fprofile-late-instr-generate[=<profile_file>]? for PGO Instrumentation, and<br>

>>> -fprofile-late-instr-use[=<profile_file>]? for PGO USE.<br>

>>>    (2) Or, late instrumentation can be turned on with an additional<br>

>>> option -fprofile-instr-late with current PGO options. I. e.<br>

>>> -fprofile-instr-late -fprofile-instr-generate[=<profile_file>]? for PGO<br>

>>> instrumentation, and -fprofile-instr-late<br>

>>> -fprofile-instr-use[=<profile_file>]? for PGO use.<br>

>>>    (3) Alternatively to (2), only keep -fprofile-instr-late option in PGO<br>

>>> instrumentation. Adding a magic tag in profile so that FE based profile and<br>

>>> late instrumented profile can be automatically detected by profile loader In<br>

>>> PGO use compilation. This requires a slight profile format change.<br>

>>><br>

>>> In our prototype implementation, two new passes are added in the<br>

>>> beginning of PassManagerBuilder::populateModulePassManager(), namely<br>

>>> PreProfileInlinerPass and PGOLateInstrumentationPass.<br>

>>><br>

>>><br>

>>> 4.1 Pre-inline pass:<br>

>>><br>

>>> It is controlled by back-end option "-preinline" and<br>

>>> "-disable-preinline". If the user specifies any llvm option of<br>

>>> "-fprofile-late-instr-{generate|use}, option "-mllvm -preinline" will be<br>

>>> automatically inserted in the driver.. To disable the pre-inliner when late<br>

>>> instrumentation is enabled, use option "-mllvm -disable-preinline".<br>

>>><br>

>>> For now, only minimum tuning is done for the pre-inliner, which simply<br>

>>> adjusts the inline threshold: If -Oz is specified, the threshold is set to<br>

>>> 25. Otherwise, it is 75.<br>

>>><br>

>>> The following clean up passes are added to PassManager, right after the<br>

>>> PreProfileInline pass:<br>

>>>   createEarlyCSEPass()<br>

>>>   createJumpThreadingPass()<br>

>>>   createCorrelatedValuePropagationPass()<br>

>>>   createCFGSimplificationPass()<br>

>>>   createInstructionCombiningPass()<br>

>>>   createGVNPass(DisableGVNLoadPRE)<br>

>>>   createPeepholePASS()<br>

>>> Some of them might not be necessary.<br>

>>><br>

>>> 4.2 Late Instrumentation Pass:<br>

>>> The late instrumentation is right after the pre-inline pass and it's<br>

>>> cleanup passes. It is controlled by opt option "-pgo-late-instr-gen" and<br>

>>> "-pgo-late-instr-use". For "-pgo-late-instr-use" option, the driver will<br>

>>> provide the profile name.<br>

>>> For "-pgo-late-instr-gen", a pass that calls createInstrProfilingPass()<br>

>>> is also added to PassManager to lower the instrumentation intrinsics<br>

>>><br>

>>> PGOLateInstrumeatnion is a module pass that applies the instrumentation<br>

>>> to each function by class PGOLateInstrumentationFunc. For each function,<br>

>>> perform the following steps:<br>

>>>    1. First collect all the CFG edges. Assign an estimated weight to each<br>

>>> edge. Critical edges and back-edges are assigned to high value of weights.<br>

>>> One fake node and a few fake edges (from the fake node to the entry node,<br>

>>> and from all the exit nodes to the fake node) are also added to the<br>

>>> worklist.<br>

>>>    2. Construct the MST. The edges with the higher weight will be put to<br>

>>> MST first, unless it forms a cycle.<br>

>>>    3. Traverse the CFG and compute the CFG hash using CRC32 of the index<br>

>>> of each BB.<br>

>>> The above three steps are the same for profile-generate and profile-use<br>

>>> compilation.<br>

>>><br>

>>> In the next step, for profile-generation compilation, all the edges that<br>

>>> not in the MST are instrumented. If this is a critical edge, split the edge<br>

>>> first. The actual instrumentation is to generate<br>

>>> Intrinsic::instrprof_increment() in the instrumented BB. This intrinsic will<br>

>>> be lowed by pass createInstrProfilingPass().<br>

>>><br>

>>> In the next step, for profile-generation compilation, all the edges that<br>

>>> not in the MST are instrumented. If this is a critical edge, split the edge<br>

>>> first. The actual instrumentation is to generate<br>

>>> Intrinsic::instrprof_increment() in the instrumented BB. This intrinsic will<br>

>>> be lowed by pass createInstrProfilingPass().<br>

>>><br>

>>> For -fprofile-use compilation, first read in the counters and the CFG<br>

>>> hash from the profile file. If the CFG hash matches, populate the counters<br>

>>> to all the edges in reverse topological order of the MST. Once having all<br>

>>> the edge counts, set the branch weights metadata for the IR having multiple<br>

>>> branches. Also apply the cold/hot function attributes based on function<br>

>>> level counts.<br>

>>><br>

>>><br>

>>> 4.3 Profile Format:<br>

>>><br>

>>> The late instrumentation profile is mostly the same as the one from<br>

>>> front-end instrument-ion. The difference is<br>

>>>    * Function checksums are different.<br>

>>>    * Function entry counts are no longer available.<br>

>>> For llvm-profdata utility, options -lateinstr needs to be used to<br>

>>> differentiate FE based and late instrumentation profiles, unless a magic tag<br>

>>> is added to the profile.<br>

>>><br>

>>><br>

>>> 5. References:<br>

>>> [1] Donald E. Knuth, Francis R. Stevenson. Optimal measurement of points<br>

>>> for program frequency counts. BIT Numerical Mathematics 1973, Volume 13,<br>

>>> Issue 3, pp 313-322<br>

>>><br>

>><br>

><br>

</div></div></blockquote></div></div></div><br></div></div>

</blockquote></div><br></div></div>