[llvm-dev] RFC: PGO Late instrumentation for LLVM

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Sat Aug 8 06:31:55 PDT 2015


On Fri, Aug 7, 2015 at 10:56 PM, Sean Silva <chisophugis at gmail.com> wrote:
> Accidentally sent to uiuc server.
>
>
> On Fri, Aug 7, 2015 at 10:49 PM, Sean Silva <chisophugis at gmail.com> wrote:
>>
>> Can you compare your results with another approach: simply do not
>> instrument the top 1% hottest functions (by function entry count)? If this
>> simple approach provides most of the benefits (my measurements on one
>> codebase I tested show that it would eliminate over 97% of total function
>> counts), we may be able to use a simpler approach.

For a static compiler, this is not possible. It also seems to defeat the
purpose of PGO -- the hottest functions are the ones that need profile
guidance the most.

>>
>> The biggest thing I notice about this proposal is that although the focus
>> of the proposal is on reducing profiling overhead, it requires using a
>> different pass pipeline during *both* the "instr-generate" compilation and
>> the "instr-use" compilation. Therefore it is more than simply a reduction in
>> profiling overhead -- it can also change performance even apart from adding
>> the profiling information to the IR. I think it would be interesting to test
>> the modified pass pipeline even in the absence of profile information to
>> better understand its effects in isolation (for example, maybe it would be
>> good to add these passes during -O3 regardless of whether we are doing PGO).
>>

The pipeline change is very PGO specific. I expect it to have very little
impact on regular compilations:
1) LLVM's bottom-up inliner is already iterative.
2) The performance impact (on the instrumented build) can be as large as
4x -- which is unlikely for any non-PGO pipeline change.

LLVM already supports running SCC passes iteratively, so an experiment
like this will be easy to do -- the data can be collected.

thanks,

David

>> -- Sean Silva
>>
>> On Fri, Aug 7, 2015 at 4:54 PM, Rong Xu <xur at google.com> wrote:
>>>
>>> Instrumentation based Profile Guided Optimization (PGO) is a compiler
>>> technique that leverages important program runtime information, such as
>>> precise edge counts and frequent value information, to make frequently
>>> executed code run faster. It's proven to be one of the most effective ways
>>> to improve program performance.
>>>
>>> An important design point of PGO is deciding where to place the
>>> instrumentation. In current LLVM PGO, the instrumentation is done in the
>>> Clang front-end (this will be referred to as FE based instrumentation).
>>> Doing this early in the front-end gives better coverage information because
>>> more precise line number information is available. The resulting profile
>>> also has a relatively high tolerance to compiler changes: compiler changes
>>> after the instrumentation point will not lead to mismatched IR that
>>> invalidates the profile.
>>>
>>> On the other hand, doing this too early gives the compiler fewer
>>> opportunities to optimize the code before instrumentation. This has a
>>> significant impact on instrumentation runtime performance. In addition, it
>>> tends to produce larger binaries and profile data files. Our internal C++
>>> benchmarks show that FE based instrumentation degrades performance
>>> (compared to the non-instrumented version) by 58.3%, and in some extreme
>>> cases the application speed/throughput decreases by 95% (a 21x runtime
>>> slowdown).
>>>
>>> An instrumented binary that runs too slowly is undesirable in PGO for the
>>> following reasons:
>>>    * It slows down an already lengthy build. In the worst case, the
>>> instrumented binary is so slow that it fails to run a representative
>>> workload, because slow execution can lead to more time-outs in many server
>>> programs. Other typical issues include: text size that is too big, failure
>>> to link instrumented binaries, and memory usage exceeding system limits.
>>>    * Slow runtime affects program behavior. Real applications sometimes
>>> monitor their own runtime and take different execution paths when the
>>> program runs too slowly. This defeats the underlying assumption of PGO
>>> and makes it less effective.
>>>
>>> This work proposes an option to turn on (new) middle-end based
>>> instrumentation that aims to speed up the instrumentation runtime. The new
>>> instrumentation is referred to as ME based instrumentation in this
>>> document. Our experimental results show that ME instrumentation can speed
>>> up the instrumentation by 80% on average for typical C++ programs. Here are
>>> the two main design objectives:
>>>    * Co-existing with the FE instrumenter: we do not propose to replace
>>> FE based instrumentation, because FE based instrumentation has its own
>>> advantages and applications. Users can choose in which phase to do the
>>> instrumentation via command line options.
>>>    * PGO runtime support sharing: the ME instrumenter will completely
>>> reuse the existing PGO runtime support.
>>>
>>> 1. FE Based Instrumentation Runtime Overhead Analysis
>>>
>>> Instrumented binaries are expected to run slower due to the inserted
>>> instrumentation code. With FE based instrumentation, the overhead is
>>> especially high and the runtime slowdown can be unacceptable in many cases.
>>> Further analysis shows that there are 3 important factors contributing to
>>> the FE instrumentation slowdown:
>>>    * [Main] Redundant counter updates of inlined functions. C++ programs
>>> can introduce large abstraction penalties by using lots of small inline
>>> functions (assignment operators, getters, setters, ctors/dtors, etc.). The
>>> overhead of instrumenting those small functions can be very large, making
>>> training runs too slow and in some cases unusable;
>>>    * Non-optimal placement of the counter updates;
>>>    * A third factor is related to value profiling (to be turned on in the
>>> future). Small and hot callee functions taking function pointers
>>> (callbacks) can incur overhead due to indirect call target profiling.
>>>
>>>
>>> 1.1 Redundant Counter Update
>>>
>>> Examining the assembly of an instrumented binary generated by the current
>>> LLVM implementation, we can find many sequences of consecutive 'incq'
>>> instructions that update different counters in the same basic block. Here
>>> is an example extracted from a real binary:
>>>   ...
>>>  incq   0xa91d80(%rip)        # 14df4b8
>>> <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8>
>>>  incq   0xa79011(%rip)        # 14c6750
>>> <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm>
>>>  incq   0xa79442(%rip)        # 14c6b88
>>> <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv>
>>>  incq   0x9c288b(%rip)        # 140ffd8
>>> <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl>
>>>  ...
>>>
>>> From the profile use point of view, many of these counter updates are
>>> redundant. Consider the following example:
>>> void bar(){
>>>  sum++;
>>> }
>>> void foo() {
>>>  bar();
>>> }
>>>
>>> FE based instrumentation needs to insert a counter update for the only BB
>>> of bar():
>>> bar:                                    # @bar
>>> # BB#0:                                 # %entry
>>>        incq    .L__llvm_profile_counters_bar(%rip)
>>>        incl    sum(%rip)
>>>        retq
>>>
>>> It also needs to insert an update for the BB in function foo(). After
>>> inlining bar() into foo(), the code is:
>>> foo:                                    # @foo
>>> # BB#0:                                 # %entry
>>>        incq    .L__llvm_profile_counters_foo(%rip)
>>>        incq    .L__llvm_profile_counters_bar(%rip)
>>>        incl    sum(%rip)
>>>        retq
>>>
>>> If bar() is always inlined, the update of
>>> .L__llvm_profile_counters_bar(%rip) is redundant -- the counter won't help
>>> downstream optimizations. On the other hand, if bar() is a large function
>>> that may not be suitable for inlining at all callsites, this counter update
>>> is necessary in order to produce more accurate profile data for the
>>> out-of-line instance of the callee.
>>>
>>> If foo() is a hot function, the overhead of updating two counters can be
>>> significant. This is especially bad for C++ programs, where there are tons
>>> of small inline functions.
>>>
>>> There is another missed opportunity in FE based instrumentation. The
>>> control flow of small functions can usually be simplified when they are
>>> inlined into caller contexts. Once the control flow is simplified, many
>>> counter updates can be eliminated. This is only possible for a middle-end
>>> based late instrumenter. Defining a custom clean-up pass to remove
>>> redundant counter updates is unrealistic and cannot be done in a sane way
>>> without destroying the profile integrity of either the out-of-line or the
>>> inline instances of the callee.
>>>
>>> A much simpler and cleaner solution is to run a pre-inline pass that
>>> inlines all the trivial callsites before instrumentation. In addition to
>>> removing the unnecessary counter updates for the inline instances, another
>>> advantage of pre-inlining is that it provides context sensitive profiles
>>> for these small inlined functions. The context sensitive profile can
>>> further improve the PGO based optimizations. Here is a contrived example:
>>> void bar (int n) {
>>>   if (n&1)
>>>     do_sth1();
>>>   else
>>>     do_sth2();
>>> }
>>>
>>> void caller() {
>>>   int s = 1;
>>>   for (; s<100; s+=2)
>>>     bar(s);
>>>
>>>   for (s = 102; s< 200; s+=2)
>>>     bar(s);
>>> }
>>>
>>> The direction of the branch inside bar() is completely opposite at the
>>> two different callsites in caller(). Without pre-inlining, the branch
>>> probability will be 50-50, which is useless for later optimizations. With
>>> pre-inlining, the profile will have the perfect branch count for each
>>> callsite. The positive performance impact of the context sensitive profile
>>> enabled by pre-inlining has been observed in real world large C++ programs.
>>> Supporting context sensitive profiling directly is another way to solve
>>> this, but it would introduce large additional runtime/memory overhead.
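>>>
>>> To make the effect concrete, here is a rough sketch (not part of the
>>> original proposal) of what caller() might look like after pre-inlining
>>> bar() into both callsites. The branch in each inline instance is now
>>> perfectly biased, so its per-callsite profile is exact. The function name
>>> is made up; do_sth1()/do_sth2() are the same functions as in the example
>>> above.
>>> void caller_after_preinline() {   // hypothetical result of pre-inlining
>>>   int s = 1;
>>>   for (; s < 100; s += 2) {
>>>     // inline instance #1: s is always odd, so this branch is 100% taken
>>>     if (s & 1)
>>>       do_sth1();
>>>     else
>>>       do_sth2();
>>>   }
>>>   for (s = 102; s < 200; s += 2) {
>>>     // inline instance #2: s is always even, so this branch is never taken
>>>     if (s & 1)
>>>       do_sth1();
>>>     else
>>>       do_sth2();
>>>   }
>>> }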
>>>
>>>
>>> 1.2 Non-optimal placement of count update
>>>
>>> Another, much smaller, slowdown factor is the placement of the counter
>>> updates. Current front-end based instrumentation applies the
>>> instrumentation to each front-end lexical construct while minimizing the
>>> number of static instrumentation points. Note that it always instruments
>>> the entry count of the CFG. This may result in higher dynamic instruction
>>> counts. For example:
>>>      BB0
>>>      | 100
>>>     BB1
>>> 90 /   \ 10
>>>   BB2  BB3
>>> 90 \   / 10
>>>     BB4
>>> In the above example, FE based instrumentation will always insert a
>>> counter update in BB0. The dynamic instrumentation count will be either 190
>>> (instrumenting bb0->bb1 and bb1->bb2) or 110 (instrumenting bb0->bb1 and
>>> bb1->bb3). A better choice is to instrument bb1->bb2 and bb1->bb3, where
>>> the dynamic instrumentation count is 100.
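>>>
>>> As a small illustration (not from the original text), the dynamic cost of
>>> a placement is just the sum of the execution counts of the instrumented
>>> edges; a minimal sketch, using the edge counts from the figure above:
>>> #include <cstdio>
>>>
>>> int main() {
>>>   // Edge counts from the example CFG:
>>>   // bb0->bb1 = 100, bb1->bb2 = 90, bb1->bb3 = 10.
>>>   int entry = 100, left = 90, right = 10;
>>>   std::printf("bb0->bb1 and bb1->bb2: %d\n", entry + left);   // 190
>>>   std::printf("bb0->bb1 and bb1->bb3: %d\n", entry + right);  // 110
>>>   std::printf("bb1->bb2 and bb1->bb3: %d\n", left + right);   // 100
>>> }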
>>>
>>> Our experiments show that optimal placement based on edge hotness can
>>> improve instrumented code performance by about 10%. While it is hard to
>>> find the truly optimal placement of counter updates, compiler heuristics
>>> can be used to get a better placement. These heuristics can be based on
>>> static profile prediction or user annotations (like __builtin_expect) to
>>> estimate the relative edge hotness and put instrumentation on the less hot
>>> edges. The initial late instrumentation has not fully implemented this
>>> placement strategy yet. With that implemented, we expect even better
>>> results than what is reported here. For real world programs, another major
>>> source of slowdown is data races and false sharing of the counter updates
>>> in highly threaded programs. Pre-inlining can alleviate this problem, as
>>> the counters in the inline instances are no longer shared. But a complete
>>> solution to the data race issue is orthogonal to the problem we try to
>>> solve here.
>>>
>>>
>>> 2. High Level Design
>>>
>>> We propose to perform a pre-profile inline pass before the PGO
>>> instrumentation pass. Since the instrumentation pass runs after inlining,
>>> it has to be done in the middle-end.
>>>
>>> (1) The pre-inline pass
>>> We will invoke a pre-inline pass before the instrumentation. When PGO is
>>> on, inlining will be split into two passes:
>>>    * A pre-inline pass that is scheduled before the profile
>>> instrumentation/annotation
>>>    * A post-inline pass, which is the regular inline pass after
>>> instrumentation/annotation
>>> By design, all callsites that are beneficial to inline without profile
>>> data should be inlined in the pre-inline pass. This includes all callsites
>>> that shrink code size after inlining. All the remaining callsites will be
>>> left to the regular inline pass when profile data is available.
>>>
>>> After pre-inlining, a CFG based profile instrumentation/annotation is
>>> done. A minimum weight spanning tree (MST) of the CFG is first computed,
>>> and then only the edges not in the MST are instrumented. The counter update
>>> instructions are placed in basic blocks.
>>>
>>> (2) Minimum Spanning Tree (MST) based instrumentation
>>> A naive way of instrumenting is to insert a counter update for every edge
>>> in the CFG, which results in many redundant updates and makes the runtime
>>> very slow. Knuth [1] proposed a minimum spanning tree based method: given a
>>> CFG, first compute a spanning tree; all edges that are not in the MST are
>>> instrumented. In the profile use compilation, the counters are propagated
>>> (from the leaves of the spanning tree) to all the edges. Knuth proved that
>>> this method inserts the minimum number of instrumentation points. The MST
>>> based method only guarantees that the number of static instrumentation
>>> points is minimized, not the number of dynamic instrumentation instances.
>>> To reduce the dynamic instrumentation count, edges with potentially high
>>> counts are put into the MST first so that they have less chance of being
>>> instrumented.
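>>>
>>> For illustration only (this is not the actual implementation), here is a
>>> minimal, self-contained sketch of the edge-selection step: edges are
>>> sorted by estimated hotness, the hottest edges are greedily added to the
>>> spanning tree via union-find, and whatever is left over becomes the set of
>>> instrumented edges. The names and weights below are made up; the fake
>>> entry/exit edges mentioned in section 4 are omitted for brevity.
>>> #include <algorithm>
>>> #include <cstdio>
>>> #include <numeric>
>>> #include <vector>
>>>
>>> struct Edge { int Src, Dst; unsigned Weight; };
>>>
>>> // Simple union-find used to detect whether an edge would form a cycle.
>>> static int find(std::vector<int> &Parent, int X) {
>>>   while (Parent[X] != X) X = Parent[X] = Parent[Parent[X]];
>>>   return X;
>>> }
>>>
>>> // Returns the edges that are NOT in the spanning tree, i.e. the edges
>>> // that would receive a counter update.
>>> static std::vector<Edge> selectInstrumentedEdges(int NumBB,
>>>                                                  std::vector<Edge> Edges) {
>>>   // Hotter (higher weight) edges go into the tree first, so they are
>>>   // less likely to be instrumented.
>>>   std::sort(Edges.begin(), Edges.end(),
>>>             [](const Edge &A, const Edge &B) { return A.Weight > B.Weight; });
>>>   std::vector<int> Parent(NumBB);
>>>   std::iota(Parent.begin(), Parent.end(), 0);
>>>   std::vector<Edge> Instrumented;
>>>   for (const Edge &E : Edges) {
>>>     int A = find(Parent, E.Src), B = find(Parent, E.Dst);
>>>     if (A == B)              // would form a cycle: not a tree edge
>>>       Instrumented.push_back(E);
>>>     else                     // becomes a tree edge: no counter needed
>>>       Parent[A] = B;
>>>   }
>>>   return Instrumented;
>>> }
>>>
>>> int main() {
>>>   // The diamond CFG from section 1.2: BB0->BB1 (100), BB1->BB2 (90),
>>>   // BB1->BB3 (10), BB2->BB4 (90), BB3->BB4 (10).
>>>   std::vector<Edge> Edges = {{0,1,100},{1,2,90},{1,3,10},{2,4,90},{3,4,10}};
>>>   for (const Edge &E : selectInstrumentedEdges(5, Edges))
>>>     std::printf("instrument edge BB%d->BB%d\n", E.Src, E.Dst);
>>> }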
>>>
>>>
>>> 3. Experimental Results
>>>
>>> 3.1 Measurement of the efficiency of instrumentation
>>> Other than the runtime of the instrumented binaries, a more direct
>>> measurement of the instrumentation overhead is the sum of the raw profile
>>> count values. Note that regardless of what kind of instrumentation is used,
>>> the raw profile counts must be able to reconstruct all the edge count
>>> values for the whole program. All raw profile values are obtained by
>>> incrementing the counter variables by one, so the sum of the raw profile
>>> count values is roughly the dynamic instruction count of the
>>> instrumentation code. The lower the value, the more efficient the
>>> instrumentation.
>>>
>>>
>>> 3.2 LLVM instrumentation runtime for SPEC2006 C/C++ programs and SPEC2K
>>> eon
>>> The performance speedup is computed as (FE_instrumentation_runtime /
>>> ME_instrumentation_runtime - 1).
>>>
>>> We ran the experiments on all C/C++ programs in SPEC2006 and on 252.eon
>>> from SPEC2000. For C programs, except for the outlier 456.hmmer, there are
>>> small ups and downs across different programs. Late instrumentation
>>> improves hmmer a lot, but that is probably due to unrelated loop
>>> optimizations (90% of its runtime is spent in one loop nest).
>>>
>>> For C++ programs, the performance impact of late instrumentation is very
>>> large, as expected. The following table shows the results. For some C++
>>> programs, the speedup is huge: for example, in 483.xalancbmk, late
>>> instrumentation speeds up performance by ~60%. Among all the SPEC C++
>>> programs, only 444.namd is an outlier -- it uses a lot of macros and is a
>>> very C-like program.
>>>
>>> Program           Speedup
>>> 471.omnetpp       16.03%
>>> 473.astar          5.00%
>>> 483.xalancbmk     58.57%
>>> 444.namd          -0.90%
>>> 447.dealII        60.47%
>>> 450.soplex         8.20%
>>> 453.povray        11.34%
>>> 252.eon           35.33%
>>> -------------------------
>>> Geomean           21.01%
>>>
>>> 3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs
>>> We also collected some statistics about the profiles generated by FE based
>>> instrumentation and late instrumentation, namely:
>>>    1. the number of functions that are instrumented,
>>>    2. the resulting profile file size,
>>>    3. the sum of raw count values mentioned earlier -- we use it to
>>> measure the efficiency of the instrumentation.
>>> The next table shows the ratio of each metric for late instrumentation
>>> relative to FE based instrumentation for the C++ programs: column (1) shows
>>> the ratio of instrumented functions; column (2) shows the ratio of the
>>> profile file size; column (3) shows the ratio of the sum of raw count
>>> values.
>>>
>>>                 (1)       (2)       (3)
>>> 471.omnetpp    85.36%   110.26%    46.52%
>>> 473.astar      64.86%    72.72%    63.13%
>>> 483.xalancbmk  51.83%    56.11%    35.77%
>>> 444.namd       75.36%    82.82%    85.77%
>>> 447.dealII     43.42%    46.46%    26.75%
>>> 450.soplex     71.80%    87.54%    51.19%
>>> 453.povray     78.68%    83.57%    64.37%
>>> 252.eon        72.06%    91.22%    30.02%
>>> ----------------------------------------
>>> Geomean        66.50%    76.36%    47.01%
>>>
>>>
>>> For FE based instrumentation, the profile count variables generated for
>>> dead functions (those placed in __llvm_prf_names, __llvm_prf_data, and
>>> __llvm_prf_cnts) are not removed from the data/text segments, nor from the
>>> resulting profile. There is a recent patch that removes this unused data
>>> for COMDAT functions, but that patch does not touch regular functions. This
>>> is the main reason for the larger number of instrumented functions and the
>>> larger profile file size with FE based instrumentation. The reduction in
>>> the sum of raw count values is mainly due to the elimination of redundant
>>> profile updates enabled by the pre-inlining.
>>>
>>> For C programs, we observe a similar improvement in the profile size
>>> (geomean ratio of 73.75%) and smaller improvements in the number of
>>> instrumented functions (geomean ratio of 87.49%) and the sum of raw count
>>> values (geomean of 82.76%).
>>>
>>>
>>> 3.4 LLVM instrumentation runtime performance for Google internal C/C++
>>> benchmarks
>>>
>>> We also use Google internal benchmarks (mostly typical C++ applications)
>>> to measure the relative performance between FE based instrumentation and
>>> late instrumentation. The following table shows the speedup of late
>>> instrumentation vs. FE based instrumentation. Note that C++_benchmark01 is
>>> a very large multi-threaded C++ program for which late instrumentation sees
>>> a 4x speedup. Speedups larger than 3x are also seen in many other programs.
>>>
>>> C++_benchmark01    416.98%
>>> C++_benchmark02      6.29%
>>> C++_benchmark03     22.39%
>>> C++_benchmark04     28.05%
>>> C++_benchmark05      2.00%
>>> C++_benchmark06    675.89%
>>> C++_benchmark07    359.19%
>>> C++_benchmark08    395.03%
>>> C_benchmark09       15.11%
>>> C_benchmark10        5.47%
>>> C++_benchmark11      5.73%
>>> C++_benchmark12      2.31%
>>> C++_benchmark13     87.73%
>>> C++_benchmark14      7.22%
>>> C_benchmark15       -0.51%
>>> C++_benchmark16     59.15%
>>> C++_benchmark17    358.82%
>>> C++_benchmark18    861.36%
>>> C++_benchmark19     29.62%
>>> C++_benchmark20     11.82%
>>> C_benchmark21        0.53%
>>> C++_benchmark22     43.10%
>>> ---------------------------
>>> Geomean             83.03%
>>>
>>>
>>> 3.5 Statistics of LLVM profiles for Google internal benchmarks
>>>
>>> The following table shows the profile statistics for the Google internal
>>> benchmarks, using the same three ratio columns as in section 3.3.
>>>                          (1)       (2)       (3)
>>> C++_benchmark01         36.84%    40.29%     2.32%
>>> C++_benchmark02         39.20%    40.49%    42.39%
>>> C++_benchmark03         39.37%    40.65%    23.24%
>>> C++_benchmark04         39.13%    40.68%    17.70%
>>> C++_benchmark05         36.58%    38.27%    51.08%
>>> C++_benchmark06         29.50%    27.87%     2.87%
>>> C++_benchmark07         29.50%    27.87%     1.73%
>>> C++_benchmark08         29.50%    27.87%     4.17%
>>> C_benchmark09           53.95%    68.00%    11.18%
>>> C_benchmark10           53.95%    68.00%    31.74%
>>> C++_benchmark11         36.40%    37.07%    46.12%
>>> C++_benchmark12         38.44%    41.90%    73.59%
>>> C++_benchmark13         39.28%    42.72%    29.56%
>>> C++_benchmark14         38.59%    42.20%    13.42%
>>> C_benchmark15           57.45%    48.50%    66.99%
>>> C++_benchmark16         36.86%    42.18%    16.53%
>>> C++_benchmark17         37.82%    39.77%    13.68%
>>> C++_benchmark18         37.82%    39.77%     7.96%
>>> C++_benchmark19         37.52%    40.46%     1.85%
>>> C++_benchmark20         32.37%    30.44%    19.69%
>>> C_benchmark21           37.63%    40.42%    88.81%
>>> C++_benchmark22         36.28%    36.92%    21.62%
>>> --------------------------------------------------
>>> Geomean                 38.22%    39.96%    15.58%
>>>
>>>
>>> 4. Implementation Details:
>>>
>>> We need to add new option(s) for the alternative PGO instrumentation pass
>>> in the middle-end. They can take one of the following forms:
>>>
>>>    (1) Completely new options that are on par with the current PGO
>>> options: -fprofile-late-instr-generate[=<profile_file>]? for PGO
>>> instrumentation, and -fprofile-late-instr-use[=<profile_file>]? for PGO
>>> use.
>>>    (2) Or, late instrumentation can be turned on with an additional option
>>> -fprofile-instr-late alongside the current PGO options, i.e.
>>> -fprofile-instr-late -fprofile-instr-generate[=<profile_file>]? for PGO
>>> instrumentation, and -fprofile-instr-late
>>> -fprofile-instr-use[=<profile_file>]? for PGO use.
>>>    (3) Alternatively to (2), only keep the -fprofile-instr-late option for
>>> PGO instrumentation, and add a magic tag to the profile so that FE based
>>> and late instrumented profiles can be automatically detected by the profile
>>> loader in the PGO use compilation. This requires a slight profile format
>>> change.
>>>
>>> In our prototype implementation, two new passes are added at the
>>> beginning of PassManagerBuilder::populateModulePassManager(), namely
>>> PreProfileInlinerPass and PGOLateInstrumentationPass.
>>>
>>>
>>> 4.1 Pre-inline pass:
>>>
>>> It is controlled by the back-end options "-preinline" and
>>> "-disable-preinline". If the user specifies any of the
>>> "-fprofile-late-instr-{generate|use}" options, "-mllvm -preinline" will be
>>> added automatically by the driver. To disable the pre-inliner when late
>>> instrumentation is enabled, use "-mllvm -disable-preinline".
>>>
>>> For now, only minimal tuning has been done for the pre-inliner, which
>>> simply adjusts the inline threshold: if -Oz is specified, the threshold is
>>> set to 25; otherwise, it is 75.
>>>
>>> The following clean-up passes are added to the PassManager right after the
>>> PreProfileInline pass:
>>>   createEarlyCSEPass()
>>>   createJumpThreadingPass()
>>>   createCorrelatedValuePropagationPass()
>>>   createCFGSimplificationPass()
>>>   createInstructionCombiningPass()
>>>   createGVNPass(DisableGVNLoadPRE)
>>>   createPeepholePass()
>>> Some of them might not be necessary.
>>>
>>> 4.2 Late Instrumentation Pass:
>>> The late instrumentation runs right after the pre-inline pass and its
>>> clean-up passes. It is controlled by the opt options "-pgo-late-instr-gen"
>>> and "-pgo-late-instr-use". For the "-pgo-late-instr-use" option, the driver
>>> will provide the profile name.
>>> For "-pgo-late-instr-gen", a pass that calls createInstrProfilingPass() is
>>> also added to the PassManager to lower the instrumentation intrinsics.
>>>
>>> PGOLateInstrumentation is a module pass that applies the instrumentation
>>> to each function via the class PGOLateInstrumentationFunc. For each
>>> function, it performs the following steps:
>>>    1. First collect all the CFG edges and assign an estimated weight to
>>> each edge. Critical edges and back-edges are assigned high weights. One
>>> fake node and a few fake edges (from the fake node to the entry node, and
>>> from all the exit nodes to the fake node) are also added to the worklist.
>>>    2. Construct the MST. Edges with higher weights are put into the MST
>>> first, unless doing so would form a cycle.
>>>    3. Traverse the CFG and compute the CFG hash using a CRC32 of the index
>>> of each BB (a simplified sketch of this hashing step is shown after this
>>> list).
>>> The above three steps are the same for the profile-generate and profile-use
>>> compilations.
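>>>
>>> For illustration only (this is not the actual implementation), here is a
>>> minimal standalone sketch of the hashing in step 3: the sequence of BB
>>> indices is fed through a plain bitwise CRC32. The real pass may hash
>>> different or additional structural information.
>>> #include <cstdint>
>>> #include <cstdio>
>>> #include <vector>
>>>
>>> // Standard CRC-32 (reflected, polynomial 0xEDB88320), bitwise variant.
>>> static uint32_t crc32(const uint8_t *Data, size_t Len) {
>>>   uint32_t Crc = 0xFFFFFFFFu;
>>>   for (size_t I = 0; I < Len; ++I) {
>>>     Crc ^= Data[I];
>>>     for (int Bit = 0; Bit < 8; ++Bit)
>>>       Crc = (Crc >> 1) ^ (0xEDB88320u & (0u - (Crc & 1u)));
>>>   }
>>>   return ~Crc;
>>> }
>>>
>>> // Hash a function's CFG by feeding the index of each basic block, in
>>> // traversal order, into the CRC.
>>> static uint32_t hashCFG(const std::vector<uint32_t> &BBIndices) {
>>>   return crc32(reinterpret_cast<const uint8_t *>(BBIndices.data()),
>>>                BBIndices.size() * sizeof(uint32_t));
>>> }
>>>
>>> int main() {
>>>   std::vector<uint32_t> Indices = {0, 1, 2, 3, 4}; // e.g. the diamond CFG
>>>   std::printf("CFG hash: 0x%08x\n", hashCFG(Indices));
>>> }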
>>>
>>> In the next step, for the profile-generate compilation, all the edges that
>>> are not in the MST are instrumented. If an edge is a critical edge, it is
>>> split first. The actual instrumentation generates an
>>> Intrinsic::instrprof_increment() call in the instrumented BB. This
>>> intrinsic will later be lowered by the pass created by
>>> createInstrProfilingPass().
>>>
>>> For the profile-use compilation, first read in the counters and the CFG
>>> hash from the profile file. If the CFG hash matches, propagate the counters
>>> to all the edges in reverse topological order of the MST (a simplified
>>> sketch of this propagation follows). Once all the edge counts are
>>> available, set the branch weight metadata for the IR instructions with
>>> multiple targets. Also apply cold/hot function attributes based on the
>>> function level counts.
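>>>
>>> Again for illustration only (not the actual implementation), the counter
>>> propagation can be sketched as repeated flow conservation: whenever a node
>>> has exactly one incident edge with an unknown count, that count equals the
>>> difference between the node's known incoming and outgoing flow. Processing
>>> the spanning tree from its leaves upward resolves every tree edge. A
>>> self-contained sketch, with made-up names (self-loops are not handled):
>>> #include <cstdio>
>>> #include <vector>
>>>
>>> struct Edge { int Src, Dst; long Count; bool Known; };
>>>
>>> // Solve for the unknown (un-instrumented) edge counts using the fact that
>>> // flow into each node equals flow out of it. The CFG includes a fake edge
>>> // that connects the exit back to the entry, so conservation holds
>>> // everywhere.
>>> static void propagateCounts(int NumBB, std::vector<Edge> &Edges) {
>>>   bool Changed = true;
>>>   while (Changed) {
>>>     Changed = false;
>>>     for (int N = 0; N < NumBB; ++N) {
>>>       long Balance = 0;      // known in-flow minus known out-flow
>>>       int UnknownIdx = -1, NumUnknown = 0, Sign = 0;
>>>       for (int I = 0, E = (int)Edges.size(); I < E; ++I) {
>>>         if (Edges[I].Src != N && Edges[I].Dst != N) continue;
>>>         int S = (Edges[I].Dst == N) ? +1 : -1;   // in-edge: +, out-edge: -
>>>         if (Edges[I].Known) {
>>>           Balance += S * Edges[I].Count;
>>>         } else {
>>>           ++NumUnknown; UnknownIdx = I; Sign = S;
>>>         }
>>>       }
>>>       if (NumUnknown == 1) {   // exactly one unknown: solve for it
>>>         Edges[UnknownIdx].Count = -Sign * Balance;
>>>         Edges[UnknownIdx].Known = true;
>>>         Changed = true;
>>>       }
>>>     }
>>>   }
>>> }
>>>
>>> int main() {
>>>   // Diamond CFG plus a fake edge BB4->BB0. Only BB3->BB4 was instrumented
>>>   // (count 10), along with the fake edge (count 100, the entry count).
>>>   std::vector<Edge> Edges = {{0,1,0,false}, {1,2,0,false}, {1,3,0,false},
>>>                              {2,4,0,false}, {3,4,10,true}, {4,0,100,true}};
>>>   propagateCounts(5, Edges);
>>>   for (const Edge &E : Edges)
>>>     std::printf("BB%d->BB%d: %ld\n", E.Src, E.Dst, E.Count);
>>> }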
>>>
>>>
>>> 4.3 Profile Format:
>>>
>>> The late instrumentation profile is mostly the same as the one from
>>> front-end instrumentation. The differences are:
>>>    * Function checksums are different.
>>>    * Function entry counts are no longer available.
>>> For the llvm-profdata utility, the option -lateinstr needs to be used to
>>> differentiate FE based and late instrumentation profiles, unless a magic
>>> tag is added to the profile.
>>>
>>>
>>> 5. References:
>>> [1] Donald E. Knuth, Francis R. Stevenson. "Optimal measurement of points
>>> for program frequency counts." BIT Numerical Mathematics, Volume 13, Issue
>>> 3 (1973), pp. 313-322.
>>>
>>
>

