[llvm-dev] GSoC Proposal : Path Profiling Support

Wed Mar 16 12:44:58 PDT 2016

Hi Vedant,

I would like to clarify that the proposal does *intra-procedural* path profiling as described in [Ball96]. 

> > This proposal adds support  for path profiling [Ball96] to LLVM. Path profiling compactly represents acyclic paths in a directed acyclic graph representation of the control flow graph of a routine. Instrumentation can be added to uniquely identify paths executed at runtime.

> In [Ball96], the authors mention that the worst case number of acyclic
> paths in a program is exponential in its size. How do you manage this? Is
> the amount of instrumentation necessary to handle this scenario also
> exponential in the size of the program? 

Paths are numbered from 0 to NumPaths-1. This can be exponential in size. In practice we find that they fit in 64-bit wide counters are enough for numbering paths within a routine. The amount of instrumentation required is not exponential. For example, every if-condition in a routine will increase the number of paths by X2, but the instrumentation points required to disambiguate between each side of the branch is only 1. Furthermore, we reduce the amount of instrumentation using [Ball94] which is proven to be optimal. 

> > Path profiles enable precise enumeration of the sequence of basic blocks executed in order for a particular path. Using path profiles has been demonstrated by [Young98] to perform better global scheduling than edge profiles.

> I haven't had time to digest [Young98]. Could you explain in some more
> detail which optimizations stand to benefit from path profile information?

[Young98] target superblock enlargement optimizations. Having precise information about the paths enables additional optimizations using value numbering and subsequent dead code elimination. In our experience, isolating paths with reduced control flow allows aggressive dead-store-elimination, dead-code-elimination and constant propagation. 

> > Over the last year, I have built a toolchain which uses path profiling as its basis.

> Do you have any numbers about the size of path profiles, compared to the
> size of raw instrprof-based profiles?

No. We use a plain text representation internally which can be further optimized.  

> In general I'm curious about how we can support users of instrprof-style
> profiles with path profiles. Do you have plans to support code coverage
> using path profiles? 

Perfect instrprof data can be derived from path profiles. Code coverage can be supported using path profiles though we do not do so as part of our implementation.  

> Can path profiles from dylibs be concatenated together
> to yield valid profiles?

Yes. As each routine is independently instrumented, the profiles can be concatenated. However, it is difficult to reason about which paths in the caller invoke which paths in the callee. Melinski and Reps describe how to do so in  Interprocedural Path Profiling (CC'99). This is beyond the scope of this proposal. 

> > + Used to analyse C/C++ benchmarks from SPECCPU2006

> I'd be interested to see any numbers you have about this. In particular,
> what's the compile-time overhead of path profiling instrumentation? What
> are the sizes of instrumented binaries, and what's the slowdown factor?

You can find similar numbers reported by [Ball96] in Table 1 of the paper. Numbers from our selected workloads are shown below. 
These numbers use 256 bit APInt counters (primary cause for slowdown) along with a DenseMap<APInt, APInt> for Id => Counter mapping. 
The runtime is implemented as a shared library. The benchmarks are selected from SPEC2006 and Parsec 3.0. Each was 
compiled to a single bitcode module and a particular function was picked for instrumentation. The function was 
determined to be (one of) the primary work functions. The bitcodes were compiled with -O2 optimization post 
instrumentation. 

Note that the frequency of paths executed is not shown in this table and is a factor for slowdown. 
Each path is executed multiple times and each path has varying number of instrumentation points.
This data shows the *worst possible case* where 256-bit APInt counters are used to calculate path ids as well
as maintain runtime frequency counts in the shared library. Note that all benchmarks are well within the
64-bit integer range for path-ids. 

Thus the data in the [Ball96] paper shows the best case overheads, while the data I have included herein 
shows the worst case overheads for instrumentation+runtime.

+ Key

numpaths = Number of possible paths
epp+compile = Time taken to compute encoding, insert instrumentation and compile to executable
compile = Time taken to compile to executable
execpaths = Number of paths dynamically executed
epp-exec-time = Execution time with instrumentation
exec-time = Normal execution time
epp-bin-size = Size of instrumented binary in bytes
bin-size = Size of binary
** size of shared library in bytes = 598042

+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| benchmark     | numpaths | epp+compile | compile   | execpaths | epp-exec-time | exec-time | epp-bin-size | bin-size |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| blks          | 2        | 0m1.036s    | 0m1.008s  | 2         | 0m3.643s      | 0m3.205s  | 155931       | 155459   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| bodytrack     | 29       | 0m4.907s    | 0m4.881s  | 5         | 0m14.786s     | 0m1.943s  | 2125256      | 2124224  |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| bzip2         | 60       | 0m1.274s    | 0m1.268s  | 3         | 0m9.441s      | 0m9.624s  | 259125       | 258477   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| ferret        | 360921   | 0m26.208s   | 0m26.102s | 40        | 0m10.342s     | 0m6.224s  | 8342571      | 8338588  |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| fluidanimate  | 384117   | 0m0.895s    | 0m0.869s  | 88        | 0m56.631s     | 0m1.294s  | 202702       | 197878   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| freqmine      | 45       | 0m1.220s    | 0m1.214s  | 18        | 0m22.150s     | 0m5.515s  | 278615       | 277656   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| gcc           | 6026     | 0m31.941s   | 0m31.327s | 125       | 1m30.139s     | 0m36.601s | 6991413      | 6991245  |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| hmmer         | 1882     | 0m3.193s    | 0m3.232s  | 65        | 0m58.911s     | 0m2.474s  | 744510       | 742806   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| mcf           | 230      | 0m0.838s    | 0m0.830s  | 10        | 0m11.097s     | 0m3.074s  | 162680       | 161736   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| mcf2000       | 1155     | 0m0.859s    | 0m0.853s  | 26        | 0m24.169s     | 0m4.625s  | 166092       | 165213   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| povray        | 17       | 0m8.543s    | 0m8.552s  | 4         | 9m24.562s     | 5m39.295s | 2388152      | 2387960  |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| sjeng         | 158740   | 0m1.648s    | 0m1.637s  | 280       | 0m20.786s     | 0m5.229s  | 368841       | 368009   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| soplex        | 30       | 0m4.849s    | 0m4.848s  | 24        | 7m28.151s     | 4m10.813s | 1244775      | 1242063  |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| sphinx        | 26       | 0m2.212s    | 0m2.198s  | 5         | 1m36.291s     | 0m13.811s | 543534       | 543358   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| streamcluster | 21121728 | 0m0.947s    | 0m0.908s  | 33        | 0m50.212s     | 0m5.986s  | 191981       | 185438   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| swaptions     | 20655    | 0m0.965s    | 0m0.950s  | 13        | 0m0.263s      | 0m0.178s  | 193841       | 184274   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| h264ref       | 24130    | 0m4.278s    | 0m4.272s  | 76        | 3m26.701s     | 3m4.461s  | 816660       | 812396   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| lbm           | 8        | 0m0.824s    | 0m0.815s  | 5         | 6m29.685s     | 1m39.180s | 150871       | 150327   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+
| namd          | 59598954 | 0m4.124s    | 0m4.139s  | 43        | 18m36.447s    | 6m50.288s | 925863       | 925271   |
+---------------+----------+-------------+-----------+-----------+---------------+-----------+--------------+----------+

> > Open Issues :
> > + Update PathProfileInfo on CFG transformations ?

> Could you clarify what this means?

Changing the control flow graph of a routine may invalidate collected path profiles. For example, splitting a block with an unconditional branch does not change the profile, but introducing a conditional branch invalidates the profile. The issue I would like to address is which transformations should we allow as safe transformations and how should we update the internal path profile data structures if we allow this at all.

> > + Verify with PGOEdge info ?

> Ditto.

Verification with PGOEdge info implies that the edge frequencies derived from path profiles and via instrprof should be equal. 

> > + Handle setjmp, longjmp, early program termination, noreturn calls

> How do you handle indirect calls?

No special handling of indirect calls as path profiles are intra-procedural and control returns to same basic block
after call in the general case. For the above mentioned cases, control may not return.

Regards,
Snehasish