<html>
<head>
<base href="https://llvm.org/bugs/" />
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - PGO instrumentation profile data is not reflected in correct basic blocks"
href="https://llvm.org/bugs/show_bug.cgi?id=27024">27024</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>PGO instrumentation profile data is not reflected in correct basic blocks
</td>
</tr>
<tr>
<th>Product</th>
<td>tools
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>Other
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>opt
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>suganuma@jp.ibm.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr>
<tr>
<th>Classification</th>
<td>Unclassified
</td>
</tr></table>
<p>
<div>
<pre>It seems that the instrumented BBs do not match between the two compilations
(profile-gen and profile-use) in some cases. An example is SPECcpu2006 lbm.
In the first compilation, we have 5 instrumentation points for the main
function.
After running a training workload, the profile data for the main function has
block counts [0, 300, 4, 1, 1], but these count values are assigned to incorrect
BBs in the second compilation.
Here are the steps to reproduce. The test case is the SPECcpu2006 lbm benchmark
(consisting of just two modules: main.c and lbm.c).
1. Compile for instrumentation generation
$ clang -c -o lbm.bc -O2 -m64 -emit-llvm -DSPEC_CPU -DNDEBUG lbm.c
$ clang -c -o main.bc -O2 -m64 -emit-llvm -DSPEC_CPU -DNDEBUG main.c
$ llvm-link -o _all_combined.bc lbm.bc main.bc
$ opt -pgo-instr-gen -instrprof _all_combined.bc -o _all_combined_inst.bc
$ clang -o lbm _all_combined_inst.bc -O2 -m64 -fprofile-instr-generate -lm
2. Run lbm with training input
$ ./lbm 300 reference.dat 0 1 100_100_130_cf_b.of >lbm.train.out 2>lbm.train.err
$ llvm-profdata merge default.profraw -o code.profdata
3. Reoptimize with profile data
$ opt -analyze -pgo-instr-use _all_combined.bc
If we specify -debug-only=pgo-instrumentation for the opt command, we get the
following information for the main function in Step 1.
Dump Function main Hash: 61483163021 after CFGMST
Number of Basic Blocks: 10
BB: FakeNode Index=0
BB: entry Index=1
BB: for.end Index=2
BB: for.end.loopexit Index=9
BB: for.inc Index=8
BB: if.then5 Index=7
BB: if.end Index=6
BB: if.then Index=5
BB: for.body Index=4
BB: for.body.lr.ph Index=3
Number of Edges: 14 (*: Instrument, C: CriticalEdge, -: Removed)
Edge 0: 8-->4 c W=247031
Edge 1: 6-->8 c W=159375
Edge 2: 4-->6 *c W=127500
Edge 3: 1-->2 c W=4500
Edge 4: 4-->5 W=127
Edge 5: 5-->6 * W=127
Edge 6: 6-->7 W=95
Edge 7: 7-->8 * W=95
Edge 8: 0-->1 W=12
Edge 9: 2-->0 * W=12
Edge 10: 3-->4 W=8
Edge 11: 9-->2 W=8
Edge 12: 1-->3 W=7
Edge 13: 8-->9 * W=7
Split critical edge: 4 --> 6
Adding Instrumentation in BB Name=for.body.if.end_crit_edge
Adding Instrumentation in BB Name=if.then
Adding Instrumentation in BB Name=if.then5
Adding Instrumentation in BB Name=for.end
Adding Instrumentation in BB Name=for.end.loopexit
Note that the last 5 lines above are my additional debug statements.
In Step 3, however, we get the following from the debug statements.
Dump Function main Hash: 61483163021 after CFGMST
Number of Basic Blocks: 10
BB: FakeNode Index=0
BB: for.end.loopexit Index=9
BB: for.inc Index=8
BB: if.then5 Index=7
BB: for.end Index=2
BB: for.body.lr.ph Index=3
BB: entry Index=1
BB: if.end Index=6
BB: if.then Index=5
BB: for.body Index=4
Number of Edges: 14 (*: Instrument, C: CriticalEdge, -: Removed)
Edge 0: 8-->4 c W=247031
Edge 1: 6-->8 c W=159375
Edge 2: 4-->6 *c W=127500
Edge 3: 1-->2 c W=127058
Edge 4: 0-->1 W=135
Edge 5: 2-->0 * W=135
Edge 6: 4-->5 W=127
Edge 7: 5-->6 * W=127
Edge 8: 6-->7 W=95
Edge 9: 7-->8 * W=95
Edge 10: 3-->4 W=8
Edge 11: 9-->2 W=8
Edge 12: 1-->3 W=7
Edge 13: 8-->9 * W=7
5 counts
0: 0
1: 300
2: 4
3: 1
4: 1
SUM = 306
Split critical edge: 4 --> 6
Setting BB Name=for.body.if.end_crit_edge with CountValue=0
Setting BB Name=for.end with CountValue=300
Setting BB Name=if.then with CountValue=4
Setting BB Name=if.then5 with CountValue=1
Setting BB Name=for.end.loopexit with CountValue=1
CountValue 300 should go to BB if.then (Index 5), not for.end (Index 2).
Because of this incorrect assignment, the entry count of the main function is
set to 300 instead of 1 (after populating the count values).
The reason for this problem is that the CFGMST edges are ordered differently
because of different weight values (edges 0 --> 1 and 2 --> 0 get W=12 in Step 1,
but W=135 in Step 3). The weight values are computed from block frequency info
and branch probability info, but for some reason these produce different values
in the two compilations.</pre>
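<pre>The effect of the reordering can be illustrated with a small standalone sketch.
It is not the LLVM implementation: it assumes a plain Kruskal-style pass over
the edges in the order printed in the two dumps above, treats every edge that
would close a cycle as an instrumented edge, and takes the block that receives
each counter directly from the debug output. The names GEN_EDGES, USE_EDGES,
INSTR_BB and instrumented_blocks are made up for this illustration.

```python
# Edges (src, dst, weight) in the sorted order printed by each dump.
GEN_EDGES = [(8, 4, 247031), (6, 8, 159375), (4, 6, 127500), (1, 2, 4500),
             (4, 5, 127), (5, 6, 127), (6, 7, 95), (7, 8, 95),
             (0, 1, 12), (2, 0, 12), (3, 4, 8), (9, 2, 8),
             (1, 3, 7), (8, 9, 7)]
USE_EDGES = [(8, 4, 247031), (6, 8, 159375), (4, 6, 127500), (1, 2, 127058),
             (0, 1, 135), (2, 0, 135), (4, 5, 127), (5, 6, 127),
             (6, 7, 95), (7, 8, 95), (3, 4, 8), (9, 2, 8),
             (1, 3, 7), (8, 9, 7)]

# Block that receives the counter for each instrumented edge, copied from
# the "Adding Instrumentation in BB" lines of the Step 1 log.
INSTR_BB = {(4, 6): "for.body.if.end_crit_edge", (5, 6): "if.then",
            (7, 8): "if.then5", (2, 0): "for.end", (8, 9): "for.end.loopexit"}

# Counter values read back from the profile, in counter-index order.
COUNTS = [0, 300, 4, 1, 1]


def instrumented_blocks(edges, num_nodes=10):
    """Return instrumented blocks in counter order (assumed plain Kruskal)."""
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    blocks = []
    for src, dst, _weight in edges:
        a, b = find(src), find(dst)
        if a == b:
            # Edge closes a cycle: it stays out of the tree and gets a counter.
            blocks.append(INSTR_BB[(src, dst)])
        else:
            parent[a] = b
    return blocks


gen_order = instrumented_blocks(GEN_EDGES)   # order used when writing counters
use_order = instrumented_blocks(USE_EDGES)   # order used when reading them back
intended = dict(zip(gen_order, COUNTS))
applied = dict(zip(use_order, COUNTS))
for bb in gen_order:
    print(f"{bb}: intended={intended[bb]} applied={applied[bb]}")
```

Under these assumptions the sketch selects the same five blocks in both runs,
but in a different order, so the counter at index 1 (300) is written for
if.then and read back for for.end.</pre>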
</div>
</p>
</body>
</html>