<div dir="ltr">The impact to debug_line is actually not small. I only implemented the part 1 (encoding duplication factor) for loop unrolling and loop vectorization. The debug_line size overhead for "-O2 -g1" binary of speccpu C/C++ benchmarks:<div><br></div><div>
<table cellspacing="0" border="0">
<colgroup span="2" width="85"></colgroup>
<tbody><tr>
<td height="17" align="left">433.milc</td>
<td align="right">23.59%</td>
</tr>
<tr>
<td height="17" align="left">444.namd</td>
<td align="right">6.25%</td>
</tr>
<tr>
<td height="17" align="left">447.dealII</td>
<td align="right">8.43%</td>
</tr>
<tr>
<td height="17" align="left">450.soplex</td>
<td align="right">2.41%</td>
</tr>
<tr>
<td height="17" align="left">453.povray</td>
<td align="right">5.40%</td>
</tr>
<tr>
<td height="17" align="left">470.lbm</td>
<td align="right">0.00%</td>
</tr>
<tr>
<td height="17" align="left">482.sphinx3</td>
<td align="right">7.10%</td>
</tr>
<tr>
<td height="17" align="left">400.perlbench</td>
<td align="right">2.77%</td>
</tr>
<tr>
<td height="17" align="left">401.bzip2</td>
<td align="right">9.62%</td>
</tr>
<tr>
<td height="17" align="left">403.gcc</td>
<td align="right">2.67%</td>
</tr>
<tr>
<td height="17" align="left">429.mcf</td>
<td align="right">9.54%</td>
</tr>
<tr>
<td height="17" align="left">445.gobmk</td>
<td align="right">7.40%</td>
</tr>
<tr>
<td height="17" align="left">456.hmmer</td>
<td align="right">9.79%</td>
</tr>
<tr>
<td height="17" align="left">458.sjeng</td>
<td align="right">9.98%</td>
</tr>
<tr>
<td height="17" align="left">462.libquantum</td>
<td align="right">10.90%</td>
</tr>
<tr>
<td height="17" align="left">464.h264ref</td>
<td align="right">30.21%</td>
</tr>
<tr>
<td height="17" align="left">471.omnetpp</td>
<td align="right">0.52%</td>
</tr>
<tr>
<td height="17" align="left">473.astar</td>
<td align="right">5.67%</td>
</tr>
<tr>
<td height="17" align="left">483.xalancbmk</td>
<td align="right">1.46%</td>
</tr>
<tr>
<td height="17" align="left">mean</td>
<td align="right">7.86%</td>
</tr>
</tbody></table><br></div><div>Dehao</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Oct 27, 2016 at 11:55 AM, Xinliang David Li <span dir="ltr"><<a href="mailto:davidxl@google.com" target="_blank">davidxl@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Do you have an estimate of the debug_line size increase? I guess it will be small.<span class="HOEnZb"><font color="#888888"><div><br></div><div>David</div></font></span></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Oct 27, 2016 at 11:39 AM, Dehao Chen <span dir="ltr"><<a href="mailto:dehao@google.com" target="_blank">dehao@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Motivation:<div>Many optimizations duplicate code. E.g. loop unroller duplicates the loop body, GVN duplicates computation, etc. The duplicated code will share the same debug info with the original code. For SamplePGO, the debug info is used to present the profile. Code duplication will affect profile accuracy. Taking loop unrolling for example:</div><div><br></div><div>#1 foo();</div><div>#2 for (i = 0; i < N; i++) {<br></div><div>#3 bar();</div><div>#4 }</div><div><br></div><div>If N is 8 during runtime, a reasonable profile will look like:</div><div><br></div><div>#1: 10</div><div>#3: 80<br></div><div><br></div><div>But if the compiler unrolls the loop by a factor of 4, the callsite to bar() is duplicated 4 times and the profile will look like:</div><div><br></div><div>#1: 10</div><div>#3: 20</div><div><br></div><div>The sample count for #3 is 20 because all 4 callsites to bar() are sampled 20 times each, and they shared the same debug loc (#3) so that 20 will be attributed to #3 (If one debugloc is mapped by multiple instructions, the max sample count of these instructions is used as debugloc's sample count).</div><div><br></div><div>When loading this profile into compiler, it will think the loop trip count is 2 instead of 8.</div><div><br></div><div>Proposal:</div><div>When compiler duplicates code, it encodes the duplication info in the debug info. As the duplication is not interesting to debugger, I propose to encode this as part of the discriminator.</div><div><br></div><div>There are 2 types of code duplication:</div><div><br></div><div>1. duplicated code are guaranteed to have the same execution count (e.g. loop unroll and loop vectorize). We can record the duplication factor, for the above example "4" is recorded in the discriminator.</div><div>2. duplicated code that may have different execution count (e.g. loop peel and gvn). For a same debugloc, a unique number is assigned to each copy and encoded in the discriminator.</div><div><br></div><div>Assume that the discriminator is uint32. The traditional discriminator is less than 256, let's take 8 bit for it. For duplication factor (type 1 duplication), we assume the maximum unroll_factor * vectorize_factor is less than 256, thus 8 bit for it. For unique number(type 2 duplication), we assume code is at most duplicated 32 times, thus 5 bit for it. Overall, we still have 11 free bits left in the discriminator encoding.</div><div><br></div><div>Let's take the original source as an example, after loop unrolling and peeling, the code may looks like:</div><div><br></div><div>for (i = 0; i < N & 3; i+= 4) {</div><div> foo(); // discriminator: 0x40</div><div> foo(); // discriminator: 0x40</div><div> foo(); // discriminator: 0x40</div><div> foo(); // discriminator: 0x40</div><div>}</div><div>if (i++ < N) {</div><div> foo(); // discriminator: 0x100</div><div> if (i++ < N) {</div><div> foo(); // discriminator: 0x200</div><div> if (i++ < N) {</div><div> foo(); // discriminator: 0x300</div><div> }</div><div> }</div><div>}</div><div><br></div><div>The cost of this change would be increased debug_line size because: 1. we are recording more discriminators 2. discriminators become larger and will take more ULEB128 encoding.</div><div><br></div><div>The benefit is that the sample pgo profile can accurately represent the code execution frequency. And we do not need to introduce new building blocks to debug info.</div><div><br></div><div>Comments?</div><div><br></div><div>Thanks,</div><div>Dehao</div></div>
</blockquote></div><br></div>
</div></div></blockquote></div><br></div>