<html><head><style type='text/css'>p { margin: 0; }</style></head><body><div style='font-family: arial,helvetica,sans-serif; font-size: 10pt; color: #000000'><br><hr id="zwchr"><blockquote style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px; color: rgb(0, 0, 0); font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica,Arial,sans-serif; font-size: 12pt;"><b>From: </b>"Dehao Chen" <dehao@google.com><br><b>To: </b>"Hal Finkel" <hfinkel@anl.gov><br><b>Cc: </b>"Xinliang David Li" <davidxl@google.com>, "llvm-dev" <llvm-dev@lists.llvm.org><br><b>Sent: </b>Tuesday, November 1, 2016 11:43:41 AM<br><b>Subject: </b>Re: [llvm-dev] (RFC) Encoding code duplication factor in discriminator<br><br><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Oct 28, 2016 at 3:07 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">Hi Dehao,<br>

<br>

This is definitely an important problem, thanks for writing this up!<br>

<br>

There is a related problem that I think we can address at the same time: When we multiversion code, for example when we use runtime checks to enable the creation of a vectorized loop while retaining the scalar loop, and then we collect profiling data, we should be able to recover the relative running time of the different versions of the loop. In the future, I suspect we'll end up with transformations that produce even more versions of loops (Intel's compiler, for example, has a useful pragma that allows the user to request loop specializations for a specified set of trip counts).<br>

<br>

I'd like to have a scheme where the source location + discriminator can be mapped to information about the relevant loop version so that our profiling tools can display that information usefully to the user, and so that our optimizations can make useful choices (i.e. don't bother vectorizing a loop when the scalar loop is run a lot but the vectorized version almost never runs).<br></blockquote><div><br></div><div>That's definitely a valid and important use case, and it is important for sample PGO too. That's why I proposed to have "<span id="DWT6179" style="font-size: 12.8px;">duplicated code that may have different execution count" being recorded. Will that suffice to get the info you want? (i.e. for every version of the multi-versioned loop, you will have a distinct discriminator associated with all the code it expands.)</span></div></div></div></div></blockquote>I don't know. Can you explain how the process will work? By the time the code/metadata arrives at, say, the loop vectorizer, how can we tell whether the vectorized version we might now create will be executed (based on having profiling data from a run where the compiler might have previously made a similar choice)?<br><blockquote style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px; color: rgb(0, 0, 0); font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica,Arial,sans-serif; font-size: 12pt;"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><span style="font-size: 12.8px;"></span></div><div> </div><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<br>

In short, I think that differentiating these different regions using the discriminator seems like a natural choice, but "And we do not need to introduce new building blocks to debug info" might not save us all that much in the long run. To keep information on what regions correspond to what optimizations, we may need to do that. That's not a bad thing, and I'd rather we solve this in a way that is extensible. Plus, this might make it easier to use fewer bits, thus helping the overall impact on the size of the debug sections.<br></blockquote><div><br></div><div id="DWT6180">I agree that if we want to extend this in the future, we need to have separate DWARF bits other than the discriminator. For the current use case, the discriminator seems to be good enough. And if we encode efficiently, it will be better than introducing a new field. E.g., we can encode all info in a 1-byte ULEB128 85%~90% of the time, but for a new field, we will need at least 2 bytes if both discriminator and cloning info exist for an instruction.</div></div></div></div></blockquote>Is this because you need at least one more byte for the debug-info field?<br><br>In general, I don't really care where we stuff the bits so long as we can get the necessary information back out. For a vectorized loop, for example, we should be able to figure out which counts go to the vectorized loop vs. the scalar loop. I don't, however, want to end up with something that is super-non-extensible (e.g. we have only a few bits, so the vectorizer can get one and the unroller can get one, but loop distribution is out of luck). 
Maybe we need an 'extension' bit saying that there is more information encoded elsewhere?<br><br>Thanks again,<br>Hal<br><blockquote style="border-left: 2px solid rgb(16, 16, 255); margin-left: 5px; padding-left: 5px; color: rgb(0, 0, 0); font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica,Arial,sans-serif; font-size: 12pt;"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div></div><div><br></div><div>Dehao</div><div> </div><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">

<br>

Thanks again,<br>

Hal<br>

<div><div class="gmail-h5"><br>

<hr id="zwchr"><br>

> From: "Dehao Chen via llvm-dev" <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>><br>

> To: <a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>

> Cc: "Xinliang David Li" <<a href="mailto:davidxl@google.com" target="_blank">davidxl@google.com</a>><br>

> Sent: Thursday, October 27, 2016 1:39:15 PM<br>

> Subject: [llvm-dev] (RFC) Encoding code duplication factor in discriminator<br>

><br>

> Motivation:<br>

> Many optimizations duplicate code. E.g. loop unroller duplicates the<br>

> loop body, GVN duplicates computation, etc. The duplicated code will<br>

> share the same debug info with the original code. For SamplePGO, the<br>

> debug info is used to present the profile. Code duplication will<br>

> affect profile accuracy. Taking loop unrolling for example:<br>

><br>

><br>

> #1 foo();<br>

> #2 for (i = 0; i < N; i++) {<br>

><br>

> #3 bar();<br>

> #4 }<br>

><br>

><br>

> If N is 8 during runtime, a reasonable profile will look like:<br>

><br>

><br>

> #1: 10<br>

> #3: 80<br>

><br>

><br>

><br>

> But if the compiler unrolls the loop by a factor of 4, the callsite<br>

> to bar() is duplicated 4 times and the profile will look like:<br>

><br>

><br>

> #1: 10<br>

> #3: 20<br>

><br>

><br>

> The sample count for #3 is 20 because all 4 callsites to bar() are<br>

> sampled 20 times each, and they share the same debug loc (#3), so<br>

> 20 is attributed to #3 (if multiple instructions map to one<br>

> debugloc, the max sample count of these instructions is used as the<br>

> debugloc's sample count).<br>

><br>

><br>

> When loading this profile into the compiler, it will think the loop trip<br>

> count is 2 instead of 8.<br>
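<br>
A minimal sketch of the arithmetic above, in Python for illustration only. The function names are hypothetical and not part of LLVM; the optional scaling factor is an assumption showing how a recorded duplication factor could restore the expected count.<br>

```python
def line_sample_count(per_instruction_samples, duplication_factor=1):
    """Sample count for one debug location.

    Uses the max over all instructions that share the location, then
    scales by a known duplication factor (1 means "no information").
    """
    return max(per_instruction_samples) * duplication_factor


def estimated_trip_count(body_count, preheader_count):
    """Loop trip count as seen by the profile: body count / entry count."""
    return body_count // preheader_count


foo_count = 10                         # line #1 in the example
unrolled_bar_samples = [20, 20, 20, 20]  # four copies of the bar() call at line #3

# Without duplication info, line #3 gets 20 and the loop looks like 2 trips.
naive = estimated_trip_count(line_sample_count(unrolled_bar_samples), foo_count)

# With a recorded factor of 4, line #3 gets 80 and the loop looks like 8 trips.
corrected = estimated_trip_count(
    line_sample_count(unrolled_bar_samples, duplication_factor=4), foo_count
)

print(naive, corrected)  # prints "2 8"
```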

><br>

><br>

> Proposal:<br>

> When the compiler duplicates code, it encodes the duplication info in the<br>

> debug info. As the duplication is not interesting to the debugger, I<br>

> propose to encode this as part of the discriminator.<br>

><br>

><br>

> There are 2 types of code duplication:<br>

><br>

><br>

> 1. duplicated code that is guaranteed to have the same execution count<br>

> (e.g. loop unroll and loop vectorize). We can record the duplication<br>

> factor; for the above example, "4" is recorded in the discriminator.<br>

> 2. duplicated code that may have different execution count (e.g. loop<br>

> peel and gvn). For the same debugloc, a unique number is assigned to<br>

> each copy and encoded in the discriminator.<br>

><br>

><br>

> Assume that the discriminator is uint32. The traditional<br>

> discriminator is less than 256, so let's take 8 bits for it. For<br>

> duplication factor (type 1 duplication), we assume the maximum<br>

> unroll_factor * vectorize_factor is less than 256, thus 8 bits for<br>

> it. For unique number (type 2 duplication), we assume code is at most<br>

> duplicated 32 times, thus 5 bits for it. Overall, we still have 11<br>

> free bits left in the discriminator encoding.<br>
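<br>
One way to picture the bit budget above is the following Python sketch. The field order (base discriminator in the low 8 bits, then the duplication factor, then the copy number) is an assumption made for illustration; the message only gives the field widths. All names are hypothetical.<br>

```python
# Field widths taken from the text; the ordering of the fields is assumed.
BASE_BITS = 8     # traditional discriminator
FACTOR_BITS = 8   # type 1: duplication factor
COPY_BITS = 5     # type 2: unique copy number
FREE_BITS = 32 - (BASE_BITS + FACTOR_BITS + COPY_BITS)


def encode(base, factor, copy_id):
    """Pack the three fields into one 32-bit discriminator value."""
    for name, value, bits in (
        ("base", base, BASE_BITS),
        ("factor", factor, FACTOR_BITS),
        ("copy_id", copy_id, COPY_BITS),
    ):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return base | (factor << BASE_BITS) | (copy_id << (BASE_BITS + FACTOR_BITS))


def decode(discriminator):
    """Unpack a discriminator into (base, factor, copy_id)."""
    base = discriminator & ((1 << BASE_BITS) - 1)
    factor = (discriminator >> BASE_BITS) & ((1 << FACTOR_BITS) - 1)
    copy_id = (discriminator >> (BASE_BITS + FACTOR_BITS)) & ((1 << COPY_BITS) - 1)
    return base, factor, copy_id


print(FREE_BITS)                 # prints "11"
print(hex(encode(1, 4, 0)))      # prints "0x401"
print(decode(encode(3, 4, 2)))   # prints "(3, 4, 2)"
```

Under this assumed layout, any non-zero duplication factor or copy number pushes the value past 7 bits, which matters for the ULEB128 size discussed in the thread.<br>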

><br>

><br>

> Let's take the original source as an example. After loop unrolling<br>

> and peeling, the code may look like:<br>

><br>

><br>

> for (i = 0; i < (N & ~3); i += 4) {<br>

> bar(); // discriminator: 0x40<br>

> bar(); // discriminator: 0x40<br>

> bar(); // discriminator: 0x40<br>

> bar(); // discriminator: 0x40<br>

> }<br>

> if (i++ < N) {<br>

> bar(); // discriminator: 0x100<br>

> if (i++ < N) {<br>

> bar(); // discriminator: 0x200<br>

> if (i++ < N) {<br>

> bar(); // discriminator: 0x300<br>

> }<br>

> }<br>

> }<br>

><br>

><br>

> The cost of this change would be increased debug_line size because:<br>

> 1. we are recording more discriminators 2. discriminators become<br>

> larger and will take more bytes in ULEB128 encoding.<br>
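<br>
To make the size cost concrete, here is a small Python sketch of standard ULEB128 encoding (7 payload bits per byte, high bit set on every byte except the last). The helper name is hypothetical; the sample values are the discriminators from the example above.<br>

```python
def uleb128_encode(value):
    """Encode a non-negative integer as ULEB128 bytes."""
    if value < 0:
        raise ValueError("ULEB128 encodes unsigned values only")
    out = bytearray()
    while True:
        byte = value & 0x7F      # low 7 bits of what is left
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)


for discriminator in (0x40, 0x100, 0x200, 0x300):
    print(hex(discriminator), len(uleb128_encode(discriminator)))
# prints:
# 0x40 1
# 0x100 2
# 0x200 2
# 0x300 2
```

Any discriminator below 128 fits in one byte; values from 128 up to 16383 need two.<br>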

><br>

><br>

> The benefit is that the sample pgo profile can accurately represent<br>

> the code execution frequency. And we do not need to introduce new<br>

> building blocks to debug info.<br>

><br>

><br>

> Comments?<br>

><br>

><br>

> Thanks,<br>

> Dehao<br>

</div></div>> _______________________________________________<br>

> LLVM Developers mailing list<br>

> <a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>

> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

><br>

<span class="gmail-HOEnZb"><font color="#888888"><br>

--<br>

Hal Finkel<br>

Lead, Compiler Technology and Programming Languages<br>

Leadership Computing Facility<br>

Argonne National Laboratory<br>

</font></span></blockquote></div><br></div></div>

</blockquote><br><br><br>-- <br><div><span name="x"></span>Hal Finkel<br>Lead, Compiler Technology and Programming Languages<br>Leadership Computing Facility<br>Argonne National Laboratory<span name="x"></span><br></div></div></body></html>