[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Thu Oct 27 11:44:52 PDT 2016

Is there prior art for this sort of thing (in GCC, for example) - I take it
this isn't the first time this has come up as a problem for profile
accuracy? (so it'd be helpful to know prior solutions to this (& if we're
not doing whatever was done before, what it is about our situation that's
different, etc), or why it hasn't been a problem, etc)

On Thu, Oct 27, 2016 at 11:39 AM Dehao Chen <dehao at google.com> wrote:

> Motivation:
> Many optimizations duplicate code. E.g. loop unroller duplicates the loop
> body, GVN duplicates computation, etc. The duplicated code will share the
> same debug info with the original code. For SamplePGO, the debug info is
> used to present the profile. Code duplication will affect profile accuracy.
> Taking loop unrolling for example:
>
> #1 foo();
> #2 for (i = 0; i < N; i++) {
> #3   bar();
> #4 }
>
> If N is 8 during runtime, a reasonable profile will look like:
>
> #1: 10
> #3: 80
>
> But if the compiler unrolls the loop by a factor of 4, the callsite to
> bar() is duplicated 4 times and the profile will look like:
>
> #1: 10
> #3: 20
>
> The sample count for #3 is 20 because all 4 callsites to bar() are sampled
> 20 times each, and they shared the same debug loc (#3) so that 20 will be
> attributed to #3 (If one debugloc is mapped by multiple instructions, the
> max sample count of these instructions is used as debugloc's sample count).
>
> When loading this profile into compiler, it will think the loop trip count
> is 2 instead of 8.
>
> Proposal:
> When compiler duplicates code, it encodes the duplication info in the
> debug info. As the duplication is not interesting to debugger, I propose to
> encode this as part of the discriminator.
>
> There are 2 types of code duplication:
>
> 1. duplicated code are guaranteed to have the same execution count (e.g.
> loop unroll and loop vectorize). We can record the duplication factor, for
> the above example "4" is recorded in the discriminator.
> 2. duplicated code that may have different execution count (e.g. loop peel
> and gvn). For a same debugloc, a unique number is assigned to each copy and
> encoded in the discriminator.
>
> Assume that the discriminator is uint32. The traditional discriminator is
> less than 256, let's take 8 bit for it. For duplication factor (type 1
> duplication), we assume the maximum unroll_factor * vectorize_factor is
> less than 256, thus 8 bit for it. For unique number(type 2 duplication), we
> assume code is at most duplicated 32 times, thus 5 bit for it. Overall, we
> still have 11 free bits left in the discriminator encoding.
>
> Let's take the original source as an example, after loop unrolling and
> peeling, the code may looks like:
>
> for (i = 0; i < N & 3; i+= 4) {
>   foo();  // discriminator: 0x40
>   foo();  // discriminator: 0x40
>   foo();  // discriminator: 0x40
>   foo();  // discriminator: 0x40
> }
> if (i++ < N) {
>   foo();   // discriminator: 0x100
>   if (i++ < N) {
>     foo(); // discriminator: 0x200
>     if (i++ < N) {
>       foo();  // discriminator: 0x300
>     }
>   }
> }
>
> The cost of this change would be increased debug_line size because: 1. we
> are recording more discriminators 2. discriminators become larger and will
> take more ULEB128 encoding.
>
> The benefit is that the sample pgo profile can accurately represent the
> code execution frequency. And we do not need to introduce new building
> blocks to debug info.
>
> Comments?
>
> Thanks,
> Dehao
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161027/558f732e/attachment.html>