[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Dehao Chen via llvm-dev llvm-dev at lists.llvm.org
Thu Oct 27 11:39:15 PDT 2016


Motivation:
Many optimizations duplicate code. E.g. the loop unroller duplicates the loop
body, GVN duplicates computation, etc. The duplicated code shares the same
debug info as the original code. For SamplePGO, the debug info is used to
represent the profile, so code duplication affects profile accuracy.
Take loop unrolling as an example:

#1 foo();
#2 for (i = 0; i < N; i++) {
#3   bar();
#4 }

If N is 8 at runtime, a reasonable profile will look like:

#1: 10
#3: 80

But if the compiler unrolls the loop by a factor of 4, the callsite to
bar() is duplicated 4 times and the profile will look like:

#1: 10
#3: 20

The sample count for #3 is 20 because each of the 4 callsites to bar() is
sampled about 20 times, and they all share the same debug loc (#3), so 20 is
attributed to #3 (if one debugloc maps to multiple instructions, the maximum
sample count among those instructions is used as the debugloc's sample count).

When the compiler loads this profile, it will think the loop trip count
is 2 instead of 8.
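
Roughly speaking, the profile loader infers the trip count as
count(#3) / count(#1), which is what the proposal below tries to keep
accurate:

  accurate profile:            80 / 10 = 8
  after unrolling by 4:        20 / 10 = 2
  if the factor 4 were known:  20 * 4 / 10 = 8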

Proposal:
When the compiler duplicates code, it encodes the duplication info in the
debug info. As the duplication is not interesting to the debugger, I propose
encoding it as part of the discriminator.

There are 2 types of code duplication:

1. Duplicated code that is guaranteed to have the same execution count (e.g.
loop unrolling and loop vectorization). We record the duplication factor;
for the above example, "4" is recorded in the discriminator.
2. Duplicated code that may have different execution counts (e.g. loop
peeling and GVN). For the same debugloc, a unique number is assigned to each
copy and encoded in the discriminator.

Assume the discriminator is a uint32. The traditional discriminator is
less than 256, so let's take 8 bits for it. For the duplication factor (type 1
duplication), we assume the maximum unroll_factor * vectorize_factor is
less than 256, thus 8 bits for it. For the unique number (type 2 duplication),
we assume code is duplicated at most 32 times, thus 5 bits for it. Overall, we
still have 11 free bits left in the discriminator encoding.
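
As a rough sketch, the packing could look like the following (the exact bit
positions and helper names are illustrative, not final, as are the
discriminator values in the example below):

#include <cstdint>

// bits 0-7  : traditional discriminator (< 256)
// bits 8-15 : duplication factor, unroll_factor * vectorize_factor (< 256)
// bits 16-20: per-copy number for duplication with differing counts (< 32)
uint32_t encodeDiscriminator(uint32_t Base, uint32_t DupFactor, uint32_t CopyId) {
  return (Base & 0xff) | ((DupFactor & 0xff) << 8) | ((CopyId & 0x1f) << 16);
}

uint32_t getBaseDiscriminator(uint32_t D) { return D & 0xff; }
uint32_t getDuplicationFactor(uint32_t D) { return (D >> 8) & 0xff; }
uint32_t getCopyId(uint32_t D)            { return (D >> 16) & 0x1f; }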

Let's take the original source as an example; after loop unrolling and
peeling, the code may look like:

for (i = 0; i < (N & ~3); i += 4) {
  bar();  // discriminator: 0x40
  bar();  // discriminator: 0x40
  bar();  // discriminator: 0x40
  bar();  // discriminator: 0x40
}
if (i++ < N) {
  bar();   // discriminator: 0x100
  if (i++ < N) {
    bar(); // discriminator: 0x200
    if (i++ < N) {
      bar();  // discriminator: 0x300
    }
  }
}
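
With this encoding, the profile reader can undo the duplication when it
aggregates samples: counts from the unrolled copies are scaled back up by the
duplication factor, while the peeled copies keep distinct counts that are
summed. A rough sketch, reusing the hypothetical helpers above:

#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Combine the (discriminator, sample count) records of one source line back
// into a single count: records sharing a copy number are merged with max
// (the same rule used today for instructions sharing a debugloc), each
// merged count is scaled by its duplication factor, and distinct copies are
// summed.
uint64_t reconstructLineCount(
    const std::vector<std::pair<uint32_t, uint64_t>> &Records) {
  std::map<uint32_t, uint64_t> PerCopy; // copy number -> scaled count
  for (const auto &R : Records) {
    uint32_t Factor = std::max<uint32_t>(1, getDuplicationFactor(R.first));
    uint64_t &Slot = PerCopy[getCopyId(R.first)];
    Slot = std::max(Slot, R.second * Factor);
  }
  uint64_t Total = 0;
  for (const auto &KV : PerCopy)
    Total += KV.second;
  return Total;
}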

The cost of this change would be increased debug_line size because: 1. we
are recording more discriminators; 2. discriminators become larger and take
more bytes to encode in ULEB128.
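
For reference, ULEB128 stores 7 payload bits per byte, so discriminators up
to 0x7f take one byte, up to 0x3fff take two bytes, and the 21 bits used
above take at most three bytes. A quick standalone helper (hypothetical, just
for illustration) to see the growth:

#include <cstddef>
#include <cstdint>

// Number of bytes needed to ULEB128-encode Value (7 payload bits per byte).
size_t uleb128Size(uint64_t Value) {
  size_t Size = 0;
  do {
    Value >>= 7;
    ++Size;
  } while (Value != 0);
  return Size;
}
// uleb128Size(0x40) == 1, uleb128Size(0x300) == 2, uleb128Size(0x1fffff) == 3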

The benefit is that the SamplePGO profile can accurately represent the
code execution frequency, and we do not need to introduce new building
blocks into the debug info.

Comments?

Thanks,
Dehao