[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Tue Nov 1 11:24:01 PDT 2016

On Tue, Nov 1, 2016 at 10:50 AM, Hal Finkel <hfinkel at anl.gov> wrote:

>
> ------------------------------
>
> *From: *"Dehao Chen" <dehao at google.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *"Xinliang David Li" <davidxl at google.com>, "llvm-dev" <
> llvm-dev at lists.llvm.org>
> *Sent: *Tuesday, November 1, 2016 11:43:41 AM
> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
>
>
> On Fri, Oct 28, 2016 at 3:07 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>> Hi Dehao,
>>
>> This is definitely an important problem, thanks for writing this up!
>>
>> There is a related problem that I think we can address at the same time:
>> When we multiversion code, for example when we use runtime checks to enable
>> the creation of a vectorized loop while retaining the scalar loop, and then
>> we collect profiling data, we should be able to recover the relative
>> running time of the different versions of the loop. In the future, I
>> suspect we'll end up with transformations that produce even more versions
>> of loops (Intel's compiler, for example, has a useful pragma that allows
>> the user to request loop specializations for a specified set of trip
>> counts).
>>
>> I'd like to have a scheme where the source location + discriminator can
>> be mapped to information about the relevant loop version so that our
>> profiling tools can display that information usefully to the user, and so
>> that our optimizations can make useful choices (i.e. don't bother
>> vectorizing a loop when the scalar loop is run a lot but the vectorized
>> version almost never runs).
>>
>
> That's definitely a valid and important use case, and it is important to
> sample pgo too. That's why I proposed to have "duplicated code that may
> have different execution count" being recorded. Will that suffice to get
> the info you want? (i.e. for every version of the multi-versioned loop, you
> will have a disincentive discriminator associated with all the code it
> expands.
>
> I don't know. Can you explain how the process will work? By the time the
> code/metadata arrives at, say, the loop vectorizer, how can we tell whether
> the vectorized version we might now create will be executed (based on
> having profiling data from a run where the compiler might have previously
> made a similar choice)?
>

For example, for the following loop:

for (i = 0; i < N; i++)
  stmt;

After many loop optimizations it could become something like:

if (vectorized1)
  for (i = 0;.....)
     stmt;    //uid: 1
else if (vectorized2)
  for (i = 0;.....)
    stmt;    //uid: 2
else if (unrolled)
  for (i = 0; i < N & (~0x3) i += 4)
    stmt; stmt; stmt; stmt; //uid: 3
  for (; i < N; i++)
    stmt;  //uid: 4
else
  for (; i < N; i++)
    stmt;  //uid: 5

So basically, each clone of the loop will have a uid, which you can use to
identify which version of the loop the instruction was coming from.

>
>
>
>>
>> In short, I think that differentiating these different regions using the
>> descriminator seems like a natural choice, but "And we do not need to
>> introduce new building blocks to debug info" might not save us all that
>> much in the long run. To keep information on what regions correspond to
>> what optimizations, we may need to do that. That's not a bad thing, and I'd
>> rather we solve this in a way that is extensible. Plus, this might make it
>> easier to use fewer bits, thus helping the overall impact on the size of
>> the debug sections.
>>
>
> I agree that if we want to extend this in the future, we need to have
> separate dwarf bits other than discriminator. For current use case,
> discriminator seem to be good enough. And if we encode efficiently, it will
> be better than introducing a new field. e.g., we can encode all info in a
> 1-byte ULEB128 85%~90% of the time, but for a new field, we will at least
> need 2 bytes if both discriminator and cloning info exists for an
> instruction.
>
> Is this because you need at least one more byte for the debug-info field?
>
> In general, I don't really care where we stuff the bits so long as we can
> get the necessary information back out. For a vectorized loop, for example,
> we should be able to figure out which counts go to the vectorized loop vs.
> the scalar loop. I don't, however, want to end up with something that is
> super-non-extensible (e.g. we have only a few bits, so the vectorizer can
> get one and the unroller can get one, but loop distribution is out of
> luck). Maybe we need an 'extension' bit saying that there is more
> information encoded elsewhere?
>

As illustrated in the above example, it is not like "vectorization has a
distinct bit". All different optimizations make clones of code which will
be labeled by UIDs represented by N (e.g. 8) bits. In this way, the space
will be capped by the number of clones all optimizations have made, instead
of # of optimizations that has applied. And it will be capped at 2^N-1. The
cons of using uid is that you will not know if a clone is coming from
vectorization or unroll or loop distribution.

Thanks,
Dehao

>
> Thanks again,
> Hal
>
>
> Dehao
>
>
>>
>> Thanks again,
>> Hal
>>
>> ------------------------------
>>
>> > From: "Dehao Chen via llvm-dev" <llvm-dev at lists.llvm.org>
>> > To: llvm-dev at lists.llvm.org
>> > Cc: "Xinliang David Li" <davidxl at google.com>
>> > Sent: Thursday, October 27, 2016 1:39:15 PM
>> > Subject: [llvm-dev] (RFC) Encoding code duplication factor in
>> discriminator
>> >
>> > Motivation:
>> > Many optimizations duplicate code. E.g. loop unroller duplicates the
>> > loop body, GVN duplicates computation, etc. The duplicated code will
>> > share the same debug info with the original code. For SamplePGO, the
>> > debug info is used to present the profile. Code duplication will
>> > affect profile accuracy. Taking loop unrolling for example:
>> >
>> >
>> > #1 foo();
>> > #2 for (i = 0; i < N; i++) {
>> >
>> > #3 bar();
>> > #4 }
>> >
>> >
>> > If N is 8 during runtime, a reasonable profile will look like:
>> >
>> >
>> > #1: 10
>> > #3: 80
>> >
>> >
>> >
>> > But if the compiler unrolls the loop by a factor of 4, the callsite
>> > to bar() is duplicated 4 times and the profile will look like:
>> >
>> >
>> > #1: 10
>> > #3: 20
>> >
>> >
>> > The sample count for #3 is 20 because all 4 callsites to bar() are
>> > sampled 20 times each, and they shared the same debug loc (#3) so
>> > that 20 will be attributed to #3 (If one debugloc is mapped by
>> > multiple instructions, the max sample count of these instructions is
>> > used as debugloc's sample count).
>> >
>> >
>> > When loading this profile into compiler, it will think the loop trip
>> > count is 2 instead of 8.
>> >
>> >
>> > Proposal:
>> > When compiler duplicates code, it encodes the duplication info in the
>> > debug info. As the duplication is not interesting to debugger, I
>> > propose to encode this as part of the discriminator.
>> >
>> >
>> > There are 2 types of code duplication:
>> >
>> >
>> > 1. duplicated code are guaranteed to have the same execution count
>> > (e.g. loop unroll and loop vectorize). We can record the duplication
>> > factor, for the above example "4" is recorded in the discriminator.
>> > 2. duplicated code that may have different execution count (e.g. loop
>> > peel and gvn). For a same debugloc, a unique number is assigned to
>> > each copy and encoded in the discriminator.
>> >
>> >
>> > Assume that the discriminator is uint32. The traditional
>> > discriminator is less than 256, let's take 8 bit for it. For
>> > duplication factor (type 1 duplication), we assume the maximum
>> > unroll_factor * vectorize_factor is less than 256, thus 8 bit for
>> > it. For unique number(type 2 duplication), we assume code is at most
>> > duplicated 32 times, thus 5 bit for it. Overall, we still have 11
>> > free bits left in the discriminator encoding.
>> >
>> >
>> > Let's take the original source as an example, after loop unrolling
>> > and peeling, the code may looks like:
>> >
>> >
>> > for (i = 0; i < N & 3; i+= 4) {
>> > foo(); // discriminator: 0x40
>> > foo(); // discriminator: 0x40
>> > foo(); // discriminator: 0x40
>> > foo(); // discriminator: 0x40
>> > }
>> > if (i++ < N) {
>> > foo(); // discriminator: 0x100
>> > if (i++ < N) {
>> > foo(); // discriminator: 0x200
>> > if (i++ < N) {
>> > foo(); // discriminator: 0x300
>> > }
>> > }
>> > }
>> >
>> >
>> > The cost of this change would be increased debug_line size because:
>> > 1. we are recording more discriminators 2. discriminators become
>> > larger and will take more ULEB128 encoding.
>> >
>> >
>> > The benefit is that the sample pgo profile can accurately represent
>> > the code execution frequency. And we do not need to introduce new
>> > building blocks to debug info.
>> >
>> >
>> > Comments?
>> >
>> >
>> > Thanks,
>> > Dehao
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > llvm-dev at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> >
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/5e4785fe/attachment-0001.html>