[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Tue Nov 1 11:27:44 PDT 2016

----- Original Message -----

> From: "Dehao Chen" <dehao at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Xinliang David Li" <davidxl at google.com>, "llvm-dev"
> <llvm-dev at lists.llvm.org>
> Sent: Tuesday, November 1, 2016 1:24:01 PM
> Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator

> On Tue, Nov 1, 2016 at 10:50 AM, Hal Finkel < hfinkel at anl.gov >
> wrote:

> > > From: "Dehao Chen" < dehao at google.com >
> > 
> 
> > > To: "Hal Finkel" < hfinkel at anl.gov >
> > 
> 
> > > Cc: "Xinliang David Li" < davidxl at google.com >, "llvm-dev" <
> > > llvm-dev at lists.llvm.org >
> > 
> 
> > > Sent: Tuesday, November 1, 2016 11:43:41 AM
> > 
> 
> > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> > > discriminator
> > 
> 

> > > On Fri, Oct 28, 2016 at 3:07 PM, Hal Finkel < hfinkel at anl.gov >
> > > wrote:
> > 
> 

> > > > Hi Dehao,
> > > 
> > 
> 

> > > > This is definitely an important problem, thanks for writing
> > > > this
> > > > up!
> > > 
> > 
> 

> > > > There is a related problem that I think we can address at the
> > > > same
> > > > time: When we multiversion code, for example when we use
> > > > runtime
> > > > checks to enable the creation of a vectorized loop while
> > > > retaining
> > > > the scalar loop, and then we collect profiling data, we should
> > > > be
> > > > able to recover the relative running time of the different
> > > > versions
> > > > of the loop. In the future, I suspect we'll end up with
> > > > transformations that produce even more versions of loops
> > > > (Intel's
> > > > compiler, for example, has a useful pragma that allows the user
> > > > to
> > > > request loop specializations for a specified set of trip
> > > > counts).
> > > 
> > 
> 

> > > > I'd like to have a scheme where the source location +
> > > > discriminator
> > > > can be mapped to information about the relevant loop version so
> > > > that
> > > > our profiling tools can display that information usefully to
> > > > the
> > > > user, and so that our optimizations can make useful choices
> > > > (i.e.
> > > > don't bother vectorizing a loop when the scalar loop is run a
> > > > lot
> > > > but the vectorized version almost never runs).
> > > 
> > 
> 

> > > That's definitely a valid and important use case, and it is
> > > important
> > > to sample pgo too. That's why I proposed to have " duplicated
> > > code
> > > that may have different execution count" being recorded. Will
> > > that
> > > suffice to get the info you want? (i.e. for every version of the
> > > multi-versioned loop, you will have a disincentive discriminator
> > > associated with all the code it expands.
> > 
> 
> > I don't know. Can you explain how the process will work? By the
> > time
> > the code/metadata arrives at, say, the loop vectorizer, how can we
> > tell whether the vectorized version we might now create will be
> > executed (based on having profiling data from a run where the
> > compiler might have previously made a similar choice)?
> 
> For example, for the following loop:

> for (i = 0; i < N; i++)
> stmt;

> After many loop optimizations it could become something like:

> if (vectorized1)
> for (i = 0;.....)
> stmt; //uid: 1
> else if (vectorized2)
> for (i = 0;.....)
> stmt; //uid: 2
> else if (unrolled)
> for (i = 0; i < N & (~0x3) i += 4)
> stmt; stmt; stmt; stmt; //uid: 3
> for (; i < N; i++)
> stmt; //uid: 4
> else
> for (; i < N; i++)
> stmt; //uid: 5

> So basically, each clone of the loop will have a uid, which you can
> use to identify which version of the loop the instruction was coming
> from.

> > > > In short, I think that differentiating these different regions
> > > > using
> > > > the descriminator seems like a natural choice, but "And we do
> > > > not
> > > > need to introduce new building blocks to debug info" might not
> > > > save
> > > > us all that much in the long run. To keep information on what
> > > > regions correspond to what optimizations, we may need to do
> > > > that.
> > > > That's not a bad thing, and I'd rather we solve this in a way
> > > > that
> > > > is extensible. Plus, this might make it easier to use fewer
> > > > bits,
> > > > thus helping the overall impact on the size of the debug
> > > > sections.
> > > 
> > 
> 

> > > I agree that if we want to extend this in the future, we need to
> > > have
> > > separate dwarf bits other than discriminator. For current use
> > > case,
> > > discriminator seem to be good enough. And if we encode
> > > efficiently,
> > > it will be better than introducing a new field. e.g., we can
> > > encode
> > > all info in a 1-byte ULEB128 85%~90% of the time, but for a new
> > > field, we will at least need 2 bytes if both discriminator and
> > > cloning info exists for an instruction.
> > 
> 
> > Is this because you need at least one more byte for the debug-info
> > field?
> 

> > In general, I don't really care where we stuff the bits so long as
> > we
> > can get the necessary information back out. For a vectorized loop,
> > for example, we should be able to figure out which counts go to the
> > vectorized loop vs. the scalar loop. I don't, however, want to end
> > up with something that is super-non-extensible (e.g. we have only a
> > few bits, so the vectorizer can get one and the unroller can get
> > one, but loop distribution is out of luck). Maybe we need an
> > 'extension' bit saying that there is more information encoded
> > elsewhere?
> 

> As illustrated in the above example, it is not like "vectorization
> has a distinct bit". All different optimizations make clones of code
> which will be labeled by UIDs represented by N (e.g. 8) bits. In
> this way, the space will be capped by the number of clones all
> optimizations have made, instead of # of optimizations that has
> applied. And it will be capped at 2^N-1. The cons of using uid is
> that you will not know if a clone is coming from vectorization or
> unroll or loop distribution.
Okay, but that kind of semantic mapping is important. How should we encode/recover that information? To be clear, I'm not saying that we need to implement that up front, but there needs to be a clear path to an implementation, because I don't want to have two disjoint schemes. 

Thanks again, 
Hal 

> Thanks,
> Dehao

> > Thanks again,
> 
> > Hal
> 

> > > Dehao
> > 
> 

> > > > Thanks again,
> > > 
> > 
> 
> > > > Hal
> > > 
> > 
> 

> > > > > From: "Dehao Chen via llvm-dev" < llvm-dev at lists.llvm.org >
> > > 
> > 
> 
> > > > > To: llvm-dev at lists.llvm.org
> > > 
> > 
> 
> > > > > Cc: "Xinliang David Li" < davidxl at google.com >
> > > 
> > 
> 
> > > > > Sent: Thursday, October 27, 2016 1:39:15 PM
> > > 
> > 
> 
> > > > > Subject: [llvm-dev] (RFC) Encoding code duplication factor in
> > > > > discriminator
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Motivation:
> > > 
> > 
> 
> > > > > Many optimizations duplicate code. E.g. loop unroller
> > > > > duplicates
> > > > > the
> > > 
> > 
> 
> > > > > loop body, GVN duplicates computation, etc. The duplicated
> > > > > code
> > > > > will
> > > 
> > 
> 
> > > > > share the same debug info with the original code. For
> > > > > SamplePGO,
> > > > > the
> > > 
> > 
> 
> > > > > debug info is used to present the profile. Code duplication
> > > > > will
> > > 
> > 
> 
> > > > > affect profile accuracy. Taking loop unrolling for example:
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > #1 foo();
> > > 
> > 
> 
> > > > > #2 for (i = 0; i < N; i++) {
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > #3 bar();
> > > 
> > 
> 
> > > > > #4 }
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > If N is 8 during runtime, a reasonable profile will look
> > > > > like:
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > #1: 10
> > > 
> > 
> 
> > > > > #3: 80
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > But if the compiler unrolls the loop by a factor of 4, the
> > > > > callsite
> > > 
> > 
> 
> > > > > to bar() is duplicated 4 times and the profile will look
> > > > > like:
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > #1: 10
> > > 
> > 
> 
> > > > > #3: 20
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > The sample count for #3 is 20 because all 4 callsites to
> > > > > bar()
> > > > > are
> > > 
> > 
> 
> > > > > sampled 20 times each, and they shared the same debug loc
> > > > > (#3)
> > > > > so
> > > 
> > 
> 
> > > > > that 20 will be attributed to #3 (If one debugloc is mapped
> > > > > by
> > > 
> > 
> 
> > > > > multiple instructions, the max sample count of these
> > > > > instructions
> > > > > is
> > > 
> > 
> 
> > > > > used as debugloc's sample count).
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > When loading this profile into compiler, it will think the
> > > > > loop
> > > > > trip
> > > 
> > 
> 
> > > > > count is 2 instead of 8.
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Proposal:
> > > 
> > 
> 
> > > > > When compiler duplicates code, it encodes the duplication
> > > > > info
> > > > > in
> > > > > the
> > > 
> > 
> 
> > > > > debug info. As the duplication is not interesting to
> > > > > debugger,
> > > > > I
> > > 
> > 
> 
> > > > > propose to encode this as part of the discriminator.
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > There are 2 types of code duplication:
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > 1. duplicated code are guaranteed to have the same execution
> > > > > count
> > > 
> > 
> 
> > > > > (e.g. loop unroll and loop vectorize). We can record the
> > > > > duplication
> > > 
> > 
> 
> > > > > factor, for the above example "4" is recorded in the
> > > > > discriminator.
> > > 
> > 
> 
> > > > > 2. duplicated code that may have different execution count
> > > > > (e.g.
> > > > > loop
> > > 
> > 
> 
> > > > > peel and gvn). For a same debugloc, a unique number is
> > > > > assigned
> > > > > to
> > > 
> > 
> 
> > > > > each copy and encoded in the discriminator.
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Assume that the discriminator is uint32. The traditional
> > > 
> > 
> 
> > > > > discriminator is less than 256, let's take 8 bit for it. For
> > > 
> > 
> 
> > > > > duplication factor (type 1 duplication), we assume the
> > > > > maximum
> > > 
> > 
> 
> > > > > unroll_factor * vectorize_factor is less than 256, thus 8 bit
> > > > > for
> > > 
> > 
> 
> > > > > it. For unique number(type 2 duplication), we assume code is
> > > > > at
> > > > > most
> > > 
> > 
> 
> > > > > duplicated 32 times, thus 5 bit for it. Overall, we still
> > > > > have
> > > > > 11
> > > 
> > 
> 
> > > > > free bits left in the discriminator encoding.
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Let's take the original source as an example, after loop
> > > > > unrolling
> > > 
> > 
> 
> > > > > and peeling, the code may looks like:
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > for (i = 0; i < N & 3; i+= 4) {
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x40
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x40
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x40
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x40
> > > 
> > 
> 
> > > > > }
> > > 
> > 
> 
> > > > > if (i++ < N) {
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x100
> > > 
> > 
> 
> > > > > if (i++ < N) {
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x200
> > > 
> > 
> 
> > > > > if (i++ < N) {
> > > 
> > 
> 
> > > > > foo(); // discriminator: 0x300
> > > 
> > 
> 
> > > > > }
> > > 
> > 
> 
> > > > > }
> > > 
> > 
> 
> > > > > }
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > The cost of this change would be increased debug_line size
> > > > > because:
> > > 
> > 
> 
> > > > > 1. we are recording more discriminators 2. discriminators
> > > > > become
> > > 
> > 
> 
> > > > > larger and will take more ULEB128 encoding.
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > The benefit is that the sample pgo profile can accurately
> > > > > represent
> > > 
> > 
> 
> > > > > the code execution frequency. And we do not need to introduce
> > > > > new
> > > 
> > 
> 
> > > > > building blocks to debug info.
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Comments?
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 
> > > > > Thanks,
> > > 
> > 
> 
> > > > > Dehao
> > > 
> > 
> 

> > > > > _______________________________________________
> > > 
> > 
> 
> > > > > LLVM Developers mailing list
> > > 
> > 
> 
> > > > > llvm-dev at lists.llvm.org
> > > 
> > 
> 
> > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > > 
> > 
> 
> > > > >
> > > 
> > 
> 

> > > > --
> > > 
> > 
> 
> > > > Hal Finkel
> > > 
> > 
> 
> > > > Lead, Compiler Technology and Programming Languages
> > > 
> > 
> 
> > > > Leadership Computing Facility
> > > 
> > 
> 
> > > > Argonne National Laboratory
> > > 
> > 
> 

> > --
> 

> > Hal Finkel
> 
> > Lead, Compiler Technology and Programming Languages
> 
> > Leadership Computing Facility
> 
> > Argonne National Laboratory
> 

-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/340b759a/attachment.html>