[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Tue Nov 1 17:56:15 PDT 2016

----- Original Message -----

> From: "Dehao Chen" <dehao at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Xinliang David Li"
> <davidxl at google.com>
> Sent: Tuesday, November 1, 2016 6:41:29 PM
> Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator

> On Tue, Nov 1, 2016 at 2:36 PM, Hal Finkel < hfinkel at anl.gov > wrote:

> > > From: "Hal Finkel via llvm-dev" < llvm-dev at lists.llvm.org >
> > 
> 
> > > To: "Dehao Chen" < dehao at google.com >
> > 
> 
> > > Cc: "llvm-dev" < llvm-dev at lists.llvm.org >, "Xinliang David Li" <
> > > davidxl at google.com >
> > 
> 
> > > Sent: Tuesday, November 1, 2016 4:26:17 PM
> > 
> 
> > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> > > discriminator
> > 
> 

> > > > From: "Dehao Chen" < dehao at google.com >
> > > 
> > 
> 
> > > > To: "Hal Finkel" < hfinkel at anl.gov >
> > > 
> > 
> 
> > > > Cc: "Paul Robinson" < paul.robinson at sony.com >, "Xinliang David
> > > > Li"
> > > > <
> > > > davidxl at google.com >, "llvm-dev" < llvm-dev at lists.llvm.org >
> > > 
> > 
> 
> > > > Sent: Tuesday, November 1, 2016 4:14:43 PM
> > > 
> > 
> 
> > > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor
> > > > in
> > > > discriminator
> > > 
> > 
> 

> > > > damn... my english is not readable at all when I try to write
> > > > fast...
> > > > trying to make some clarification below, hopefully can make it
> > > > more
> > > > readable...
> > > 
> > 
> 

> > > > On Tue, Nov 1, 2016 at 2:07 PM, Dehao Chen < dehao at google.com >
> > > > wrote:
> > > 
> > 
> 

> > > > > Oops... pressed the wrong button and sent out early...
> > > > 
> > > 
> > 
> 

> > > > > On Tue, Nov 1, 2016 at 2:01 PM, Dehao Chen < dehao at google.com
> > > > > >
> > > > > wrote:
> > > > 
> > > 
> > 
> 

> > > > > > If Hal's proposal is for SamplePGO purpose, let me clarify
> > > > > > some
> > > > > > design principles of SamplePGO.
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > The profile for sample pgo uses source location as the key
> > > > > > to
> > > > > > map
> > > > > > the
> > > > > > execution count back to IR. This design is based on the
> > > > > > principle
> > > > > > that we do not want the profile to be tightly couple with
> > > > > > compiler
> > > > > > IR. Instead, profile is simple an attribute of the source
> > > > > > code.
> > > > > > We
> > > > > > have been benefited a lot from this design that the profile
> > > > > > can
> > > > > > easily be reused across different source versions and
> > > > > > compiler
> > > > > > versions, or even compilers.
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > That being said, the design to encode more info into
> > > > > > discriminator
> > > > > > does not mean that we will change the profile. The encoded
> > > > > > info
> > > > > > in
> > > > > > discriminator will be handled by the create_llvm_prof tool,
> > > > > > which
> > > > > > combines counts from different clones of the same source
> > > > > > code
> > > > > > and
> > > > > > generate the combined profile data. The output profile will
> > > > > > not
> > > > > > have
> > > > > > any cloning/dupliaction bits at all. So for the initial
> > > > > > example
> > > > > > profile I provided, the output profile will be:
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > #1: 10
> > > > 
> > > 
> > 
> 
> > > > > #3: 80
> > > > 
> > > 
> > 
> 

> > > > > Not:
> > > > 
> > > 
> > 
> 

> > > > > #1: 10
> > > > 
> > > 
> > 
> 
> > > > > #3.0x400: 70
> > > > 
> > > 
> > 
> 

> > > > > #3.0x10400: 5
> > > > 
> > > 
> > 
> 
> > > > > #3.0x20400: 3
> > > > 
> > > 
> > 
> 
> > > > > #3.0x30400: 2
> > > > 
> > > 
> > 
> 

> > Also, how does this work for vectorization? For vectorization,
> > you're
> > going to multiply by the duplication factor first? The issue
> > obviously is that each vector instruction counts for VF times as
> > many scalar instructions, and so to get equivalent scalar counts,
> > you need to multiply by VF. I had assumed this is what you were
> > saying when I read the initial e-mail, but if you're also using the
> > same duplication scheme for unrolling, then I think we need some
> > way
> > to differentiate.
> 
> In my original proposal, I only record VF*UF if the clone is both
> unrolled and vectorized. I did not distinguish between unroll and
> vectorization because I the unroll/vectorize decision only depends
> on the trip count, which can be derived from the duplication factor.
> If more profile info can be used for better unroll/vectorize
> decision making, we can definitely add it.
I'm still missing something here. In your proposed system, does the tool collecting the profiling information *interpret* the duplication factor encoded in the descriminators at all? If it does, it seems like this should be specific to vectorization. For unrolling, you still have the same number of instructions, and so you just need to make the different instructions from the different loop iterations carry different discriminator values (so they they'll sum together, instead of using the maximum, as you explained in your original proposal). 

> > -Hal
> 

> > > > > The goal of the proposed change, is to make profile more
> > > > > accurately
> > > > > represent the attribute of the source.
> > > > 
> > > 
> > 
> 
> > > > > The non-goal of the proposed change, is to provide more
> > > > > context
> > > > > in
> > > > > the profile to present the behavior of program in the context
> > > > > of
> > > > > different context.
> > > > 
> > > 
> > 
> 
> > > > The non-goal of the proposed change, is to provide more context
> > > > in
> > > > the profile to present the behavior of program in different
> > > > contexts.'
> > > 
> > 
> 
> > > Okay, but why? The information will be present in the profiles,
> > > why
> > > not encode it in the metadata and let the optimizer decide when
> > > to
> > > sum it up? We should even provide some utility functions that
> > > make
> > > this transparent for passes that don't care about the
> > > distinction.
> > 
> 

> I see, you are proposing encoding these profile in the metadata. I
> agree that this can work, but from implementation's perspective, it
> could could lead to great complexity.

> Sample PGO pass first read in the source->count map, use it to get
> basic_block->count map, and then use heuristics to propagate and get
> branch->probability count. So we don't have Metadata to store the
> raw count, but only the branch probability is stored after sample
> pgo pass.

I think that, as you indicate below, we'd need to change this to {source, disc}->count map --> {BB, disc}->count map --> {branch, disc}->prob map. The idea being that the total probability for the branch is the sum of the probabilities associated with it for each discriminator value. Most users will just sum them up, but some passes will want the breakdowns. 

> > > > > If we are pursuing more context-sensitive profile, that would
> > > > > be
> > > > > a
> > > > > very different design. In this design, we will need to read
> > > > > profile
> > > > > multiple times, and have the profile tightly coupled with the
> > > > > compiler IR/pass manager. That is doable, but I don't think
> > > > > that
> > > > > probably better suits instrumentation based PGO's domain.
> > > > > Comments?
> > > > 
> > > 
> > 
> 
> > > > If we are pursuing more context-sensitive profile, that would
> > > > be
> > > > a
> > > > very different design in which we need to read in profiles
> > > > multiple
> > > > times, and have the profile tightly coupled with the compiler
> > > > IR/pass manager. That is doable, but that probably better suits
> > > > instrumentation based PGO's domain. Comments?
> > > 
> > 
> 
> > > I don't understand why you say we'd need to read the profile
> > > multiple
> > > times. Can you please explain? I also don't think it needs to be
> > > that tightly coupled; we just need each pass that generates
> > > multiple
> > > copies of things to have some way to generate a unique id for
> > > what
> > > it's doing. It's all per source location, so I don't even think
> > > we
> > > need any kind of complicated hashing scheme.
> > 
> 

> As explained above, we only have branch probability. Yes, we can
> record more instruction/basicblock count context info in additional
> metadata. But when you use it in optimizer, I suppose you will need
> to convert it to branch probability too?

> Let me take an example to demonstrate my understanding of your use of
> context-sensitive profile.

> original code:
> for ()
> stmt1;

> optimized code:
> if (cond1)
> for()
> stmt1; //100 samples
> else if (cond2)
> for()
> stmt1; // 0 samples
> else
> for()
> stmt1; // 50 samples

> The loop was multi-versioned by cond1 and cond2. With my proposal,
> the profile for stmt1 will be combined as 150 samples. If we use
> this profile to optimize the orginal source, after annotation, the
> code becomes:

> for()
> stmt1 // 150 samples

> Later when it comes to the mulit-version optimization, it will
> transform the code to:

> if (cond1)
> for()
> stmt1; //50 samples
> else if (cond2)
> for()
> stmt1; // 50 samples
> else
> for()
> stmt1; // 50 samples

> The 150 samples are evenly distributed to 3 clones as we do not have
> context info.

> With your proposal, after annotation, the original code becomes:

> for()
> stmt1 // 150 samples (M1:100, M2:0, M3:50)

> Then during mutli-version optimization, the optimizer reads the
> metadata and find M2 is never taken, so it will transform the code
> to:

> if (cond1)
> for()
> stmt1; // 100 samples
> else

> for()
> stmt1; // 50 samples

> There are two major parts that can benefit from the extra metadata
> recorded for each copy.

> 1. It can demonstrate that cond2 is always false and there is no need
> to create that version.
> 2. It can correctly attribute samples to cloned copy (i.e. 100, 50
> respectively in this case)

> I think the most important benefit is #1. For #2, my guess is scaled
> count is already good enough.
By what are you scaling the count? 

> Here comes the problem: cond1 and cond2 are both generated by
> compiler. Then how do we map the metadata back to them? Using uid or
> special encoding seems fragile because when upstream optimization
> changes, compiler may choose to put cond2 in front of cond1 so the
> uid will change accordingly. How would we handle cases like this?
Exactly for this reason, I don't think we can rely on the ordering to generate the identifiers; they need to have semantic meaning. I agree that this is tricky in general. However, for the cases I have in mind (unrolling, vectorization, etc.) these passes don't generate arbitrary conditionals, but rather, specific conditionals related to runtime overlap checks, trip-count thresholds, and I think that it should be easy to encode a discriminator value which can be unambiguously inverted. 

Thanks again, 
Hal 

> Dehao

> > > Thanks again,
> > 
> 
> > > Hal
> > 
> 

> > > > > Thanks,
> > > > 
> > > 
> > 
> 
> > > > > Dehao
> > > > 
> > > 
> > 
> 

> > > > > > On Tue, Nov 1, 2016 at 1:04 PM, Hal Finkel <
> > > > > > hfinkel at anl.gov
> > > > > > >
> > > > > > wrote:
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > From: "Paul Robinson" < paul.robinson at sony.com >
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > To: "Dehao Chen" < dehao at google.com >, "Hal Finkel" <
> > > > > > > > hfinkel at anl.gov
> > > > > > > > >
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > Cc: "Xinliang David Li" < davidxl at google.com >,
> > > > > > > > llvm-dev at lists.llvm.org
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > Sent: Tuesday, November 1, 2016 2:15:38 PM
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > Subject: RE: [llvm-dev] (RFC) Encoding code duplication
> > > > > > > > factor
> > > > > > > > in
> > > > > > > > discriminator
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > As illustrated in the above example, it is not like
> > > > > > > > "vectorization
> > > > > > > > has a distinct bit". All different optimizations make
> > > > > > > > clones
> > > > > > > > of
> > > > > > > > code
> > > > > > > > which will be labeled by UIDs represented by N (e.g. 8)
> > > > > > > > bits.
> > > > > > > > In
> > > > > > > > this way, the space will be capped by the number of
> > > > > > > > clones
> > > > > > > > all
> > > > > > > > optimizations have made, instead of # of optimizations
> > > > > > > > that
> > > > > > > > has
> > > > > > > > applied. And it will be capped at 2^N-1. The cons of
> > > > > > > > using
> > > > > > > > uid
> > > > > > > > is
> > > > > > > > that you will not know if a clone is coming from
> > > > > > > > vectorization
> > > > > > > > or
> > > > > > > > unroll or loop distribution.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > > Okay, but that kind of semantic mapping is important.
> > > > > > > > How
> > > > > > > > should
> > > > > > > > we
> > > > > > > > encode/recover that information? To be clear, I'm not
> > > > > > > > saying
> > > > > > > > that
> > > > > > > > we
> > > > > > > > need to implement that up front, but there needs to be
> > > > > > > > a
> > > > > > > > clear
> > > > > > > > path
> > > > > > > > to an implementation, because I don't want to have two
> > > > > > > > disjoint
> > > > > > > > schemes.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > You mean that you want to know which optimization
> > > > > > > > created
> > > > > > > > the
> > > > > > > > clone?
> > > > > > > > How would you use that info? Looks to me this will
> > > > > > > > expose
> > > > > > > > compiler
> > > > > > > > implementation detail in debug info.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > This is still doable, assume we have 15 interesting
> > > > > > > > optimizations
> > > > > > > > to
> > > > > > > > track, we can use 4 bits to encode the optimization
> > > > > > > > type
> > > > > > > > that
> > > > > > > > created the clone. But this becomes nasty if the a
> > > > > > > > clone
> > > > > > > > is
> > > > > > > > created
> > > > > > > > by more than one optimizations. In that way,
> > > > > > > > discriminator
> > > > > > > > may
> > > > > > > > not
> > > > > > > > be fit for this purpose.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > My understanding was that the encoding scheme would
> > > > > > > > allow
> > > > > > > > the
> > > > > > > > profiling analysis to correctly map execution data back
> > > > > > > > to
> > > > > > > > the
> > > > > > > > original source construct, while preserving the
> > > > > > > > property
> > > > > > > > that
> > > > > > > > each
> > > > > > > > distinct basic block would have its own discriminator
> > > > > > > > value.
> > > > > > > > That
> > > > > > > > is, the execution data would be attributed back to the
> > > > > > > > original
> > > > > > > > source construct, not whatever each individual
> > > > > > > > optimization
> > > > > > > > had
> > > > > > > > done
> > > > > > > > to it, and the data for the original source construct
> > > > > > > > would
> > > > > > > > correctly reflect the execution (e.g. profiling says
> > > > > > > > you
> > > > > > > > got
> > > > > > > > 82
> > > > > > > > hits
> > > > > > > > on the original loop, rather than reporting 20 hits on
> > > > > > > > the
> > > > > > > > unrolled-by-4 loop plus 1 each on 2 of the trailing
> > > > > > > > copies).
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > It sounds like Hal is thinking that the
> > > > > > > > per-discriminator
> > > > > > > > execution
> > > > > > > > info would be preserved down to the point where an
> > > > > > > > individual
> > > > > > > > optimization could look at the profile for each piece,
> > > > > > > > and
> > > > > > > > make
> > > > > > > > decisions on that basis.
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > I'm not clear how that would be possible, as the
> > > > > > > > optimization
> > > > > > > > would
> > > > > > > > have to first do the transform (or predict how it would
> > > > > > > > do
> > > > > > > > the
> > > > > > > > transform) in order to see which
> > > > > > > > individual-discriminator
> > > > > > > > counts
> > > > > > > > mapped to which actual blocks, and then make some kind
> > > > > > > > of
> > > > > > > > decision
> > > > > > > > about whether to do the transform differently based on
> > > > > > > > that
> > > > > > > > information. Then, if the optimization did choose to do
> > > > > > > > the
> > > > > > > > transform differently, then that leaves the IR in a
> > > > > > > > state
> > > > > > > > where
> > > > > > > > the
> > > > > > > > individual discriminators *cannot* map back to it. (Say
> > > > > > > > you
> > > > > > > > unroll
> > > > > > > > by 2 instead of 4; then you have only 1 trailing copy,
> > > > > > > > not
> > > > > > > > 3,
> > > > > > > > and
> > > > > > > > a
> > > > > > > > discriminator that maps to the second trailing copy now
> > > > > > > > maps
> > > > > > > > to
> > > > > > > > nothing. The individual-discriminator data becomes
> > > > > > > > useless.)
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > Am I expressing this well enough to show that what Hal
> > > > > > > > is
> > > > > > > > looking
> > > > > > > > for
> > > > > > > > is not feasible?
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Yes, it will need to predict how the transformation would
> > > > > > > affect
> > > > > > > the
> > > > > > > blocks produced. That does not seem problematic (at least
> > > > > > > at
> > > > > > > a
> > > > > > > coarse level). Yes, if transformations made earlier in
> > > > > > > the
> > > > > > > pipeline
> > > > > > > make different decisions, then that will invalidate later
> > > > > > > fine-grained data (at least potentially). I don't see how
> > > > > > > any
> > > > > > > of
> > > > > > > this makes this infeasible. We just need a way for the
> > > > > > > profiling
> > > > > > > counts, per descriminator, to remain available, and for
> > > > > > > the
> > > > > > > transformations themselves to know which discriminators
> > > > > > > (loop
> > > > > > > ids,
> > > > > > > or whatever) to consider.
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > -Hal
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > > --paulr
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > --
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > > Hal Finkel
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Lead, Compiler Technology and Programming Languages
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Leadership Computing Facility
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > > Argonne National Laboratory
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

> > > --
> > 
> 

> > > Hal Finkel
> > 
> 
> > > Lead, Compiler Technology and Programming Languages
> > 
> 
> > > Leadership Computing Facility
> > 
> 
> > > Argonne National Laboratory
> > 
> 

> > > _______________________________________________
> > 
> 
> > > LLVM Developers mailing list
> > 
> 
> > > llvm-dev at lists.llvm.org
> > 
> 
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > 
> 

> > --
> 

> > Hal Finkel
> 
> > Lead, Compiler Technology and Programming Languages
> 
> > Leadership Computing Facility
> 
> > Argonne National Laboratory
> 

-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/2534c9e7/attachment-0001.html>