[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Tue Nov 1 19:35:25 PDT 2016


----- Original Message -----

> From: "Dehao Chen" <dehao at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Xinliang David Li"
> <davidxl at google.com>
> Sent: Tuesday, November 1, 2016 8:24:30 PM
> Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator

> On Tue, Nov 1, 2016 at 5:56 PM, Hal Finkel < hfinkel at anl.gov > wrote:

> > > From: "Dehao Chen" < dehao at google.com >
> > 
> 
> > > To: "Hal Finkel" < hfinkel at anl.gov >
> > 
> 
> > > Cc: "llvm-dev" < llvm-dev at lists.llvm.org >, "Xinliang David Li" <
> > > davidxl at google.com >
> > 
> 
> > > Sent: Tuesday, November 1, 2016 6:41:29 PM
> > 
> 

> > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication factor in
> > > discriminator
> > 
> 

> > > On Tue, Nov 1, 2016 at 2:36 PM, Hal Finkel < hfinkel at anl.gov >
> > > wrote:
> > 
> 

> > > > > From: "Hal Finkel via llvm-dev" < llvm-dev at lists.llvm.org >
> > > > 
> > > 
> > 
> 
> > > > > To: "Dehao Chen" < dehao at google.com >
> > > > 
> > > 
> > 
> 
> > > > > Cc: "llvm-dev" < llvm-dev at lists.llvm.org >, "Xinliang David
> > > > > Li"
> > > > > <
> > > > > davidxl at google.com >
> > > > 
> > > 
> > 
> 
> > > > > Sent: Tuesday, November 1, 2016 4:26:17 PM
> > > > 
> > > 
> > 
> 
> > > > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication
> > > > > factor
> > > > > in
> > > > > discriminator
> > > > 
> > > 
> > 
> 

> > > > > > From: "Dehao Chen" < dehao at google.com >
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > To: "Hal Finkel" < hfinkel at anl.gov >
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > Cc: "Paul Robinson" < paul.robinson at sony.com >, "Xinliang
> > > > > > David
> > > > > > Li"
> > > > > > <
> > > > > > davidxl at google.com >, "llvm-dev" < llvm-dev at lists.llvm.org
> > > > > > >
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > Sent: Tuesday, November 1, 2016 4:14:43 PM
> > > > > 
> > > > 
> > > 
> > 
> 
> > > > > > Subject: Re: [llvm-dev] (RFC) Encoding code duplication
> > > > > > factor
> > > > > > in
> > > > > > discriminator
> > > > > 
> > > > 
> > > 
> > 
> 

> > > > > > damn... my English is not readable at all when I try to write
> > > > > > fast... trying to make some clarifications below, hopefully that
> > > > > > makes it more readable...

> > > > > > On Tue, Nov 1, 2016 at 2:07 PM, Dehao Chen <dehao at google.com> wrote:

> > > > > > > Oops... pressed the wrong button and sent out early...

> > > > > > > On Tue, Nov 1, 2016 at 2:01 PM, Dehao Chen <dehao at google.com> wrote:

> > > > > > > > If Hal's proposal is for SamplePGO purposes, let me clarify
> > > > > > > > some design principles of SamplePGO.

> > > > > > > > The profile for sample PGO uses source location as the key to
> > > > > > > > map the execution count back to IR. This design is based on the
> > > > > > > > principle that we do not want the profile to be tightly coupled
> > > > > > > > with the compiler IR. Instead, the profile is simply an
> > > > > > > > attribute of the source code. We have benefited a lot from this
> > > > > > > > design: the profile can easily be reused across different
> > > > > > > > source versions and compiler versions, or even across compilers.

> > > > > > > > That being said, the design to encode more info into the
> > > > > > > > discriminator does not mean that we will change the profile.
> > > > > > > > The encoded info in the discriminator will be handled by the
> > > > > > > > create_llvm_prof tool, which combines counts from different
> > > > > > > > clones of the same source code and generates the combined
> > > > > > > > profile data. The output profile will not have any
> > > > > > > > cloning/duplication bits at all. So for the initial example
> > > > > > > > profile I provided, the output profile will be:

> > > > > > > #1: 10
> > > > > > > #3: 80

> > > > > > > Not:

> > > > > > > #1: 10
> > > > > > > #3.0x400: 70
> > > > > > > #3.0x10400: 5
> > > > > > > #3.0x20400: 3
> > > > > > > #3.0x30400: 2
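
For concreteness, here is a rough sketch of the kind of merging create_llvm_prof would do for that example. It is purely illustrative: it assumes the cloning/duplication bits sit above the low discriminator bits and that the per-clone counts can simply be summed, which matches 70 + 5 + 3 + 2 = 80 above.

  #include <cstdint>
  #include <map>
  #include <utility>

  // Hypothetical helper: strip the cloning/duplication bits, keeping only the
  // low bits that hold the original discriminator.  The 0xff mask is an
  // assumption for illustration, not the proposal's exact layout.
  static unsigned stripCloneBits(unsigned Discriminator) {
    return Discriminator & 0xffu;
  }

  // Combine raw {line, discriminator} -> samples entries into the emitted
  // {line, base discriminator} -> samples profile by summing across clones.
  std::map<std::pair<unsigned, unsigned>, uint64_t>
  combineClones(const std::map<std::pair<unsigned, unsigned>, uint64_t> &Raw) {
    std::map<std::pair<unsigned, unsigned>, uint64_t> Combined;
    for (const auto &E : Raw)
      Combined[{E.first.first, stripCloneBits(E.first.second)}] += E.second;
    return Combined;
  }

Fed the breakdown above, this would emit #1: 10 and #3: 80, with no cloning bits left in the output.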

> > > > Also, how does this work for vectorization? For vectorization, you're
> > > > going to multiply by the duplication factor first? The issue obviously
> > > > is that each vector instruction counts for VF times as many scalar
> > > > instructions, and so to get equivalent scalar counts, you need to
> > > > multiply by VF. I had assumed this is what you were saying when I read
> > > > the initial e-mail, but if you're also using the same duplication
> > > > scheme for unrolling, then I think we need some way to differentiate.

> > > In my original proposal, I only record VF*UF if the clone is both
> > > unrolled and vectorized. I did not distinguish between unroll and
> > > vectorization because the unroll/vectorize decision only depends on the
> > > trip count, which can be derived from the duplication factor. If more
> > > profile info can be used for better unroll/vectorize decision making, we
> > > can definitely add it.
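
As a concrete (and purely illustrative) reading of "record VF*UF": if a loop is vectorized by VF=4 and unrolled by UF=2, the recorded duplication factor would be 8, and a consumer that wants equivalent scalar iteration counts would scale the sampled count by that factor:

  #include <cstdint>

  // Hypothetical helper: given the sampled count of a cloned loop body and
  // the duplication factor (VF * UF) recorded in its discriminator, estimate
  // how many original scalar iterations it stands for.  Just the arithmetic
  // discussed above, not an API from the proposal.
  uint64_t equivalentScalarCount(uint64_t SampledCount, unsigned VF, unsigned UF) {
    return SampledCount * static_cast<uint64_t>(VF) * UF;
  }

  // e.g. equivalentScalarCount(100, 4, 2) == 800
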
> > I'm still missing something here. In your proposed system, does the tool
> > collecting the profiling information *interpret* the duplication factor
> > encoded in the discriminators at all?

> Yes, it does.
Okay. I'm still missing something. Can you please write out an example using your notation below for a vectorized loop? 

> > If it does, it seems like this should be specific to vectorization. For
> > unrolling, you still have the same number of instructions, and so you just
> > need to make the different instructions from the different loop iterations
> > carry different discriminator values (so that they'll sum together,
> > instead of using the maximum, as you explained in your original proposal).
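
To spell out the sum-versus-max distinction with a toy helper (numbers and function names invented; the max behavior for copies sharing a location is as described in the original proposal):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Toy illustration of the two merging behaviors discussed above; assumes a
  // non-empty vector of per-copy sample counts for one source location.
  uint64_t mergeSameDiscriminator(const std::vector<uint64_t> &CopySamples) {
    // All copies share one {line, discriminator}: keep the maximum, which
    // then has to be scaled by the duplication factor to recover the total.
    return *std::max_element(CopySamples.begin(), CopySamples.end());
  }

  uint64_t mergeDistinctDiscriminators(const std::vector<uint64_t> &CopySamples) {
    // Each copy carries its own discriminator: the per-copy counts just add.
    uint64_t Sum = 0;
    for (uint64_t S : CopySamples)
      Sum += S;
    return Sum;
  }
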
> Let's take a look at the original loop unroll example. My proposed
> discriminator assignment is:

> for (i = 0; i < (N & ~3); i += 4) {
>   foo(); // discriminator: 0x400
>   foo(); // discriminator: 0x400
>   foo(); // discriminator: 0x400
>   foo(); // discriminator: 0x400
> }
> if (i++ < N) {
>   foo(); // discriminator: 0x10000
>   if (i++ < N) {
>     foo(); // discriminator: 0x20000
>     if (i++ < N) {
>       foo(); // discriminator: 0x30000
>     }
>   }
> }

> IIRC, your proposed discriminator assignment is:
To be clear, I don't think that I've proposed any assignment scheme at all. Regardless, I was asking this question purely in the context of your proposal. 

Thanks again, 
Hal 

> for (i = 0; i < (N & ~3); i += 4) {
>   foo(); // discriminator: 0x10000
>   foo(); // discriminator: 0x20000
>   foo(); // discriminator: 0x30000
>   foo(); // discriminator: 0x40000
> }
> if (i++ < N) {
>   foo(); // discriminator: 0x50000
>   if (i++ < N) {
>     foo(); // discriminator: 0x60000
>     if (i++ < N) {
>       foo(); // discriminator: 0x70000
>     }
>   }
> }

> I think my method can save discriminator space because all clones share the
> same discriminator. Consider that if a loop is unrolled 100 times, then with
> your algorithm we may end up having 100 different discriminators, which does
> not seem good for debug info size.
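
For what it's worth, the constants in the examples are consistent with a packing along these lines; the exact layout here is my assumption for illustration (low bits for the original discriminator, one byte for the duplication factor, higher bits for a clone id), not something this thread fixes:

  #include <cstdint>

  // Hypothetical packing, inferred only from the constants used in the
  // examples (0x400 = duplication factor 4, 0x10000/0x20000/... = clone ids
  // 1/2/3); the real encoding is whatever the RFC finally settles on.
  constexpr unsigned kBaseBits   = 8; // original discriminator
  constexpr unsigned kFactorBits = 8; // duplication factor (e.g. VF * UF)

  constexpr uint32_t encode(uint32_t Base, uint32_t DupFactor, uint32_t CloneId) {
    return Base | (DupFactor << kBaseBits) | (CloneId << (kBaseBits + kFactorBits));
  }

  constexpr uint32_t dupFactor(uint32_t Disc) {
    return (Disc >> kBaseBits) & ((1u << kFactorBits) - 1);
  }

  constexpr uint32_t cloneId(uint32_t Disc) {
    return Disc >> (kBaseBits + kFactorBits);
  }

  // Matches the constants above: 0x400, 0x10000, 0x10400, ...
  static_assert(encode(0, 4, 0) == 0x400 && encode(0, 4, 1) == 0x10400,
                "layout matches the example constants");
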
> > > > -Hal

> > > > > > > The goal of the proposed change is to make the profile more
> > > > > > > accurately represent the attributes of the source.

> > > > > > > The non-goal of the proposed change, is to provide more context
> > > > > > > in the profile to present the behavior of program in the context
> > > > > > > of different context.

> > > > > > The non-goal of the proposed change is to provide more context in
> > > > > > the profile to represent the behavior of the program in different
> > > > > > contexts.

> > > > > Okay, but why? The information will be present in the profiles; why
> > > > > not encode it in the metadata and let the optimizer decide when to
> > > > > sum it up? We should even provide some utility functions that make
> > > > > this transparent for passes that don't care about the distinction.

> > > I see, you are proposing encoding these profiles in the metadata. I
> > > agree that this can work, but from an implementation perspective it
> > > could lead to great complexity.

> > > The sample PGO pass first reads in the source->count map, uses it to get
> > > a basic_block->count map, and then uses heuristics to propagate and get
> > > branch->probability. So we don't have metadata to store the raw counts;
> > > only the branch probabilities are stored after the sample PGO pass.

> > I think that, as you indicate below, we'd need to change this to a
> > {source, disc}->count map --> {BB, disc}->count map --> {branch, disc}->prob
> > map. The idea is that the total probability for the branch is the sum of
> > the probabilities associated with it for each discriminator value. Most
> > users will just sum them up, but some passes will want the breakdowns.

> Agree. Note that this would add some complexity to BFI/BPI.
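
To make the shape of that change concrete, here is a rough sketch of the keyed maps being described; the type and function names are mine, chosen purely for illustration, and are not LLVM's actual types:

  #include <cstdint>
  #include <map>
  #include <utility>

  // Illustrative types only.  A "location" is (line, column); a block or
  // branch is identified here by an opaque pointer.
  using Loc  = std::pair<unsigned, unsigned>;
  using Disc = unsigned;

  using SourceCounts = std::map<std::pair<Loc, Disc>, uint64_t>;          // {source, disc} -> count
  using BlockCounts  = std::map<std::pair<const void *, Disc>, uint64_t>; // {BB, disc} -> count
  using BranchProbs  = std::map<std::pair<const void *, Disc>, double>;   // {branch, disc} -> prob

  // Most passes just want the total probability of a branch: the sum over
  // all discriminator values, as described above; passes that care can look
  // at the per-discriminator breakdown instead.
  double totalProbability(const BranchProbs &Probs, const void *Branch) {
    double Total = 0;
    for (const auto &E : Probs)
      if (E.first.first == Branch)
        Total += E.second;
    return Total;
  }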

> > > > > > > If we are pursuing more context-sensitive profile, that would be
> > > > > > > a very different design. In this design, we will need to read
> > > > > > > profile multiple times, and have the profile tightly coupled with
> > > > > > > the compiler IR/pass manager. That is doable, but I don't think
> > > > > > > that probably better suits instrumentation based PGO's domain.
> > > > > > > Comments?

> > > > > > If we are pursuing a more context-sensitive profile, that would be
> > > > > > a very different design, in which we need to read in profiles
> > > > > > multiple times and have the profile tightly coupled with the
> > > > > > compiler IR/pass manager. That is doable, but it probably better
> > > > > > suits instrumentation-based PGO's domain. Comments?

> > > > > I don't understand why you say we'd need to read the profile
> > > > > multiple times. Can you please explain? I also don't think it needs
> > > > > to be that tightly coupled; we just need each pass that generates
> > > > > multiple copies of things to have some way to generate a unique id
> > > > > for what it's doing. It's all per source location, so I don't even
> > > > > think we need any kind of complicated hashing scheme.

> > > As explained above, we only have branch probabilities. Yes, we can
> > > record more instruction/basic-block count context info in additional
> > > metadata. But when you use it in the optimizer, I suppose you will need
> > > to convert it to branch probabilities too?

> > > Let me take an example to demonstrate my understanding of your use of a
> > > context-sensitive profile.

> > > original code:

> > > for ()
> > >   stmt1;

> > > optimized code:

> > > if (cond1)
> > >   for ()
> > >     stmt1; // 100 samples
> > > else if (cond2)
> > >   for ()
> > >     stmt1; // 0 samples
> > > else
> > >   for ()
> > >     stmt1; // 50 samples

> > > The loop was multi-versioned by cond1 and cond2. With my proposal, the
> > > profile for stmt1 will be combined as 150 samples. If we use this
> > > profile to optimize the original source, after annotation the code
> > > becomes:

> > > for ()
> > >   stmt1; // 150 samples

> > > Later, when it comes to the multi-version optimization, it will
> > > transform the code to:

> > > if (cond1)
> > >   for ()
> > >     stmt1; // 50 samples
> > > else if (cond2)
> > >   for ()
> > >     stmt1; // 50 samples
> > > else
> > >   for ()
> > >     stmt1; // 50 samples

> > > The 150 samples are evenly distributed to the 3 clones, as we do not
> > > have context info.

> > > With your proposal, after annotation, the original code becomes:

> > > for ()
> > >   stmt1; // 150 samples (M1:100, M2:0, M3:50)

> > > Then, during the multi-version optimization, the optimizer reads the
> > > metadata, finds that M2 is never taken, and so transforms the code to:

> > > if (cond1)
> > >   for ()
> > >     stmt1; // 100 samples
> > > else
> > >   for ()
> > >     stmt1; // 50 samples

> > > There are two major parts that can benefit from the extra metadata
> > > recorded for each copy:

> > > 1. It can demonstrate that cond2 is always false, so there is no need to
> > > create that version.
> > > 2. It can correctly attribute samples to each cloned copy (i.e. 100 and
> > > 50, respectively, in this case).

> > > I think the most important benefit is #1. For #2, my guess is that the
> > > scaled count is already good enough.

> > By what are you scaling the count?

> It's random. Or, most likely, each branch will be 50% taken.

> > > Here comes the problem: cond1 and cond2 are both generated by the
> > > compiler. Then how do we map the metadata back to them? Using a uid or a
> > > special encoding seems fragile, because when an upstream optimization
> > > changes, the compiler may choose to put cond2 in front of cond1, so the
> > > uid will change accordingly. How would we handle cases like this?

> > Exactly for this reason, I don't think we can rely on the ordering to
> > generate the identifiers; they need to have semantic meaning. I agree that
> > this is tricky in general. However, for the cases I have in mind
> > (unrolling, vectorization, etc.), these passes don't generate arbitrary
> > conditionals but, rather, specific conditionals related to runtime overlap
> > checks and trip-count thresholds, and I think that it should be easy to
> > encode a discriminator value which can be unambiguously inverted.
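
As one possible illustration of identifiers with semantic meaning (my sketch; neither the RFC nor this thread specifies such an enum), the clone id could name the kind of copy a pass produces rather than its emission order:

  // Hypothetical clone kinds with semantic meaning, so the id does not depend
  // on the order in which a pass happens to emit the versions.  Purely a
  // sketch.
  enum CloneKind : unsigned {
    CK_Original = 0,
    CK_VectorBody,        // main vectorized loop body
    CK_UnrolledBody,      // main unrolled loop body
    CK_Remainder,         // scalar remainder / epilogue iterations
    CK_RuntimeCheckFailed // fallback version when overlap checks fail
  };

  // A pass would tag each clone with its kind instead of an arbitrary
  // counter, so the mapping can be inverted regardless of emission order.
  unsigned cloneIdFor(CloneKind Kind) { return static_cast<unsigned>(Kind); }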

> I'm still not quite clear how to do this; I need to think it through more
> thoroughly. But one thing I want to note is that we want to encode the
> context info only when it will be useful for the optimization. For
> vectorization/unrolling, I think an accurate trip count is already good
> enough. So do we want to record its context?

> Thanks,
> Dehao

> > Thanks again,
> > Hal

> > > Dehao

> > > > > Thanks again,
> > > > > Hal

> > > > > > > Thanks,
> > > > > > > Dehao

> > > > > > > > On Tue, Nov 1, 2016 at 1:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> > > > > > > > > > From: "Paul Robinson" <paul.robinson at sony.com>
> > > > > > > > > > To: "Dehao Chen" <dehao at google.com>, "Hal Finkel" <hfinkel at anl.gov>
> > > > > > > > > > Cc: "Xinliang David Li" <davidxl at google.com>, llvm-dev at lists.llvm.org
> > > > > > > > > > Sent: Tuesday, November 1, 2016 2:15:38 PM
> > > > > > > > > > Subject: RE: [llvm-dev] (RFC) Encoding code duplication factor in discriminator

> > > > > > > > > > As illustrated in the above example, it is not as if
> > > > > > > > > > "vectorization has a distinct bit". All the different
> > > > > > > > > > optimizations make clones of code, which will be labeled by
> > > > > > > > > > UIDs represented by N (e.g. 8) bits. In this way, the space
> > > > > > > > > > will be capped by the number of clones all optimizations
> > > > > > > > > > have made, instead of the number of optimizations that have
> > > > > > > > > > been applied, and it will be capped at 2^N - 1. The con of
> > > > > > > > > > using a uid is that you will not know whether a clone came
> > > > > > > > > > from vectorization or unrolling or loop distribution.

> > > > > > > > > > Okay, but that kind of semantic mapping is important. How
> > > > > > > > > > should we encode/recover that information? To be clear,
> > > > > > > > > > I'm not saying that we need to implement that up front, but
> > > > > > > > > > there needs to be a clear path to an implementation,
> > > > > > > > > > because I don't want to have two disjoint schemes.

> > > > > > > > > > You mean that you want to know which optimization created
> > > > > > > > > > the clone? How would you use that info? It looks to me like
> > > > > > > > > > this would expose compiler implementation details in the
> > > > > > > > > > debug info.

> > > > > > > > > > This is still doable: assuming we have 15 interesting
> > > > > > > > > > optimizations to track, we can use 4 bits to encode the
> > > > > > > > > > optimization type that created the clone. But this becomes
> > > > > > > > > > nasty if a clone is created by more than one optimization.
> > > > > > > > > > In that case, the discriminator may not be a good fit for
> > > > > > > > > > this purpose.
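
Purely to illustrate the 4-bit idea above (names and the bit position are hypothetical):

  // Illustration only: a 4-bit "creating optimization" field.  Four bits
  // distinguish up to 15 passes plus "none", but a clone produced by two
  // passes in sequence (say, unrolled and then vectorized) has no single
  // value here -- the "nasty" case noted above.
  enum CreatingPass : unsigned {
    CP_None = 0,
    CP_LoopUnroll,
    CP_LoopVectorize,
    CP_LoopDistribute
    // ... up to 15 passes in 4 bits
  };

  constexpr unsigned kPassShift = 24; // assumed free bits in the discriminator

  constexpr unsigned tagCreatingPass(unsigned Disc, CreatingPass P) {
    return Disc | (static_cast<unsigned>(P) << kPassShift);
  }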

> > > > > > > > > > My understanding was that the encoding scheme would allow
> > > > > > > > > > the profiling analysis to correctly map execution data back
> > > > > > > > > > to the original source construct, while preserving the
> > > > > > > > > > property that each distinct basic block would have its own
> > > > > > > > > > discriminator value. That is, the execution data would be
> > > > > > > > > > attributed back to the original source construct, not to
> > > > > > > > > > whatever each individual optimization had done to it, and
> > > > > > > > > > the data for the original source construct would correctly
> > > > > > > > > > reflect the execution (e.g. profiling says you got 82 hits
> > > > > > > > > > on the original loop, rather than reporting 20 hits on the
> > > > > > > > > > unrolled-by-4 loop plus 1 each on 2 of the trailing copies).

> > > > > > > > > > It sounds like Hal is thinking that the per-discriminator
> > > > > > > > > > execution info would be preserved down to the point where
> > > > > > > > > > an individual optimization could look at the profile for
> > > > > > > > > > each piece, and make decisions on that basis.

> > > > > > > > > > I'm not clear how that would be possible, as the
> > > > > > > > > > optimization would have to first do the transform (or
> > > > > > > > > > predict how it would do the transform) in order to see
> > > > > > > > > > which individual-discriminator counts mapped to which
> > > > > > > > > > actual blocks, and then make some kind of decision about
> > > > > > > > > > whether to do the transform differently based on that
> > > > > > > > > > information. Then, if the optimization did choose to do the
> > > > > > > > > > transform differently, that leaves the IR in a state where
> > > > > > > > > > the individual discriminators *cannot* map back to it. (Say
> > > > > > > > > > you unroll by 2 instead of 4; then you have only 1 trailing
> > > > > > > > > > copy, not 3, and a discriminator that maps to the second
> > > > > > > > > > trailing copy now maps to nothing. The
> > > > > > > > > > individual-discriminator data becomes useless.)

> > > > > > > > > > Am I expressing this well enough to show that what Hal is
> > > > > > > > > > looking for is not feasible?

> > > > > > > > > Yes, it will need to predict how the transformation would
> > > > > > > > > affect the blocks produced. That does not seem problematic
> > > > > > > > > (at least at a coarse level). Yes, if transformations made
> > > > > > > > > earlier in the pipeline make different decisions, then that
> > > > > > > > > will invalidate later fine-grained data (at least
> > > > > > > > > potentially). I don't see how any of this makes this
> > > > > > > > > infeasible. We just need a way for the profiling counts, per
> > > > > > > > > discriminator, to remain available, and for the
> > > > > > > > > transformations themselves to know which discriminators
> > > > > > > > > (loop ids, or whatever) to consider.

> > > > > > > > > -Hal

> > > > > > > > > > --paulr

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory