[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Dehao Chen via llvm-dev llvm-dev at lists.llvm.org
Tue Nov 1 18:24:30 PDT 2016


On Tue, Nov 1, 2016 at 5:56 PM, Hal Finkel <hfinkel at anl.gov> wrote:

>
> ------------------------------
>
> *From: *"Dehao Chen" <dehao at google.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Xinliang David Li" <
> davidxl at google.com>
> *Sent: *Tuesday, November 1, 2016 6:41:29 PM
>
> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
>
>
> On Tue, Nov 1, 2016 at 2:36 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>>
>> ------------------------------
>>
>> *From: *"Hal Finkel via llvm-dev" <llvm-dev at lists.llvm.org>
>> *To: *"Dehao Chen" <dehao at google.com>
>> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Xinliang David Li" <
>> davidxl at google.com>
>> *Sent: *Tuesday, November 1, 2016 4:26:17 PM
>> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
>> discriminator
>>
>>
>> ------------------------------
>>
>> *From: *"Dehao Chen" <dehao at google.com>
>> *To: *"Hal Finkel" <hfinkel at anl.gov>
>> *Cc: *"Paul Robinson" <paul.robinson at sony.com>, "Xinliang David Li" <
>> davidxl at google.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
>> *Sent: *Tuesday, November 1, 2016 4:14:43 PM
>> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
>> discriminator
>>
>> damn... my english is not readable at all when I try to write fast...
>> trying to make some clarification below, hopefully can make it more
>> readable...
>>
>> On Tue, Nov 1, 2016 at 2:07 PM, Dehao Chen <dehao at google.com> wrote:
>>
>>> Oops... pressed the wrong button and sent out early...
>>>
>>> On Tue, Nov 1, 2016 at 2:01 PM, Dehao Chen <dehao at google.com> wrote:
>>>
>>>> If Hal's proposal is for SamplePGO purpose, let me clarify some design
>>>> principles of SamplePGO.
>>>>
>>>> The profile for sample pgo uses source location as the key to map the
>>>> execution count back to IR. This design is based on the principle that we
>>>> do not want the profile to be tightly couple with compiler IR. Instead,
>>>> profile is simple an attribute of the source code. We have been benefited
>>>>  a lot from this design that the profile can easily be reused across
>>>> different source versions and compiler versions, or even compilers.
>>>>
>>>> That being said, the design to encode more info into discriminator does
>>>> not mean that we will change the profile. The encoded info in discriminator
>>>> will be handled by the create_llvm_prof tool, which combines counts from
>>>> different clones of the same source code and generate the combined profile
>>>> data. The output profile will not have any cloning/dupliaction bits at all.
>>>> So for the initial example profile I provided, the output profile will be:
>>>>
>>>
>>> #1: 10
>>> #3: 80
>>>
>>> Not:
>>>
>>> #1: 10
>>> #3.0x400: 70
>>> #3.0x10400: 5
>>> #3.0x20400: 3
>>> #3.0x30400: 2
>>>
>>
>> Also, how does this work for vectorization? For vectorization, you're
>> going to multiply by the duplication factor first? The issue obviously is
>> that each vector instruction counts for VF times as many scalar
>> instructions, and so to get equivalent scalar counts, you need to multiply
>> by VF. I had assumed this is what you were saying when I read the initial
>> e-mail, but if you're also using the same duplication scheme for unrolling,
>> then I think we need some way to differentiate.
>>
>
> In my original proposal, I only record VF*UF if the clone is both unrolled
> and vectorized. I did not distinguish between unroll and vectorization
> because I the unroll/vectorize decision only depends on the trip count,
> which can be derived from the duplication factor. If more profile info can
> be used for better unroll/vectorize decision making, we can definitely add
> it.
>
> I'm still missing something here. In your proposed system, does the tool
> collecting the profiling information *interpret* the duplication factor
> encoded in the descriminators at all?
>

Yes it does.


> If it does, it seems like this should be specific to vectorization. For
> unrolling, you still have the same number of instructions, and so you just
> need to make the different instructions from the different loop iterations
> carry different discriminator values (so they they'll sum together, instead
> of using the maximum, as you explained in your original proposal).
>

Let's take a look at the original loop unroll example. My proposed
discriminator assignment is:
for (i = 0; i < N & 3; i+= 4) {
  foo();  // discriminator: 0x400
  foo();  // discriminator: 0x400
  foo();  // discriminator: 0x400
  foo();  // discriminator: 0x400
}
if (i++ < N) {
  foo();   // discriminator: 0x10000
  if (i++ < N) {
    foo(); // discriminator: 0x20000
    if (i++ < N) {
      foo();  // discriminator: 0x30000
    }
  }
}

IIRC, your proposed discriminator assignment is:

for (i = 0; i < N & 3; i+= 4) {
  foo();  // discriminator: 0x10000
  foo();  // discriminator: 0x20000
  foo();  // discriminator: 0x30000
  foo();  // discriminator: 0x40000
}
if (i++ < N) {
  foo();   // discriminator: 0x50000
  if (i++ < N) {
    foo(); // discriminator: 0x60000
    if (i++ < N) {
      foo();  // discriminator: 0x70000
    }
  }
}

I think my method can save discriminator space because all clones share the
same discriminator. Consider that if a loop is unrolled 100 times, then
with your algorithm, we may end up having 100 different discriminators,
which seem not good for debug info size.

>
>
>
>>
>>
>>  -Hal
>>
>>
>>> The goal of the proposed change, is to make profile more accurately
>>> represent the attribute of the source.
>>> The non-goal of the proposed change, is to provide more context in the
>>> profile to present the behavior of program in the context of different
>>> context.
>>>
>>
>> The non-goal of the proposed change, is to provide more context in the
>> profile to present the behavior of program in different contexts.'
>>
>> Okay, but why? The information will be present in the profiles, why not
>> encode it in the metadata and let the optimizer decide when to sum it up?
>> We should even provide some utility functions that make this transparent
>> for passes that don't care about the distinction.
>>
>> I see, you are proposing encoding these profile in the metadata. I agree
> that this can work, but from implementation's perspective, it could could
> lead to great complexity.
>
> Sample PGO pass first read in the source->count map, use it to get
> basic_block->count map, and then use heuristics to propagate and get
> branch->probability count. So we don't have Metadata to store the raw
> count, but only the branch probability is stored after sample pgo pass.
>
> I think that, as you indicate below, we'd need to change this to {source,
> disc}->count map --> {BB, disc}->count map --> {branch, disc}->prob map.
> The idea being that the total probability for the branch is the sum of the
> probabilities associated with it for each discriminator value. Most users
> will just sum them up, but some passes will want the breakdowns.
>

Agree. Note that this would add some complexity to BFI/BPI

>
>
>>
>>>
>>> If we are pursuing more context-sensitive profile, that would be a very
>>> different design. In this design, we will need to read profile multiple
>>> times, and have the profile tightly coupled with the compiler IR/pass
>>> manager. That is doable, but I don't think that probably better suits
>>> instrumentation based PGO's domain. Comments?
>>>
>>
>> If we are pursuing more context-sensitive profile, that would be a very
>> different design in which we need to read in profiles multiple times, and
>> have the profile tightly coupled with the compiler IR/pass manager. That is
>> doable, but that probably better suits instrumentation based PGO's domain.
>> Comments?
>>
>> I don't understand why you say we'd need to read the profile multiple
>> times. Can you please explain? I also don't think it needs to be that
>> tightly coupled; we just need each pass that generates multiple copies of
>> things to have some way to generate a unique id for what it's doing. It's
>> all per source location, so I don't even think we need any kind of
>> complicated hashing scheme.
>>
>> As explained above, we only have branch probability. Yes, we can record
> more instruction/basicblock count context info in additional metadata. But
> when you use it in optimizer, I suppose you will need to convert it to
> branch probability too?
>
> Let me take an example to demonstrate my understanding of your use of
> context-sensitive profile.
>
> original code:
> for ()
>   stmt1;
>
> optimized code:
> if (cond1)
>   for()
>     stmt1;  //100 samples
> else if (cond2)
>   for()
>     stmt1;  // 0 samples
> else
>   for()
>     stmt1; // 50 samples
>
> The loop was multi-versioned by cond1 and cond2. With my proposal, the
> profile for stmt1 will be combined as 150 samples. If we use this profile
> to optimize the orginal source, after annotation, the code becomes:
>
> for()
>   stmt1 // 150 samples
>
> Later when it comes to the mulit-version optimization, it will transform
> the code to:
>
> if (cond1)
>   for()
>     stmt1;  //50 samples
> else if (cond2)
>   for()
>     stmt1;  // 50 samples
> else
>   for()
>     stmt1; // 50 samples
>
> The 150 samples are evenly distributed to 3 clones as we do not have
> context info.
>
> With your proposal, after annotation, the original code becomes:
>
> for()
>   stmt1 // 150 samples (M1:100, M2:0, M3:50)
>
> Then during mutli-version optimization, the optimizer reads the metadata
> and find M2 is never taken, so it will transform the code to:
>
> if (cond1)
>   for()
>     stmt1;  // 100 samples
> else
>   for()
>     stmt1; // 50 samples
>
> There are two major parts that can benefit from the extra metadata
> recorded for each copy.
>
> 1. It can demonstrate that cond2 is always false and there is no need to
> create that version.
> 2. It can correctly attribute samples to cloned copy (i.e. 100, 50
> respectively in this case)
>
> I think the most important benefit is #1. For #2, my guess is scaled count
> is already good enough.
>
> By what are you scaling the count?
>

It's random. Or most likely, each branch will be 50% taken.


>
>
> Here comes the problem: cond1 and cond2 are both generated by compiler.
> Then how do we map the metadata back to them? Using uid or special encoding
> seems fragile because when upstream optimization changes, compiler may
> choose to put cond2 in front of cond1 so the uid will change accordingly.
> How would we handle cases like this?
>
> Exactly for this reason, I don't think we can rely on the ordering to
> generate the identifiers; they need to have semantic meaning. I agree that
> this is tricky in general. However, for the cases I have in mind
> (unrolling, vectorization, etc.) these passes don't generate arbitrary
> conditionals, but rather, specific conditionals related to runtime overlap
> checks, trip-count thresholds, and I think that it should be easy to encode
> a discriminator value which can be unambiguously inverted.
>

I'm still not quite clear how to do this. Needs to think more thoroughly.
But one thing I want to note is that, we want to encode the context info
only when it will be useful for the optimization. For vectorization/unroll,
I think accurate trip count is already good enough. So do we want to record
its context?

Thanks,
Dehao


>
> Thanks again,
> Hal
>
>
> Dehao
>
>
>> Thanks again,
>> Hal
>>
>>
>>> Thanks,
>>> Dehao
>>>
>>>
>>>>
>>>> On Tue, Nov 1, 2016 at 1:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> *From: *"Paul Robinson" <paul.robinson at sony.com>
>>>>> *To: *"Dehao Chen" <dehao at google.com>, "Hal Finkel" <hfinkel at anl.gov>
>>>>> *Cc: *"Xinliang David Li" <davidxl at google.com>,
>>>>> llvm-dev at lists.llvm.org
>>>>> *Sent: *Tuesday, November 1, 2016 2:15:38 PM
>>>>> *Subject: *RE: [llvm-dev] (RFC) Encoding code duplication factor
>>>>> in        discriminator
>>>>>
>>>>> As illustrated in the above example, it is not like "vectorization has
>>>>> a distinct bit". All different optimizations make clones of code which will
>>>>> be labeled by UIDs represented by N (e.g. 8) bits. In this way, the space
>>>>> will be capped by the number of clones all optimizations have made, instead
>>>>> of # of optimizations that has applied. And it will be capped at 2^N-1. The
>>>>> cons of using uid is that you will not know if a clone is coming from
>>>>> vectorization or unroll or loop distribution.
>>>>>
>>>>> Okay, but that kind of semantic mapping is important. How should we
>>>>> encode/recover that information? To be clear, I'm not saying that we need
>>>>> to implement that up front, but there needs to be a clear path to an
>>>>> implementation, because I don't want to have two disjoint schemes.
>>>>>
>>>>>
>>>>>
>>>>> You mean that you want to know which optimization created the clone?
>>>>> How would you use that info? Looks to me this will expose compiler
>>>>> implementation detail in debug info.
>>>>>
>>>>>
>>>>>
>>>>> This is still doable, assume we have 15 interesting optimizations to
>>>>> track, we can use 4 bits to encode the optimization type that created the
>>>>> clone. But this becomes nasty if the a clone is created by more than one
>>>>> optimizations. In that way, discriminator may not be fit for this purpose.
>>>>>
>>>>>
>>>>>
>>>>> My understanding was that the encoding scheme would allow the
>>>>> profiling analysis to correctly map execution data back to the original
>>>>> source construct, while preserving the property that each distinct basic
>>>>> block would have its own discriminator value.  That is, the execution data
>>>>> would be attributed back to the original source construct, not whatever
>>>>> each individual optimization had done to it, and the data for the original
>>>>> source construct would correctly reflect the execution (e.g. profiling says
>>>>> you got 82 hits on the original loop, rather than reporting 20 hits on the
>>>>> unrolled-by-4 loop plus 1 each on 2 of the trailing copies).
>>>>>
>>>>>
>>>>>
>>>>> It sounds like Hal is thinking that the per-discriminator execution
>>>>> info would be preserved down to the point where an individual optimization
>>>>> could look at the profile for each piece, and make decisions on that basis.
>>>>>
>>>>>
>>>>>
>>>>> I'm not clear how that would be possible, as the optimization would
>>>>> have to first do the transform (or predict how it would do the transform)
>>>>> in order to see which individual-discriminator counts mapped to which
>>>>> actual blocks, and then make some kind of decision about whether to do the
>>>>> transform differently based on that information.  Then, if the optimization
>>>>> did choose to do the transform differently, then that leaves the IR in a
>>>>> state where the individual discriminators *cannot* map back to it.  (Say
>>>>> you unroll by 2 instead of 4; then you have only 1 trailing copy, not 3,
>>>>> and a discriminator that maps to the second trailing copy now maps to
>>>>> nothing.  The individual-discriminator data becomes useless.)
>>>>>
>>>>>
>>>>>
>>>>> Am I expressing this well enough to show that what Hal is looking for
>>>>> is not feasible?
>>>>>
>>>>> Yes, it will need to predict how the transformation would affect the
>>>>> blocks produced. That does not seem problematic (at least at a coarse
>>>>> level). Yes, if transformations made earlier in the pipeline make different
>>>>> decisions, then that will invalidate later fine-grained data (at least
>>>>> potentially). I don't see how any of this makes this infeasible. We just
>>>>> need a way for the profiling counts, per descriminator, to remain
>>>>> available, and for the transformations themselves to know which
>>>>> discriminators (loop ids, or whatever) to consider.
>>>>>
>>>>>  -Hal
>>>>>
>>>>> --paulr
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Hal Finkel
>>>>> Lead, Compiler Technology and Programming Languages
>>>>> Leadership Computing Facility
>>>>> Argonne National Laboratory
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161101/b30be627/attachment-0001.html>


More information about the llvm-dev mailing list