[llvm-dev] (RFC) Encoding code duplication factor in discriminator

David Blaikie via llvm-dev llvm-dev at lists.llvm.org
Mon Nov 21 15:04:14 PST 2016


(+Adrian & Paul since they're mentioned here, +Diego since he might have
opinions as he implemented the discriminator stuff originally)

my 2c (that was mentioned here & that I idly discussed with Diego a few
weeks ago in no great detail) is also that maybe a separate mode for PGO
might be what we want/need, because it's not what sanitizers or simple
backtrace tools need & we're slowly growing the size of debug info for
profiling needs that we don't need elsewhere. (might be worth doing a full
measurement here as we add new features - what's the growth of all hte
profile related features (from (including) discriminators up) and see if
it's still a reasonable thing to keep with the rest of the debug info use
cases or needs a separate flag)

But I'm not too fussed/don't feel terribly strongly, really.

On Fri, Nov 4, 2016 at 3:44 PM Dehao Chen via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Discussed with Hal, Adrain and Paul offline at the llvm dev meeting today.
>
> * trip count is not enough for vectorization, there is runtime check that
> might go false, which can be reflected in profile that we may want to
> preserve.
> * simply recording these context-profile may cause problems to
> iterative-sample-pgo. i.e. when you find a loop's vectorized version no
> executed (due to runtime check), you will choose not to vectorize (which is
> optimal). But when you collect profile from this binary (optimized with
> sample-pgo, not vectorize the loop), as the loop is not vectorized, we do
> not have the context to demonstrate "the loop should *not* vectorize" any
> more. So it will end up being vectorized again, introducing perf
> instability.
>
> To summarize, more context info may improve performance for one iteration,
> but the perf improvement may not be stable across iterations. If we aim at
> performance stability (which is one of the major goals of this RFC),
> profile should only reflect the attribute of source, not compiler
> transformations.
>
> But there are sample pgo users who does *not* care about iterative sample
> pgo performance, for them, as Hal suggested, we should invent a more
> extensible way to preserve profile context. Apparently discriminator is not
> an extensible choice. So how about we just use discriminator to store the
> attribute of the source (i.e trip count), and later design new extensible
> ways in dwarf to represent more context info?
>
> Adrian also suggested that we may need to consider have a flag or a
> separate debugging mode to indicate if we want to emit discriminator to
> prevent debug_line size increase.
>
> Dehao
>
> On Tue, Nov 1, 2016 at 9:34 PM, Dehao Chen <dehao at google.com> wrote:
>
>
>
> On Tue, Nov 1, 2016 at 7:35 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>
> ------------------------------
>
> *From: *"Dehao Chen" <dehao at google.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Xinliang David Li" <
> davidxl at google.com>
> *Sent: *Tuesday, November 1, 2016 8:24:30 PM
>
> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
>
>
> On Tue, Nov 1, 2016 at 5:56 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>
> ------------------------------
>
> *From: *"Dehao Chen" <dehao at google.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Xinliang David Li" <
> davidxl at google.com>
> *Sent: *Tuesday, November 1, 2016 6:41:29 PM
>
> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
>
>
> On Tue, Nov 1, 2016 at 2:36 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>
> ------------------------------
>
> *From: *"Hal Finkel via llvm-dev" <llvm-dev at lists.llvm.org>
> *To: *"Dehao Chen" <dehao at google.com>
> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Xinliang David Li" <
> davidxl at google.com>
> *Sent: *Tuesday, November 1, 2016 4:26:17 PM
> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
>
> ------------------------------
>
> *From: *"Dehao Chen" <dehao at google.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *"Paul Robinson" <paul.robinson at sony.com>, "Xinliang David Li" <
> davidxl at google.com>, "llvm-dev" <llvm-dev at lists.llvm.org>
> *Sent: *Tuesday, November 1, 2016 4:14:43 PM
> *Subject: *Re: [llvm-dev] (RFC) Encoding code duplication factor in
> discriminator
>
> damn... my english is not readable at all when I try to write fast...
> trying to make some clarification below, hopefully can make it more
> readable...
>
> On Tue, Nov 1, 2016 at 2:07 PM, Dehao Chen <dehao at google.com> wrote:
>
> Oops... pressed the wrong button and sent out early...
>
> On Tue, Nov 1, 2016 at 2:01 PM, Dehao Chen <dehao at google.com> wrote:
>
> If Hal's proposal is for SamplePGO purpose, let me clarify some design
> principles of SamplePGO.
>
> The profile for sample pgo uses source location as the key to map the
> execution count back to IR. This design is based on the principle that we
> do not want the profile to be tightly couple with compiler IR. Instead,
> profile is simple an attribute of the source code. We have been benefited
>  a lot from this design that the profile can easily be reused across
> different source versions and compiler versions, or even compilers.
>
> That being said, the design to encode more info into discriminator does
> not mean that we will change the profile. The encoded info in discriminator
> will be handled by the create_llvm_prof tool, which combines counts from
> different clones of the same source code and generate the combined profile
> data. The output profile will not have any cloning/dupliaction bits at all.
> So for the initial example profile I provided, the output profile will be:
>
>
> #1: 10
> #3: 80
>
> Not:
>
> #1: 10
> #3.0x400: 70
> #3.0x10400: 5
> #3.0x20400: 3
> #3.0x30400: 2
>
>
> Also, how does this work for vectorization? For vectorization, you're
> going to multiply by the duplication factor first? The issue obviously is
> that each vector instruction counts for VF times as many scalar
> instructions, and so to get equivalent scalar counts, you need to multiply
> by VF. I had assumed this is what you were saying when I read the initial
> e-mail, but if you're also using the same duplication scheme for unrolling,
> then I think we need some way to differentiate.
>
>
> In my original proposal, I only record VF*UF if the clone is both unrolled
> and vectorized. I did not distinguish between unroll and vectorization
> because I the unroll/vectorize decision only depends on the trip count,
> which can be derived from the duplication factor. If more profile info can
> be used for better unroll/vectorize decision making, we can definitely add
> it.
>
> I'm still missing something here. In your proposed system, does the tool
> collecting the profiling information *interpret* the duplication factor
> encoded in the descriminators at all?
>
>
> Yes it does.
>
> Okay. I'm still missing something. Can you please write out an example
> using your notation below for a vectorized loop?
>
>
> Sure. Original code:
>
> for (int i = 0; i < N; i++)
>   a[i] = b[i];
>
> Transformed code:
>
> for (int i = 0; i < N / 4; i++)
>   vectorized_assign_4(&a[i * 4], &b[i * 4]); // discriminator 0x400,
> sample count 20
> for (int i = N & ~3; i < N; i++)
>   a[i] = b[i];       // discriminator 0x10000, sample count 10
>
> The derived sample count for original a[i] = b[i] is: 20 * 4 + 10 = 90
>
>
>
>
>
> If it does, it seems like this should be specific to vectorization. For
> unrolling, you still have the same number of instructions, and so you just
> need to make the different instructions from the different loop iterations
> carry different discriminator values (so they they'll sum together, instead
> of using the maximum, as you explained in your original proposal).
>
>
> Let's take a look at the original loop unroll example. My proposed
> discriminator assignment is:
> for (i = 0; i < N & 3; i+= 4) {
>   foo();  // discriminator: 0x400
>   foo();  // discriminator: 0x400
>   foo();  // discriminator: 0x400
>   foo();  // discriminator: 0x400
> }
> if (i++ < N) {
>   foo();   // discriminator: 0x10000
>   if (i++ < N) {
>     foo(); // discriminator: 0x20000
>     if (i++ < N) {
>       foo();  // discriminator: 0x30000
>     }
>   }
> }
>
> IIRC, your proposed discriminator assignment is:
>
> To be clear, I don't think that I've proposed any assignment scheme at
> all. Regardless, I was asking this question purely in the context of your
> proposal.
>
>
> Oh I used this example to show my understanding of your reply "For
> unrolling, you still have the same number of instructions, and so you
> just need to make the different instructions from the different loop
> iterations carry different discriminator values (so they they'll sum
> together, instead of using the maximum, as you explained in your original
> proposal)." Please let me know if my understanding is incorrect.
>
> Thanks,
> Dehao
>
>
>
> Thanks again,
> Hal
>
>
> for (i = 0; i < N & 3; i+= 4) {
>   foo();  // discriminator: 0x10000
>   foo();  // discriminator: 0x20000
>   foo();  // discriminator: 0x30000
>   foo();  // discriminator: 0x40000
> }
> if (i++ < N) {
>   foo();   // discriminator: 0x50000
>   if (i++ < N) {
>     foo(); // discriminator: 0x60000
>     if (i++ < N) {
>       foo();  // discriminator: 0x70000
>     }
>   }
> }
>
> I think my method can save discriminator space because all clones share
> the same discriminator. Consider that if a loop is unrolled 100 times, then
> with your algorithm, we may end up having 100 different discriminators,
> which seem not good for debug info size.
>
>
>
>
>
>
>  -Hal
>
>
> The goal of the proposed change, is to make profile more accurately
> represent the attribute of the source.
> The non-goal of the proposed change, is to provide more context in the
> profile to present the behavior of program in the context of different
> context.
>
>
> The non-goal of the proposed change, is to provide more context in the
> profile to present the behavior of program in different contexts.'
>
> Okay, but why? The information will be present in the profiles, why not
> encode it in the metadata and let the optimizer decide when to sum it up?
> We should even provide some utility functions that make this transparent
> for passes that don't care about the distinction.
>
> I see, you are proposing encoding these profile in the metadata. I agree
> that this can work, but from implementation's perspective, it could could
> lead to great complexity.
>
> Sample PGO pass first read in the source->count map, use it to get
> basic_block->count map, and then use heuristics to propagate and get
> branch->probability count. So we don't have Metadata to store the raw
> count, but only the branch probability is stored after sample pgo pass.
>
> I think that, as you indicate below, we'd need to change this to {source,
> disc}->count map --> {BB, disc}->count map --> {branch, disc}->prob map.
> The idea being that the total probability for the branch is the sum of the
> probabilities associated with it for each discriminator value. Most users
> will just sum them up, but some passes will want the breakdowns.
>
>
> Agree. Note that this would add some complexity to BFI/BPI
>
>
>
>
>
> If we are pursuing more context-sensitive profile, that would be a very
> different design. In this design, we will need to read profile multiple
> times, and have the profile tightly coupled with the compiler IR/pass
> manager. That is doable, but I don't think that probably better suits
> instrumentation based PGO's domain. Comments?
>
>
> If we are pursuing more context-sensitive profile, that would be a very
> different design in which we need to read in profiles multiple times, and
> have the profile tightly coupled with the compiler IR/pass manager. That is
> doable, but that probably better suits instrumentation based PGO's domain.
> Comments?
>
> I don't understand why you say we'd need to read the profile multiple
> times. Can you please explain? I also don't think it needs to be that
> tightly coupled; we just need each pass that generates multiple copies of
> things to have some way to generate a unique id for what it's doing. It's
> all per source location, so I don't even think we need any kind of
> complicated hashing scheme.
>
> As explained above, we only have branch probability. Yes, we can record
> more instruction/basicblock count context info in additional metadata. But
> when you use it in optimizer, I suppose you will need to convert it to
> branch probability too?
>
> Let me take an example to demonstrate my understanding of your use of
> context-sensitive profile.
>
> original code:
> for ()
>   stmt1;
>
> optimized code:
> if (cond1)
>   for()
>     stmt1;  //100 samples
> else if (cond2)
>   for()
>     stmt1;  // 0 samples
> else
>   for()
>     stmt1; // 50 samples
>
> The loop was multi-versioned by cond1 and cond2. With my proposal, the
> profile for stmt1 will be combined as 150 samples. If we use this profile
> to optimize the orginal source, after annotation, the code becomes:
>
> for()
>   stmt1 // 150 samples
>
> Later when it comes to the mulit-version optimization, it will transform
> the code to:
>
> if (cond1)
>   for()
>     stmt1;  //50 samples
> else if (cond2)
>   for()
>     stmt1;  // 50 samples
> else
>   for()
>     stmt1; // 50 samples
>
> The 150 samples are evenly distributed to 3 clones as we do not have
> context info.
>
> With your proposal, after annotation, the original code becomes:
>
> for()
>   stmt1 // 150 samples (M1:100, M2:0, M3:50)
>
> Then during mutli-version optimization, the optimizer reads the metadata
> and find M2 is never taken, so it will transform the code to:
>
> if (cond1)
>   for()
>     stmt1;  // 100 samples
> else
>   for()
>     stmt1; // 50 samples
>
> There are two major parts that can benefit from the extra metadata
> recorded for each copy.
>
> 1. It can demonstrate that cond2 is always false and there is no need to
> create that version.
> 2. It can correctly attribute samples to cloned copy (i.e. 100, 50
> respectively in this case)
>
> I think the most important benefit is #1. For #2, my guess is scaled count
> is already good enough.
>
> By what are you scaling the count?
>
>
> It's random. Or most likely, each branch will be 50% taken.
>
>
>
>
> Here comes the problem: cond1 and cond2 are both generated by compiler.
> Then how do we map the metadata back to them? Using uid or special encoding
> seems fragile because when upstream optimization changes, compiler may
> choose to put cond2 in front of cond1 so the uid will change accordingly.
> How would we handle cases like this?
>
> Exactly for this reason, I don't think we can rely on the ordering to
> generate the identifiers; they need to have semantic meaning. I agree that
> this is tricky in general. However, for the cases I have in mind
> (unrolling, vectorization, etc.) these passes don't generate arbitrary
> conditionals, but rather, specific conditionals related to runtime overlap
> checks, trip-count thresholds, and I think that it should be easy to encode
> a discriminator value which can be unambiguously inverted.
>
>
> I'm still not quite clear how to do this. Needs to think more thoroughly.
> But one thing I want to note is that, we want to encode the context info
> only when it will be useful for the optimization. For vectorization/unroll,
> I think accurate trip count is already good enough. So do we want to record
> its context?
>
> Thanks,
> Dehao
>
>
>
> Thanks again,
> Hal
>
>
> Dehao
>
>
> Thanks again,
> Hal
>
>
> Thanks,
> Dehao
>
>
>
> On Tue, Nov 1, 2016 at 1:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>
> ------------------------------
>
> *From: *"Paul Robinson" <paul.robinson at sony.com>
> *To: *"Dehao Chen" <dehao at google.com>, "Hal Finkel" <hfinkel at anl.gov>
> *Cc: *"Xinliang David Li" <davidxl at google.com>, llvm-dev at lists.llvm.org
> *Sent: *Tuesday, November 1, 2016 2:15:38 PM
> *Subject: *RE: [llvm-dev] (RFC) Encoding code duplication factor
> in        discriminator
>
> As illustrated in the above example, it is not like "vectorization has a
> distinct bit". All different optimizations make clones of code which will
> be labeled by UIDs represented by N (e.g. 8) bits. In this way, the space
> will be capped by the number of clones all optimizations have made, instead
> of # of optimizations that has applied. And it will be capped at 2^N-1. The
> cons of using uid is that you will not know if a clone is coming from
> vectorization or unroll or loop distribution.
>
> Okay, but that kind of semantic mapping is important. How should we
> encode/recover that information? To be clear, I'm not saying that we need
> to implement that up front, but there needs to be a clear path to an
> implementation, because I don't want to have two disjoint schemes.
>
>
>
> You mean that you want to know which optimization created the clone? How
> would you use that info? Looks to me this will expose compiler
> implementation detail in debug info.
>
>
>
> This is still doable, assume we have 15 interesting optimizations to
> track, we can use 4 bits to encode the optimization type that created the
> clone. But this becomes nasty if the a clone is created by more than one
> optimizations. In that way, discriminator may not be fit for this purpose.
>
>
>
> My understanding was that the encoding scheme would allow the profiling
> analysis to correctly map execution data back to the original source
> construct, while preserving the property that each distinct basic block
> would have its own discriminator value.  That is, the execution data would
> be attributed back to the original source construct, not whatever each
> individual optimization had done to it, and the data for the original
> source construct would correctly reflect the execution (e.g. profiling says
> you got 82 hits on the original loop, rather than reporting 20 hits on the
> unrolled-by-4 loop plus 1 each on 2 of the trailing copies).
>
>
>
> It sounds like Hal is thinking that the per-discriminator execution info
> would be preserved down to the point where an individual optimization could
> look at the profile for each piece, and make decisions on that basis.
>
>
>
> I'm not clear how that would be possible, as the optimization would have
> to first do the transform (or predict how it would do the transform) in
> order to see which individual-discriminator counts mapped to which actual
> blocks, and then make some kind of decision about whether to do the
> transform differently based on that information.  Then, if the optimization
> did choose to do the transform differently, then that leaves the IR in a
> state where the individual discriminators *cannot* map back to it.  (Say
> you unroll by 2 instead of 4; then you have only 1 trailing copy, not 3,
> and a discriminator that maps to the second trailing copy now maps to
> nothing.  The individual-discriminator data becomes useless.)
>
>
>
> Am I expressing this well enough to show that what Hal is looking for is
> not feasible?
>
> Yes, it will need to predict how the transformation would affect the
> blocks produced. That does not seem problematic (at least at a coarse
> level). Yes, if transformations made earlier in the pipeline make different
> decisions, then that will invalidate later fine-grained data (at least
> potentially). I don't see how any of this makes this infeasible. We just
> need a way for the profiling counts, per descriminator, to remain
> available, and for the transformations themselves to know which
> discriminators (loop ids, or whatever) to consider.
>
>  -Hal
>
> --paulr
>
>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161121/6d033fb4/attachment.html>


More information about the llvm-dev mailing list