[llvm-dev] (RFC) Encoding code duplication factor in discriminator

Dehao Chen via llvm-dev llvm-dev at lists.llvm.org
Tue Nov 1 14:14:43 PDT 2016


Damn... my English is not readable at all when I try to write fast. Let me
try to clarify below and hopefully make it more readable.

On Tue, Nov 1, 2016 at 2:07 PM, Dehao Chen <dehao at google.com> wrote:

> Oops... pressed the wrong button and sent it out early...
>
> On Tue, Nov 1, 2016 at 2:01 PM, Dehao Chen <dehao at google.com> wrote:
>
>> If Hal's proposal is for SamplePGO purposes, let me clarify some design
>> principles of SamplePGO.
>>
>> The profile for SamplePGO uses source location as the key to map
>> execution counts back to the IR. This design is based on the principle
>> that we do not want the profile to be tightly coupled with the compiler
>> IR. Instead, the profile is simply an attribute of the source code. We
>> have benefited a lot from this design: the profile can easily be reused
>> across different source versions and compiler versions, or even across
>> compilers.
>>
>> That being said, the design to encode more info into the discriminator
>> does not mean that we will change the profile. The encoded info in the
>> discriminator will be handled by the create_llvm_prof tool, which
>> combines counts from different clones of the same source code and
>> generates the combined profile data. The output profile will not have
>> any cloning/duplication bits at all. So for the initial example profile
>> I provided, the output profile will be:
>>
>
> #1: 10
> #3: 80
>
> Not:
>
> #1: 10
> #3.0x400: 70
> #3.0x10400: 5
> #3.0x20400: 3
> #3.0x30400: 2
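>
> To make the merge concrete, here is a minimal sketch of the kind of
> fold-up create_llvm_prof would do. The bit layout below (low bits keep
> the original discriminator, higher bits carry the cloning info) is an
> assumption for illustration only, not the final encoding:
>
> #include <cstdint>
> #include <map>
> #include <utility>
> #include <vector>
>
> // Assumed for illustration: the low 10 bits keep the original
> // discriminator; everything above carries cloning/duplication info.
> constexpr uint32_t BaseDiscBits = 10;
> constexpr uint32_t BaseDiscMask = (1u << BaseDiscBits) - 1;
>
> struct RawSample {
>   unsigned Line;          // source line, e.g. 3
>   uint32_t Discriminator; // e.g. 0x10400
>   uint64_t Count;         // e.g. 5
> };
>
> // Fold counts from all clones of the same source location into one
> // entry, so the emitted profile carries no cloning/duplication bits.
> std::map<std::pair<unsigned, uint32_t>, uint64_t>
> combineClones(const std::vector<RawSample> &Samples) {
>   std::map<std::pair<unsigned, uint32_t>, uint64_t> Combined;
>   for (const RawSample &S : Samples)
>     Combined[{S.Line, S.Discriminator & BaseDiscMask}] += S.Count;
>   return Combined;
> }
>
> Running this over the four line-3 entries above sums 70 + 5 + 3 + 2
> into the single "#3: 80".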
>
> The goal of the proposed change is to make the profile more accurately
> represent attributes of the source.
> The non-goal of the proposed change is to provide more context in the
> profile to represent the behavior of the program in different contexts.
>

>
> If we are pursuing a more context-sensitive profile, that would be a very
> different design, in which we would need to read the profile multiple
> times and have the profile tightly coupled with the compiler IR/pass
> manager. That is doable, but it probably better suits instrumentation
> based PGO's domain. Comments?
>

>
> Thanks,
> Dehao
>
>
>>
>> On Tue, Nov 1, 2016 at 1:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>>
>>> ------------------------------
>>>
>>> *From: *"Paul Robinson" <paul.robinson at sony.com>
>>> *To: *"Dehao Chen" <dehao at google.com>, "Hal Finkel" <hfinkel at anl.gov>
>>> *Cc: *"Xinliang David Li" <davidxl at google.com>, llvm-dev at lists.llvm.org
>>> *Sent: *Tuesday, November 1, 2016 2:15:38 PM
>>> *Subject: *RE: [llvm-dev] (RFC) Encoding code duplication factor in discriminator
>>>
>>> As illustrated in the above example, it is not like "vectorization has a
>>> distinct bit". All the different optimizations make clones of code, which
>>> will be labeled by UIDs represented by N (e.g. 8) bits. In this way, the
>>> space will be capped by the number of clones all optimizations have made,
>>> instead of the number of optimizations that have been applied. And it
>>> will be capped at 2^N-1. The con of using UIDs is that you will not know
>>> whether a clone is coming from vectorization, unrolling, or loop
>>> distribution.
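>>>
>>> As a rough sketch of the bookkeeping (the names and the 8-bit width
>>> are just this example, not a settled design): one per-function counter
>>> hands out UIDs to every duplicating pass, so the space is consumed by
>>> the clones actually made rather than by the passes that ran.
>>>
>>> #include <cstdint>
>>>
>>> constexpr uint32_t CloneUIDBits = 8;                       // N = 8
>>> constexpr uint32_t MaxCloneUID = (1u << CloneUIDBits) - 1; // 2^N - 1
>>>
>>> // Shared by unroll, vectorize, distribute, ... within one function.
>>> struct CloneUIDAllocator {
>>>   uint32_t Next = 1; // 0 is reserved for the original copy
>>>   // Returns 0 once the UID space is exhausted; such clones fall back
>>>   // to the original location and lose per-clone resolution.
>>>   uint32_t allocate() { return Next > MaxCloneUID ? 0 : Next++; }
>>> };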
>>>
>>> Okay, but that kind of semantic mapping is important. How should we
>>> encode/recover that information? To be clear, I'm not saying that we need
>>> to implement that up front, but there needs to be a clear path to an
>>> implementation, because I don't want to have two disjoint schemes.
>>>
>>>
>>>
>>> You mean that you want to know which optimization created the clone? How
>>> would you use that info? It looks to me like this would expose compiler
>>> implementation details in the debug info.
>>>
>>>
>>>
>>> This is still doable: assuming we have 15 interesting optimizations to
>>> track, we can use 4 bits to encode the optimization type that created the
>>> clone. But this becomes nasty if a clone is created by more than one
>>> optimization. In that case, the discriminator may not be a good fit for
>>> this purpose.
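>>>
>>> Purely as illustration (not a settled layout), the tag could look like
>>> this, with the packing order an assumption:
>>>
>>> #include <cstdint>
>>>
>>> // Illustrative packing: [opt-type:4][clone-uid:8][base discriminator].
>>> enum class CloneKind : uint8_t {
>>>   None = 0,
>>>   Unroll = 1,
>>>   Vectorize = 2,
>>>   Distribute = 3,
>>>   // ... up to 15 distinct optimizations
>>> };
>>>
>>> But a block that is first unrolled and then vectorized would need two
>>> tags, which a single 4-bit field cannot express.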
>>>
>>>
>>>
>>> My understanding was that the encoding scheme would allow the profiling
>>> analysis to correctly map execution data back to the original source
>>> construct, while preserving the property that each distinct basic block
>>> would have its own discriminator value.  That is, the execution data would
>>> be attributed back to the original source construct, not whatever each
>>> individual optimization had done to it, and the data for the original
>>> source construct would correctly reflect the execution (e.g. profiling says
>>> you got 82 hits on the original loop, rather than reporting 20 hits on the
>>> unrolled-by-4 loop plus 1 each on 2 of the trailing copies).
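>>>
>>> (For the arithmetic: 20 trips through the unrolled-by-4 body cover
>>> 20 * 4 = 80 of the original iterations, and the 2 executed trailing
>>> copies add 1 each, giving the 82 hits attributed back to the original
>>> loop.)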
>>>
>>>
>>>
>>> It sounds like Hal is thinking that the per-discriminator execution info
>>> would be preserved down to the point where an individual optimization could
>>> look at the profile for each piece, and make decisions on that basis.
>>>
>>>
>>>
>>> I'm not clear how that would be possible, as the optimization would have
>>> to first do the transform (or predict how it would do the transform) in
>>> order to see which individual-discriminator counts mapped to which actual
>>> blocks, and then make some kind of decision about whether to do the
>>> transform differently based on that information. If the optimization
>>> then chose to do the transform differently, that would leave the IR in a
>>> state where the individual discriminators *cannot* map back to it. (Say
>>> you unroll by 2 instead of 4; then you have only 1 trailing copy, not 3,
>>> and a discriminator that maps to the second trailing copy now maps to
>>> nothing. The individual-discriminator data becomes useless.)
>>>
>>>
>>>
>>> Am I expressing this well enough to show that what Hal is looking for is
>>> not feasible?
>>>
>>> Yes, it will need to predict how the transformation would affect the
>>> blocks produced. That does not seem problematic (at least at a coarse
>>> level). Yes, if transformations made earlier in the pipeline make different
>>> decisions, then that will invalidate later fine-grained data (at least
>>> potentially). I don't see how any of this makes this infeasible. We just
>>> need a way for the profiling counts, per discriminator, to remain
>>> available, and for the transformations themselves to know which
>>> discriminators (loop IDs, or whatever) to consider.
>>>
>>>  -Hal
>>>
>>> --paulr
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>
>>
>