[LLVMdev] RFC - Improvements to PGO profile support

Philip Reames listmail at philipreames.com
Tue Mar 24 13:57:30 PDT 2015


On 03/24/2015 10:27 AM, Xinliang David Li wrote:
> Diego and I have discussed this in light of the feedback received. We
> have a revised plan for this (see Diego's last reply).  Here is a more
> detailed recap:
>
> 1) keep the MD_prof definition as it is today; also keep using
> frequency propagation as it is (assuming programs with irreducible
> loops are not common and not important; if that turns out to be
> wrong, we will revisit this).
> 2) fix all problems that lead to wrong frequencies/counts being
> computed by the frequency propagation algorithm
>     2.1) relax the 32-bit limit
>     2.2) remove the 4096 iteration count cap
>     2.3) remove the 'Laplace rule of succession', which can be very harmful
>     2.4) fix the bug in 'BranchProbabilityInfo::calcMetadataWeights'
> that does capping without scaling
>     2.5) fix a similar bug in 'MachineBranchProbabilityInfo::getSumForBlock()'
>     ... and other bugs found along the way
> 3) introduce new metadata to represent a function's global entry
> hotness (the implementation can choose to use the entry execution count)
> 4) implement frequency/profile update interfaces for inline
> transformations -- i.e., allow importing the callee's profile info into
> the caller with a scaling factor computed from the callsite count and
> the callee entry count
> 5) implement a mechanism to compute, record, and use program-wide
> profile summary data (this will be discussed separately).
This sounds like a very workable approach to me.  I think 2.3 needs a
bit more discussion, but given that it doesn't really directly affect me, I'm
probably going to stay out of it for now.
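
For reference, here is a quick self-contained sketch of what 2.3 and 2.4 are
about.  This is plain C++, not LLVM code; it assumes the Laplace adjustment
has the usual "add one to every count" form and that the 2.4 bug amounts to
clamping each weight to a limit on its own:

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// 2.3: the Laplace rule of succession adds 1 to every weight.  That is noise
// for large counts, but it invents a visible probability for an arm that was
// never taken once the counts are small (or have been scaled down).
static void laplace(std::vector<uint64_t> &Weights) {
  for (uint64_t &W : Weights)
    W += 1;
}

// 2.4: clamping each weight to a limit independently changes the ratio
// between the weights, i.e. the branch probabilities.
static void capWithoutScaling(std::vector<uint64_t> &Weights, uint64_t Limit) {
  for (uint64_t &W : Weights)
    W = std::min(W, Limit);
}

// The ratio-preserving alternative: divide every weight by one common factor.
// (A real implementation would also avoid producing zero weights.)
static void capWithScaling(std::vector<uint64_t> &Weights, uint64_t Limit) {
  uint64_t Max = *std::max_element(Weights.begin(), Weights.end());
  if (Max <= Limit)
    return;
  uint64_t Scale = Max / Limit + 1;
  for (uint64_t &W : Weights)
    W /= Scale;
}

static void dump(const char *Tag, const std::vector<uint64_t> &Weights) {
  double Sum = 0;
  for (uint64_t W : Weights)
    Sum += static_cast<double>(W);
  std::printf("%-22s {%llu, %llu} -> %.1f%% / %.1f%%\n", Tag,
              (unsigned long long)Weights[0], (unsigned long long)Weights[1],
              100.0 * Weights[0] / Sum, 100.0 * Weights[1] / Sum);
}

int main() {
  std::vector<uint64_t> A = {0, 3};   // an arm never taken vs. taken 3 times
  laplace(A);
  dump("laplace on {0, 3}:", A);      // {1, 4}: 20% for an arm never taken

  std::vector<uint64_t> B = {8000000, 2000000};   // an 80% / 20% branch
  std::vector<uint64_t> C = B;
  capWithoutScaling(B, 4096);
  dump("cap without scaling:", B);    // {4096, 4096}: distorted to 50% / 50%
  capWithScaling(C, 4096);
  dump("cap with scaling:", C);       // {4094, 1023}: still ~80% / 20%
  return 0;
}

The fix for 2.4 is the second variant: scale all the weights by one common
factor instead of clamping them one at a time.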

Thanks for working on this!
>
>
> thanks,
>
> David
>
>
>
> On Tue, Mar 24, 2015 at 9:59 AM, Philip Reames
> <listmail at philipreames.com> wrote:
>> On 03/10/2015 10:14 AM, Diego Novillo wrote:
>>
>>
>>
>> On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.wilson at apple.com> wrote:
>>>
>>> On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at google.com> wrote:
>>>
>>> On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnovillo at google.com>
>>> wrote:
>>>
>>>> I've created a few bugzilla issues with details of some of the things
>>>> I'll be looking into. I'm not yet done wordsmithing the overall design
>>>> document. I'll try to finish it by early next week at the latest.
>>>
>>> The document is available at
>>>
>>>
>>> https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>>>
>>> There are several topics covered. Ideally, I would prefer that we discuss
>>> each topic separately. The main ones I will start working on are the ones
>>> described in the bugzilla links we have in the doc.
>>>
>>> This is just a starting point for us. I am not at all concerned with
>>> implementing exactly what is proposed in the document. In fact, if we can
>>> get the same value using the existing support, all the better.
>>>
>>> OTOH, any other ideas that folks may have that work better than this are
>>> more than welcome. I don't have really strong opinions on the matter. I am
>>> fine with whatever works.
>>>
>>>
>>> Thanks for the detailed write-up on this. Some of the issues definitely
>>> need to be addressed. I am concerned, though, that some of the ideas may be
>>> leading toward a scenario where we have essentially two completely different
>>> ways of representing profile information in LLVM IR. It is great to have two
>>> complementary approaches to collecting profile data, but two representations
>>> in the IR would not make sense.
>>
>> Yeah, I don't think I'll continue to push for a new MD_count attribute. If
>> we were to make MD_prof be a "real" execution count, that would be enough.
>> Note that by re-using MD_prof we are not changing its meaning at all. The
>> execution count is still a weight, and the ratio is still the branch
>> probability. All we are changing is the absolute value of the numbers, and
>> we are widening their data type to remove the 32-bit limitation.
>>
>> Independent of everything else, relaxing the 32 bit restriction is clearly a
>> good idea.  This would make a great standalone patch.
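>>
>> To make that concrete: the probability attached to an edge is just its
>> weight divided by the sum of the weights, so widening the storage to 64 bits
>> (or scaling every weight by a common factor) changes none of the computed
>> probabilities -- it only makes room for real execution counts.  A minimal
>> sketch in plain C++ (not LLVM's metadata API):
>>
>> #include <cstdint>
>> #include <cstdio>
>>
>> // The edge probability is a ratio of weights; the absolute magnitude and
>> // the integer width used to store them do not matter.
>> static double edgeProbability(uint64_t Weight, uint64_t Sum) {
>>   return static_cast<double>(Weight) / static_cast<double>(Sum);
>> }
>>
>> int main() {
>>   // A real execution count that no longer fits in a 32-bit weight.
>>   uint64_t Taken = 6000000000ULL, NotTaken = 2000000000ULL;
>>   uint64_t Sum = Taken + NotTaken;
>>   std::printf("taken: %.2f\n", edgeProbability(Taken, Sum));  // 0.75
>>
>>   // Scaling both weights by the same factor changes nothing.
>>   std::printf("taken: %.2f\n", edgeProbability(Taken / 1000, Sum / 1000));
>>   return 0;
>> }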
>>
>>
>>
>>> The first issue raised is that profile execution counts are not
>>> represented in the IR. This was a very intentional decision. I know it goes
>>> against what other compilers have done in the past. It took me a while to
>>> get used to the idea when Andy first suggested it, so I know it seems
>>> awkward at first. The advantage is that branch probabilities are much easier
>>> to keep updated in the face of compiler transformations, compared to
>>> execution counts.
>>
>> Sorry. I don't follow. Updating counts as the CFG is transformed is not
>> difficult at all. What examples do you have in mind?  The big advantage of
>> making MD_prof an actual execution count is that it is a meaningful metric
>> with respect to scaling and transformation.
>>
>> Say, for instance, that we have a branch instruction with two targets with
>> counts {100, 300} inside a function 'foo' that has entry count 2. The edge
>> probability for the first edge (count 100) is 100/(100+300) = 25%.
>>
>> If we inline foo() inside another function bar() at a callsite with profile
>> count == 1, the cloned branch instruction gets its counters scaled with the
>> callsite count. So the new branch has counts {100 * 1 / 2, 300 * 1 / 2} =
>> {50, 150}.  But the branch probability did not change. Currently, we are
>> cloning the branch without changing the edge weights.
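>>
>> A minimal sketch of that update in plain C++ (this is not the actual inliner
>> code; the function name is made up for illustration):
>>
>> #include <cstdint>
>> #include <cstdio>
>> #include <vector>
>>
>> // When a call site with count CallSiteCount is inlined, every branch weight
>> // cloned from the callee is scaled by CallSiteCount / CalleeEntryCount.
>> // The ratios between the weights -- the branch probabilities -- do not
>> // change.  A real implementation would also guard against overflow and a
>> // zero callee entry count.
>> static std::vector<uint64_t>
>> scaleClonedWeights(std::vector<uint64_t> Weights, uint64_t CallSiteCount,
>>                    uint64_t CalleeEntryCount) {
>>   for (uint64_t &W : Weights)
>>     W = W * CallSiteCount / CalleeEntryCount;
>>   return Weights;
>> }
>>
>> int main() {
>>   // foo() has entry count 2 and a branch with counts {100, 300} (25%/75%).
>>   // Inlined at a call site with count 1, the clone gets {50, 150}: still
>>   // 25%/75%.
>>   std::vector<uint64_t> Scaled =
>>       scaleClonedWeights({100, 300}, /*CallSiteCount=*/1,
>>                          /*CalleeEntryCount=*/2);
>>   std::printf("{%llu, %llu}\n", (unsigned long long)Scaled[0],
>>               (unsigned long long)Scaled[1]);
>>   return 0;
>> }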
>>
>> This scaling is not difficult at all and can be done incrementally very
>> quickly. We cannot afford to recompute all frequencies on the fly because it
>> would be detrimental to compile time: if foo() itself has N callees inlined
>> into it, then when foo() is inlined into bar(), the frequencies would need
>> to be recomputed for foo() and all N callees inlined into foo().
>>
>> It really sounds like your proposal is to essentially eagerly compute the
>> scaling rather than lazily compute it on demand.  Assuming perfect
>> implementations for both (with no rounding losses), the results should be
>> the same.  Is that a correct restatement?  I'm going to hold off on
>> responding to why that's a bad idea until you confirm, because I'm not sure
>> I follow what you're trying to say.  :)
>>
>> Also, trusting exact entry counts is going to be somewhat suspect.  These
>> are *highly* susceptible to racy updates, overflow, etc...  Anything which
>> puts too much implicit trust in these numbers is going to be problematic.
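>>
>> To illustrate the racy-update point: instrumentation counters are typically
>> bumped with plain, non-atomic increments for speed, so concurrent updates
>> can simply be lost.  A self-contained demonstration in plain C++ (nothing to
>> do with LLVM's profiling runtime; the race on the plain counter is
>> deliberate):
>>
>> #include <atomic>
>> #include <cstdint>
>> #include <cstdio>
>> #include <thread>
>>
>> // Two threads each bump both counters 1,000,000 times.  The plain counter
>> // typically loses updates under contention; the atomic one does not, but it
>> // is more expensive, which is why instrumentation usually avoids it.
>> static uint64_t PlainCounter = 0;
>> static std::atomic<uint64_t> AtomicCounter{0};
>>
>> static void work() {
>>   for (int I = 0; I < 1000000; ++I) {
>>     ++PlainCounter;                                       // racy increment
>>     AtomicCounter.fetch_add(1, std::memory_order_relaxed);
>>   }
>> }
>>
>> int main() {
>>   std::thread T1(work), T2(work);
>>   T1.join();
>>   T2.join();
>>   // Both "should" be 2,000,000; the plain counter usually comes up short.
>>   std::printf("plain=%llu atomic=%llu\n", (unsigned long long)PlainCounter,
>>               (unsigned long long)AtomicCounter.load());
>>   return 0;
>> }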
>>
>>
>>
>>> We are definitely missing the per-function execution counts that are
>>> needed to be able to compare relative “hotness” across functions, and I
>>> think that would be a good place to start making improvements. In the long
>>> term, we should keep our options open to making major changes, but before we
>>> go there, we should try to make incremental improvements to fix the existing
>>> infrastructure.
>>
>> Right, and that's the core of our proposal. We don't really want to make
>> major infrastructure changes at this point. The only thing I'd like to
>> explore is making MD_prof a real count. This will be useful for the inliner
>> changes and it would also make incremental updates easier, because the
>> scaling that needs to be done is very straightforward and quick.
>>
>> Note that this change should not modify the current behaviour we get from
>> profile analysis. Things that were hot before should continue to be hot now.
>>
>> I have no objection to adding a mechanism for expressing an entry count.  I
>> am still very hesitant about the proposals with regards to redefining the
>> current MD_prof.
>>
>> I'd encourage you to post a patch for the entry count mechanism, but not tie
>> its semantics to an exact execution count.  (Something like "the value
>> provided must correctly describe the relative hotness of this routine
>> against others in the program annotated with the same metadata.  It is the
>> relative scaling that is important, not the absolute value.  In particular,
>> the value need not be an exact execution count.")
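>>
>> One way to read that suggested semantics (purely illustrative plain C++; the
>> names and the hotness threshold are invented for the example):
>>
>> #include <algorithm>
>> #include <cstdint>
>> #include <cstdio>
>> #include <vector>
>>
>> // Only the ratio between entry values matters: dividing every value by the
>> // same constant must not change which functions are considered hot.
>> struct FuncEntry {
>>   const char *Name;
>>   uint64_t EntryValue;  // relative hotness, not necessarily an exact count
>> };
>>
>> static bool isHot(const FuncEntry &F, uint64_t MaxEntryValue) {
>>   // Hypothetical policy: "hot" means within 1/100th of the hottest entry.
>>   return F.EntryValue * 100 >= MaxEntryValue;
>> }
>>
>> int main() {
>>   std::vector<FuncEntry> Funcs = {{"main", 1}, {"bar", 1000000}, {"qux", 50}};
>>   uint64_t Max = 0;
>>   for (const FuncEntry &F : Funcs)
>>     Max = std::max(Max, F.EntryValue);
>>   for (const FuncEntry &F : Funcs)
>>     std::printf("%s: %s\n", F.Name, isHot(F, Max) ? "hot" : "cold");
>>   return 0;
>> }
>>
>> Dividing every EntryValue by the same constant leaves the classification
>> unchanged, which is exactly the "relative scaling, not absolute value"
>> property.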
>>
>>
>>> Many of the other issues you raise seem like they could also be addressed
>>> without major changes to the existing infrastructure. Let’s try to fix those
>>> first.
>>
>> That's exactly the point of the proposal.  We definitely don't want to make
>> major changes to the infrastructure at first. My thinking is to start
>> working on making MD_prof a real count. One of the things that is happening
>> is that the combination of real profile data and the frequency propagation
>> that we are currently doing is misleading the analysis.
>>
>> I consider this a major change.  You're trying to redefine a major part of
>> the current system.
>>
>> Multiple people have spoken up and objected to this change (as currently
>> described).  Please start somewhere else.
>>
>>
>> For example (thanks, David, for the code and data), consider the following code:
>>
>> int g;
>> __attribute__((noinline)) void bar() {
>>   g++;
>> }
>>
>> extern int printf(const char*, ...);
>>
>> int main()
>> {
>>    int i, j, k;
>>
>>    g = 0;
>>
>>    // Loop 1.
>>    for (i = 0; i < 100; i++)
>>      for (j = 0; j < 100; j++)
>>         for (k = 0; k < 100; k++)
>>             bar();
>>
>>    printf ("g = %d\n", g);
>>    g = 0;
>>
>>    // Loop 2.
>>    for (i = 0; i < 100; i++)
>>      for (j = 0; j < 10000; j++)
>>          bar();
>>
>>    printf ("g = %d\n", g);
>>    g = 0;
>>
>>
>>    // Loop 3.
>>    for (i = 0; i < 1000000; i++)
>>      bar();
>>
>>    printf ("g = %d\n", g);
>>    g = 0;
>> }
>>
>> When compiled with profile instrumentation, frequency propagation distorts
>> the real profile because it gives different frequencies to the calls to
>> bar() in the three different loops. All three loops call bar() 1,000,000
>> times, but after frequency propagation, the call to bar() gets a weight of
>> 520,202 in loop #1, 210,944 in loop #2, and 4,096 in loop #3. In reality,
>> every call to bar() should have a weight of 1,000,000.
>>
>> Duncan responded to this.  My conclusion from his response: this is a bug,
>> not a fundamental issue.  Remove the max scaling factor, switch the counts
>> to 64 bits and everything should be fine.  If you disagree, let's discuss.
>>
>>
>>
>> Thanks.  Diego.
>>
>>
>>




