[LLVMdev] RFC - Improvements to PGO profile support
Philip Reames
listmail at philipreames.com
Tue Mar 24 10:13:16 PDT 2015
On 03/24/2015 09:59 AM, Philip Reames wrote:
> On 03/10/2015 10:14 AM, Diego Novillo wrote:
>>
>>
>> On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.wilson at apple.com> wrote:
>>
>>
>>> On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at google.com> wrote:
>>>
>>> On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnovillo at google.com> wrote:
>>>
>>> I've created a few bugzilla issues with details of some of
>>> the things I'll be looking into. I'm not yet done
>>> wordsmithing the overall design document. I'll try to finish
>>> it by early next week at the latest.
>>>
>>>
>>> The document is available at
>>>
>>> https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>>>
>>> There are several topics covered. Ideally, I would prefer that
>>> we discuss each topic separately. The main ones I will start
>>> working on are the ones described in the bugzilla links we have
>>> in the doc.
>>>
>>> This is just a starting point for us. I am not at all concerned
>>> with implementing exactly what is proposed in the document. In
>>> fact, if we can get the same value using the existing support,
>>> all the better.
>>>
>>> OTOH, any other ideas that folks may have that work better than
>>> this are more than welcome. I don't have really strong opinions
>>> on the matter. I am fine with whatever works.
>>
>> Thanks for the detailed write-up on this. Some of the issues
>> definitely need to be addressed. I am concerned, though, that
>> some of the ideas may be leading toward a scenario where we have
>> essentially two completely different ways of representing profile
>> information in LLVM IR. It is great to have two complementary
>> approaches to collecting profile data, but two representations in
>> the IR would not make sense.
>>
>>
>> Yeah, I don't think I'll continue to push for a new MD_count
>> attribute. If we were to make MD_prof be a "real" execution count,
>> that would be enough. Note that by re-using MD_prof we are not
>> changing its meaning at all. The execution count is still a weight
>> and the ratio is still a branch probability. All we are changing is
>> the absolute values of the numbers, and we are widening the data type
>> to remove the 32-bit limitation.
> Independent of everything else, relaxing the 32 bit restriction is
> clearly a good idea. This would make a great standalone patch.
>>
>>
>>
>> The first issue raised is that profile execution counts are not
>> represented in the IR. This was a very intentional decision. I
>> know it goes against what other compilers have done in the past.
>> It took me a while to get used to the idea when Andy first
>> suggested it, so I know it seems awkward at first. The advantage
>> is that branch probabilities are much easier to keep updated in
>> the face of compiler transformations, compared to execution counts.
>>
>>
>> Sorry. I don't follow. Updating counts as the CFG is transformed is
>> not difficult at all. What examples do you have in mind? The big
>> advantage of making MD_prof an actual execution count is that it is a
>> meaningful metric wrt scaling and transformation.
>>
>> Say, for instance, that we have a branch instruction with two targets
>> with counts {100, 300} inside a function 'foo' that has entry count
>> 2. The edge probability for the first edge (count 100) is
>> 100/(100+300) = 25%.
>>
>> If we inline foo() inside another function bar() at a callsite with
>> profile count == 1, the cloned branch instruction gets its counters
>> scaled with the callsite count. So the new branch has counts {100 * 1
>> / 2, 300 * 1 / 2} = {50, 150}. But the branch probability did not
>> change. Currently, we are cloning the branch without changing the
>> edge weights.
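For concreteness, here is the same arithmetic as a small standalone C
program. This is only a sketch of the scaling being described; the helper
name scale_weight and the exact numbers are illustrative, not anything
taken from LLVM's implementation.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helper: scale a callee branch weight by
       callsite_count / entry_count when it is cloned into the caller. */
    static uint64_t scale_weight(uint64_t weight, uint64_t callsite_count,
                                 uint64_t entry_count) {
      return weight * callsite_count / entry_count;
    }

    int main(void) {
      uint64_t w0 = 100, w1 = 300;   /* branch counts inside foo()      */
      uint64_t entry = 2;            /* entry count of foo()            */
      uint64_t callsite = 1;         /* count of the call site in bar() */

      uint64_t s0 = scale_weight(w0, callsite, entry);   /* 50  */
      uint64_t s1 = scale_weight(w1, callsite, entry);   /* 150 */

      /* The edge probability is unchanged by the scaling: 25% both times. */
      printf("before: %.0f%%\n", 100.0 * w0 / (w0 + w1));
      printf("after:  %.0f%% (counts {%llu, %llu})\n",
             100.0 * s0 / (s0 + s1),
             (unsigned long long)s0, (unsigned long long)s1);
      return 0;
    }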
>>
>> This scaling is not difficult at all and can be done incrementally very
>> quickly. We cannot afford to recompute all frequencies on the fly
>> because it would be detrimental to compile time. If foo() itself has
>> N callees inlined into it, each inlined callee needs to trigger a
>> re-computation. When foo() is inlined into bar(), the frequencies
>> will need to be recomputed for foo() and all N callees inlined into
>> foo().
> It really sounds like your proposal is to essentially eagerly compute
> scaling rather than lazily compute it on demand. Assuming perfect
> implementations for both (with no rounding losses), the results should
> be the same. Is that a correct restatement? I'm going to hold off on
> responding to why that's a bad idea until you confirm, because I'm not
> sure I follow what you're trying to say. :)
>
> Also, trusting exact entry counts is going to be somewhat suspect.
> These are *highly* susceptible to racy updates, overflow, etc...
> Anything which puts too much implicit trust in these numbers is going
> to be problematic.
>>
>>
>> We are definitely missing the per-function execution counts that
>> are needed to be able to compare relative “hotness” across
>> functions, and I think that would be a good place to start making
>> improvements. In the long term, we should keep our options open
>> to making major changes, but before we go there, we should try to
>> make incremental improvements to fix the existing infrastructure.
>>
>>
>> Right, and that's the core of our proposal. We don't really want to
>> make major infrastructure changes at this point. The only thing I'd
>> like to explore is making MD_prof a real count. This will be useful
>> for the inliner changes and it would also make incremental updates
>> easier, because the scaling that needs to be done is very
>> straightforward and quick.
>>
>> Note that this change should not modify the current behaviour we get
>> from profile analysis. Things that were hot before should continue to
>> be hot now.
> I have no objection to adding a mechanism for expressing an entry
> count. I am still very hesitant about the proposals with regards to
> redefining the current MD_prof.
>
> I'd encourage you to post a patch for the entry count mechanism, but
> not tie its semantics to exact execution count. (Something like "the
> value provided must correctly describe the relative hotness of this
> routine against others in the program annotated with the same
> metadata. It is the relative scaling that is important, not the
> absolute value. In particular, the value need not be an exact
> execution count.")
>>
>>
>> Many of the other issues you raise seem like they could also be
>> addressed without major changes to the existing infrastructure.
>> Let’s try to fix those first.
>>
>>
>> That's exactly the point of the proposal. We definitely don't want
>> to make major changes to the infrastructure at first. My thinking is
>> to start working on making MD_prof a real count. One of the things
>> happening now is that the combination of real profile data and the
>> frequency propagation we are currently doing is misleading the
>> analysis.
> I consider this a major change. You're trying to redefine a major
> part of the current system.
>
> Multiple people have spoken up and objected to this change (as
> currently described). Please start somewhere else.
Sorry, "objected" is too strong a word here. I should have said "have
reservations about".
My point here - which I'm not sure I expressed well - is that I'm
concerned we're going to bog down on a controversial change rather than
make progress on the parts we agree on. I'm not saying that we should
> never redefine things in the way you're proposing, but I would like to see
the parts that work under both schemes be addressed first. We can
continue this discussion in parallel.
>>
>> For example (thanks David for the code and data). In the following code:
>>
>> int g;
>> __attribute__((noinline)) void bar() {
>>   g++;
>> }
>>
>> extern int printf(const char*, ...);
>>
>> int main()
>> {
>>   int i, j, k;
>>
>>   g = 0;
>>
>>   // Loop 1.
>>   for (i = 0; i < 100; i++)
>>     for (j = 0; j < 100; j++)
>>       for (k = 0; k < 100; k++)
>>         bar();
>>
>>   printf ("g = %d\n", g);
>>   g = 0;
>>
>>   // Loop 2.
>>   for (i = 0; i < 100; i++)
>>     for (j = 0; j < 10000; j++)
>>       bar();
>>
>>   printf ("g = %d\n", g);
>>   g = 0;
>>
>>
>>   // Loop 3.
>>   for (i = 0; i < 1000000; i++)
>>     bar();
>>
>>   printf ("g = %d\n", g);
>>   g = 0;
>> }
>>
>> When compiled with profile instrumentation, frequency propagation is
>> distorting the real profile because it gives different frequencies to
>> the calls to bar() in the 3 different loops. All 3 loops execute
>> 1,000,000 times, but after frequency propagation, the first call to
>> bar() gets a weight of 520,202 in loop #1, 210,944 in loop #2 and
>> 4,096 in loop #3. In reality, every call to bar() should have a
>> weight of 1,000,000.
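As a quick sanity check on the claim above, the loop bounds in the example
really do multiply out to the same total for all three loops; a tiny
program to confirm the arithmetic:

    #include <stdio.h>

    int main(void) {
      /* Calls to bar() implied by the loop bounds in the example above. */
      printf("loop 1: %d\n", 100 * 100 * 100);  /* 1000000 */
      printf("loop 2: %d\n", 100 * 10000);      /* 1000000 */
      printf("loop 3: %d\n", 1000000);          /* 1000000 */
      return 0;
    }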
> Duncan responded to this. My conclusion from his response: this is a
> bug, not a fundamental issue. Remove the max scaling factor, switch
> the counts to 64 bits and everything should be fine. If you disagree,
> let's discuss.
>>
>>
>> Thanks. Diego.
>>
>>