[LLVMdev] RFC - Improvements to PGO profile support
Philip Reames
listmail at philipreames.com
Tue Mar 24 10:13:16 PDT 2015
On 03/24/2015 09:59 AM, Philip Reames wrote:
> On 03/10/2015 10:14 AM, Diego Novillo wrote:
>>
>>
>> On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.wilson at apple.com> wrote:
>>
>>
>>> On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at google.com> wrote:
>>>
>>> On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnovillo at google.com> wrote:
>>>
>>> I've created a few bugzilla issues with details of some of
>>> the things I'll be looking into. I'm not yet done
>>> wordsmithing the overall design document. I'll try to finish
>>> it by early next week at the latest.
>>>
>>>
>>> The document is available at
>>>
>>> https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
>>>
>>> There are several topics covered. Ideally, I would prefer that
>>> we discuss each topic separately. The main ones I will start
>>> working on are the ones described in the bugzilla links we have
>>> in the doc.
>>>
>>> This is just a starting point for us. I am not at all concerned
>>> with implementing exactly what is proposed in the document. In
>>> fact, if we can get the same value using the existing support,
>>> all the better.
>>>
>>> OTOH, any other ideas that folks may have that work better than
>>> this are more than welcome. I don't have really strong opinions
>>> on the matter. I am fine with whatever works.
>>
>> Thanks for the detailed write-up on this. Some of the issues
>> definitely need to be addressed. I am concerned, though, that
>> some of the ideas may be leading toward a scenario where we have
>> essentially two completely different ways of representing profile
>> information in LLVM IR. It is great to have two complementary
>> approaches to collecting profile data, but two representations in
>> the IR would not make sense.
>>
>>
>> Yeah, I don't think I'll continue to push for a new MD_count
>> attribute. If we were to make MD_prof be a "real" execution count,
>> that would be enough. Note that by re-using MD_prof we are not
>> changing its meaning at all. The execution count is still a weight
>> and the ratio is still a branch probability. All we are changing is
>> the absolute values of the numbers, and we are widening the data type
>> to remove the 32-bit limitation.
> Independent of everything else, relaxing the 32 bit restriction is
> clearly a good idea. This would make a great standalone patch.
>>
>>
>>
>> The first issue raised is that profile execution counts are not
>> represented in the IR. This was a very intentional decision. I
>> know it goes against what other compilers have done in the past.
>> It took me a while to get used to the idea when Andy first
>> suggested it, so I know it seems awkward at first. The advantage
>> is that branch probabilities are much easier to keep updated in
>> the face of compiler transformations, compared to execution counts.
>>
>>
>> Sorry. I don't follow. Updating counts as the CFG is transformed is
>> not difficult at all. What examples do you have in mind? The big
>> advantage of making MD_prof an actual execution count is that it is a
>> meaningful metric wrt scaling and transformation.
>>
>> Say, for instance, that we have a branch instruction with two targets
>> with counts {100, 300} inside a function 'foo' that has entry count
>> 2. The edge probability for the first edge (count 100) is
>> 100/(100+300) = 25%.
>>
>> If we inline foo() inside another function bar() at a callsite with
>> profile count == 1, the cloned branch instruction gets its counters
>> scaled with the callsite count. So the new branch has counts {100 * 1
>> / 2, 300 * 1 / 2} = {50, 150}. But the branch probability did not
>> change. Currently, we are cloning the branch without changing the
>> edge weights.
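For concreteness, here is the same arithmetic as a small standalone C
program. This is only a sketch of the scaling being described; the helper
name scale_weight and the exact numbers are illustrative, not anything
taken from LLVM's implementation.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helper: scale a callee branch weight by
       callsite_count / entry_count when it is cloned into the caller. */
    static uint64_t scale_weight(uint64_t weight, uint64_t callsite_count,
                                 uint64_t entry_count) {
      return weight * callsite_count / entry_count;
    }

    int main(void) {
      uint64_t w0 = 100, w1 = 300;   /* branch counts inside foo()      */
      uint64_t entry = 2;            /* entry count of foo()            */
      uint64_t callsite = 1;         /* count of the call site in bar() */

      uint64_t s0 = scale_weight(w0, callsite, entry);   /* 50  */
      uint64_t s1 = scale_weight(w1, callsite, entry);   /* 150 */

      /* The edge probability is unchanged by the scaling: 25% both times. */
      printf("before: %.0f%%\n", 100.0 * w0 / (w0 + w1));
      printf("after:  %.0f%% (counts {%llu, %llu})\n",
             100.0 * s0 / (s0 + s1),
             (unsigned long long)s0, (unsigned long long)s1);
      return 0;
    }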
>>
>> This scaling is not difficult at all and can be done incrementally very
>> quickly. We cannot afford to recompute all frequencies on the fly
>> because it would be detrimental to compile time. If foo() itself has
>> N callees inlined into it, each inlined callee needs to trigger a
>> re-computation. When foo() is inlined into bar(), the frequencies
>> will need to be recomputed for foo() and all N callees inlined into
>> foo().
> It really sounds like your proposal is to essentially eagerly compute
> scaling rather than lazily compute it on demand. Assuming perfect
> implementations for both (with no rounding losses), the results should
> be the same. Is that a correct restatement? I'm going to hold off on
> responding to why that's a bad idea until you confirm, because I'm not
> sure I follow what you're trying to say. :)
>
> Also, trusting exact entry counts is going to be somewhat suspect.
> These are *highly* susceptible to racy updates, overflow, etc...
> Anything which puts too much implicit trust in these numbers is going
> to be problematic.
>>
>>
>> We are definitely missing the per-function execution counts that
>> are needed to be able to compare relative “hotness” across
>> functions, and I think that would be a good place to start making
>> improvements. In the long term, we should keep our options open
>> to making major changes, but before we go there, we should try to
>> make incremental improvements to fix the existing infrastructure.
>>
>>
>> Right, and that's the core of our proposal. We don't really want to
>> make major infrastructure changes at this point. The only thing I'd
>> like to explore is making MD_prof a real count. This will be useful
>> for the inliner changes and it would also make incremental updates
>> easier, because the scaling that needs to be done is very
>> straightforward and quick.
>>
>> Note that this change should not modify the current behaviour we get
>> from profile analysis. Things that were hot before should continue to
>> be hot now.
> I have no objection to adding a mechanism for expressing an entry
> count. I am still very hesitant about the proposals with regards to
> redefining the current MD_prof.
>
> I'd encourage you to post a patch for the entry count mechanism, but
> not tie its semantics to exact execution count. (Something like "the
> value provided must correctly describe the relative hotness of this
> routine against others in the program annotated with the same
> metadata. It is the relative scaling that is important, not the
> absolute value. In particular, the value need not be an exact
> execution count.")
>>
>>
>> Many of the other issues you raise seem like they could also be
>> addressed without major changes to the existing infrastructure.
>> Let’s try to fix those first.
>>
>>
>> That's exactly the point of the proposal. We definitely don't want
>> to make major changes to the infrastructure at first. My thinking is
>> to start working on making MD_prof a real count. One of the things
>> happening now is that the combination of real profile data and the
>> frequency propagation we are currently doing is misleading the
>> analysis.
> I consider this a major change. You're trying to redefine a major
> part of the current system.
>
> Multiple people have spoken up and objected to this change (as
> currently described). Please start somewhere else.
Sorry, "objected" is too strong a word here. I should have said "have
reservations about".
My point here - which I'm not sure I expressed well - is that I'm
concerned we're going to bog down on a controversial change rather than
make progress on the parts we agree on. I'm not saying that we should
> never redefine things in the way you're proposing, but I would like to see
the parts that work under both schemes be addressed first. We can
continue this discussion in parallel.
>>
>> For example (thanks David for the code and data). In the following code:
>>
>> int g;
>> __attribute__((noinline)) void bar() {
>>   g++;
>> }
>>
>> extern int printf(const char*, ...);
>>
>> int main()
>> {
>>   int i, j, k;
>>
>>   g = 0;
>>
>>   // Loop 1.
>>   for (i = 0; i < 100; i++)
>>     for (j = 0; j < 100; j++)
>>       for (k = 0; k < 100; k++)
>>         bar();
>>
>>   printf ("g = %d\n", g);
>>   g = 0;
>>
>>   // Loop 2.
>>   for (i = 0; i < 100; i++)
>>     for (j = 0; j < 10000; j++)
>>       bar();
>>
>>   printf ("g = %d\n", g);
>>   g = 0;
>>
>>
>>   // Loop 3.
>>   for (i = 0; i < 1000000; i++)
>>     bar();
>>
>>   printf ("g = %d\n", g);
>>   g = 0;
>> }
>>
>> When compiled with profile instrumentation, frequency propagation is
>> distorting the real profile because it gives different frequencies to
>> the calls to bar() in the 3 different loops. All 3 loops execute
>> 1,000,000 times, but after frequency propagation, the first call to
>> bar() gets a weight of 520,202 in loop #1, 210,944 in loop #2 and
>> 4,096 in loop #3. In reality, every call to bar() should have a
>> weight of 1,000,000.
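As a quick sanity check on the claim above, the loop bounds in the example
really do multiply out to the same total for all three loops; a tiny
program to confirm the arithmetic:

    #include <stdio.h>

    int main(void) {
      /* Calls to bar() implied by the loop bounds in the example above. */
      printf("loop 1: %d\n", 100 * 100 * 100);  /* 1000000 */
      printf("loop 2: %d\n", 100 * 10000);      /* 1000000 */
      printf("loop 3: %d\n", 1000000);          /* 1000000 */
      return 0;
    }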
> Duncan responded to this. My conclusion from his response: this is a
> bug, not a fundamental issue. Remove the max scaling factor, switch
> the counts to 64 bits and everything should be fine. If you disagree,
> let's discuss.
>>
>>
>> Thanks. Diego.
>>
>>