[LLVMdev] RFC - Improvements to PGO profile support

Xinliang David Li xinliangli at gmail.com
Tue Mar 24 11:00:55 PDT 2015


On Tue, Mar 24, 2015 at 10:48 AM, Bob Wilson <bob.wilson at apple.com> wrote:

>
> > On Mar 24, 2015, at 10:27 AM, Xinliang David Li <davidxl at google.com>
> > wrote:
> >
> > Diego and I have discussed this based on the feedback received. We
> > have a revised plan (see Diego's last reply).  Here is a more
> > detailed recap:
> >
> > 1) keep MD_prof definition as it is today; also keep using the
> > frequency propagation as it is (assuming programs with irreducible
> > loops are not common and not important. If it turns out to be
> > otherwise, we will revisit this).
> > 2) fix all problems that lead to wrong 'frequency/count' computed from
> > the frequency propagation algorithm
> >   2.1) relax 32bit limit
> >   2.2) remove 4096 iteration count cap
> >   2.3) remove the 'Laplace rule of succession', which can be very harmful
>
> I have not seen any consensus on this. I’m not at all convinced this would
> be a good idea. From what I’ve seen, using Laplace’s rule avoids really bad
> behavior in cases where the counts are very small and has almost no impact
> when the counts are large.
>
>
This rule has a big impact on loops. On the other hand, if the count is
small, the code is usually not hot and not interesting. However, if this is
considered important, we can introduce a flag to control the behavior.

thanks,

David



> >   2.4) Fix the bug in 'BranchProbabilityInfo::calcMetadataWeights'
> > that does capping without scaling
> >   2.5) Fix a similar bug in
> > 'MachineBranchProbabilityInfo::getSumForBlock'
> >   ... other bugs found
> > 3) introduce a new metadata kind to represent function entry global
> > hotness (the implementation can choose to use the entry execution count)
> > 4) implement frequency/profile update interfaces for inline
> > transformations -- i.e., allow importing the callee's profile info into
> > the caller with a scaling factor computed from the callsite count and
> > the callee entry count
> > 5) Implement mechanism to compute, record and use profile program
> > summary data (This will be discussed separately).
> >
> >
> > thanks,
> >
> > David
> >
> >
> >
> > On Tue, Mar 24, 2015 at 9:59 AM, Philip Reames
> > <listmail at philipreames.com> wrote:
> >> On 03/10/2015 10:14 AM, Diego Novillo wrote:
> >>
> >>
> >>
> >> On Thu, Mar 5, 2015 at 11:29 AM, Bob Wilson <bob.wilson at apple.com>
> >> wrote:
> >>>
> >>>
> >>> On Mar 2, 2015, at 4:19 PM, Diego Novillo <dnovillo at google.com> wrote:
> >>>
> >>> On Thu, Feb 26, 2015 at 6:54 PM, Diego Novillo <dnovillo at google.com>
> >>> wrote:
> >>>
> >>>> I've created a few bugzilla issues with details of some of the things
> >>>> I'll be looking into. I'm not yet done wordsmithing the overall design
> >>>> document. I'll try to finish it by early next week at the latest.
> >>>
> >>>
> >>> The document is available at
> >>>
> >>>
> >>>
> https://docs.google.com/document/d/15VNiD-TmHqqao_8P-ArIsWj1KdtU-ElLFaYPmZdrDMI/edit?usp=sharing
> >>>
> >>> There are several topics covered. Ideally, I would prefer that we
> >>> discuss each topic separately. The main ones I will start working on
> >>> are the ones described in the bugzilla links we have in the doc.
> >>>
> >>> This is just a starting point for us. I am not at all concerned with
> >>> implementing exactly what is proposed in the document. In fact, if we
> >>> can get the same value using the existing support, all the better.
> >>>
> >>> OTOH, any other ideas that folks may have that work better than this
> >>> are more than welcome. I don't have really strong opinions on the
> >>> matter. I am fine with whatever works.
> >>>
> >>>
> >>> Thanks for the detailed write-up on this. Some of the issues definitely
> >>> need to be addressed. I am concerned, though, that some of the ideas
> >>> may be leading toward a scenario where we have essentially two
> >>> completely different ways of representing profile information in LLVM
> >>> IR. It is great to have two complementary approaches to collecting
> >>> profile data, but two representations in the IR would not make sense.
> >>
> >>
> >> Yeah, I don't think I'll continue to push for a new MD_count attribute.
> >> If we were to make MD_prof a "real" execution count, that would be
> >> enough. Note that by re-using MD_prof we are not changing its meaning
> >> at all. The execution count is still a weight, and the ratio is still
> >> the branch probability. All we are changing are the absolute values of
> >> the numbers, widening their data type to remove the 32-bit limitation.
> >>
> >> Independent of everything else, relaxing the 32-bit restriction is
> >> clearly a good idea.  This would make a great standalone patch.
> >>
> >>
> >>
> >>>
> >>> The first issue raised is that profile execution counts are not
> >>> represented in the IR. This was a very intentional decision. I know it
> >>> goes against what other compilers have done in the past. It took me a
> >>> while to get used to the idea when Andy first suggested it, so I know
> >>> it seems awkward at first. The advantage is that branch probabilities
> >>> are much easier to keep updated in the face of compiler
> >>> transformations, compared to execution counts.
> >>
> >>
> >> Sorry. I don't follow. Updating counts as the CFG is transformed is not
> >> difficult at all. What examples do you have in mind?  The big advantage
> >> of making MD_prof an actual execution count is that it is a meaningful
> >> metric w.r.t. scaling and transformation.
> >>
> >> Say, for instance, that we have a branch instruction with two targets
> >> with counts {100, 300} inside a function 'foo' that has entry count 2.
> >> The edge probability for the first edge (count 100) is 100/(100+300) =
> >> 25%.
> >>
> >> If we inline foo() inside another function bar() at a callsite with
> >> profile count == 1, the cloned branch instruction gets its counters
> >> scaled with the callsite count. So the new branch has counts
> >> {100 * 1 / 2, 300 * 1 / 2} = {50, 150}.  But the branch probability
> >> did not change. Currently, we are cloning the branch without changing
> >> the edge weights.
> >>
> >> This scaling is not difficult at all and can be done incrementally very
> >> quickly. We cannot afford to recompute all frequencies on the fly
> >> because it would be detrimental to compile time. If foo() itself has N
> >> callees inlined into it, each inlined callee needs to trigger a
> >> re-computation. When foo() is inlined into bar(), the frequencies will
> >> need to be recomputed for foo() and all N callees inlined into foo().
> >>
> >> It really sounds like your proposal is to essentially eagerly compute
> >> scaling rather than lazily compute it on demand.  Assuming perfect
> >> implementations for both (with no rounding losses), the results should
> >> be the same.  Is that a correct restatement?  I'm going to hold off on
> >> responding to why that's a bad idea until you confirm, because I'm not
> >> sure I follow what you're trying to say.  :)
> >>
> >> Also, trusting exact entry counts is going to be somewhat suspect.
> >> These are *highly* susceptible to racy updates, overflow, etc.
> >> Anything which puts too much implicit trust in these numbers is going
> >> to be problematic.
> >>
> >>
> >>
> >>>
> >>> We are definitely missing the per-function execution counts that are
> >>> needed to be able to compare relative “hotness” across functions, and
> >>> I think that would be a good place to start making improvements. In
> >>> the long term, we should keep our options open to making major
> >>> changes, but before we go there, we should try to make incremental
> >>> improvements to fix the existing infrastructure.
> >>
> >>
> >> Right, and that's the core of our proposal. We don't really want to
> >> make major infrastructure changes at this point. The only thing I'd
> >> like to explore is making MD_prof a real count. This will be useful
> >> for the inliner changes, and it would also make incremental updates
> >> easier, because the scaling that needs to be done is very
> >> straightforward and quick.
> >>
> >> Note that this change should not modify the current behaviour we get
> >> from profile analysis. Things that were hot before should continue to
> >> be hot now.
> >>
> >> I have no objection to adding a mechanism for expressing an entry
> >> count.  I am still very hesitant about the proposals with regards to
> >> redefining the current MD_prof.
> >>
> >> I'd encourage you to post a patch for the entry count mechanism, but
> >> not tie its semantics to exact execution count.  (Something like "the
> >> value provided must correctly describe the relative hotness of this
> >> routine against others in the program annotated with the same metadata.
> >> It is the relative scaling that is important, not the absolute value.
> >> In particular, the value need not be an exact execution count.")
> >>
> >>
> >>>
> >>> Many of the other issues you raise seem like they could also be
> >>> addressed without major changes to the existing infrastructure. Let’s
> >>> try to fix those first.
> >>
> >>
> >> That's exactly the point of the proposal.  We definitely don't want to
> >> make major changes to the infrastructure at first. My thinking is to
> >> start working on making MD_prof a real count. One of the things that
> >> are happening is that the combination of real profile data plus the
> >> frequency propagation that we are currently doing is misleading the
> >> analysis.
> >>
> >> I consider this a major change.  You're trying to redefine a major
> >> part of the current system.
> >>
> >> Multiple people have spoken up and objected to this change (as
> >> currently described).  Please start somewhere else.
> >>
> >>
> >> For example (thanks David for the code and data). In the following code:
> >>
> >> int g;
> >> __attribute__((noinline)) void bar() {
> >>   g++;
> >> }
> >>
> >> extern int printf(const char*, ...);
> >>
> >> int main()
> >> {
> >>  int i, j, k;
> >>
> >>  g = 0;
> >>
> >>  // Loop 1.
> >>  for (i = 0; i < 100; i++)
> >>    for (j = 0; j < 100; j++)
> >>       for (k = 0; k < 100; k++)
> >>           bar();
> >>
> >>  printf ("g = %d\n", g);
> >>  g = 0;
> >>
> >>  // Loop 2.
> >>  for (i = 0; i < 100; i++)
> >>    for (j = 0; j < 10000; j++)
> >>        bar();
> >>
> >>  printf ("g = %d\n", g);
> >>  g = 0;
> >>
> >>
> >>  // Loop 3.
> >>  for (i = 0; i < 1000000; i++)
> >>    bar();
> >>
> >>  printf ("g = %d\n", g);
> >>  g = 0;
> >> }
> >>
> >> When compiled with profile instrumentation, frequency propagation is
> >> distorting the real profile because it gives different frequencies to
> >> the calls to bar() in the 3 different loops. All 3 loops execute
> >> 1,000,000 times, but after frequency propagation, the first call to
> >> bar() gets a weight of 520,202 in loop #1, 210,944 in loop #2, and
> >> 4,096 in loop #3. In reality, every call to bar() should have a weight
> >> of 1,000,000.
> >>
> >> Duncan responded to this.  My conclusion from his response: this is a
> >> bug, not a fundamental issue.  Remove the max scaling factor, switch
> >> the counts to 64 bits, and everything should be fine.  If you disagree,
> >> let's discuss.
> >>
> >>
> >>
> >> Thanks.  Diego.
> >>
> >>
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> >>
> >>
>
>

