[LLVMdev] RFC - Improvements to PGO profile support

Dario Domizioli dario.domizioli at gmail.com
Fri May 22 08:16:21 PDT 2015


Hi all,

I am a bit confused by the documentation of the profile data file
format.

The Clang user guide here describes it as an ASCII text file:
http://clang.llvm.org/docs/UsersManual.html#sample-profile-format

Whereas the posts above and the referenced link describe it as a stream of
bytes containing LEB128s:
http://www.llvm.org/docs/CoverageMappingFormat.html
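
A side note for anyone unfamiliar with LEB128: it is the standard
variable-length integer encoding, 7 payload bits per byte with the high
bit set while more bytes follow. A minimal sketch of the unsigned
variant -- my own illustration, not LLVM's implementation (which lives
in llvm/Support/LEB128.h):

    #include <cstdint>
    #include <vector>

    // Append the ULEB128 encoding of Value to Out: emit 7 bits at a
    // time, setting the high bit on every byte except the last.
    void encodeULEB128(uint64_t Value, std::vector<uint8_t> &Out) {
      do {
        uint8_t Byte = Value & 0x7f;
        Value >>= 7;
        if (Value != 0)
          Byte |= 0x80; // more bytes follow
        Out.push_back(Byte);
      } while (Value != 0);
    }

For example, 624485 encodes as the bytes 0xE5 0x8E 0x26.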

From experimenting with the latest trunk I can see the latter is correct
(well, at least the file I get is not ASCII text).
Should we update the Clang user guide documentation?
Or am I just getting confused? Are there two formats, one used for coverage
and one used for PGO?
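
In case it clarifies what I am comparing: the user guide page describes
a line-based text format, along the lines of this hand-written sketch
(following the documented function_name:total_samples:total_head_samples
header syntax; not the output of a real run):

    foo:3200:25
     1: 100
     3: 2000
     5: 60 _Z3bari:60

whereas the file I get from instrumentation is clearly binary.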

Cheers,
    Dario Domizioli
    SN Systems - Sony Computer Entertainment Group




On 7 May 2015 at 16:43, Bob Wilson <bob.wilson at apple.com> wrote:

>
> > On May 7, 2015, at 12:55 AM, Hayden Livingston <halivingston at gmail.com> wrote:
> >
> > Can you tell us if you're continuing to use the same approach as
> > described in one of the LLVM meetings, i.e. instrument at the clang
> > AST level?
>
> Yes, that is the approach we’re using.
>
> >
> > Also, do you generate GCOV files, some YAML, or is this a separate
> > format?
>
> It is a separate format. The code for reading/writing profile data is in
> compiler-rt/lib/profile/.
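
For reference, this is my understanding of the round trip with the
instrumentation-based format (the flags are the documented clang ones;
the file names are made up):

    clang -O2 -fprofile-instr-generate code.c -o code
    LLVM_PROFILE_FILE=code.profraw ./code
    llvm-profdata merge -output=code.profdata code.profraw
    clang -O2 -fprofile-instr-use=code.profdata code.c -o code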
>
> There is also a separate format for mapping the profile data back to
> source locations for code coverage testing. Details here:
> http://www.llvm.org/docs/CoverageMappingFormat.html
>
> >
> > And finally, in the meeting you described how you assign counters to
> > the blocks, using an algorithm to minimize the number of insertions.
> > Is that algorithm a well-known one or a custom one? Is it described
> > somewhere?
>
> It is a custom approach. I don’t think we have a written description but
> the code is pretty straightforward. Look at ComputeRegionCounts in clang’s
> lib/CodeGen/CodeGenPGO.cpp source file.
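
As a concrete illustration of the counter minimization involved (my own
sketch of the general idea, not the actual ComputeRegionCounts logic):
for an if-statement it is enough to instrument one arm and derive the
other arm's count by subtraction from the parent region's count.

    #include <cstdint>

    // Hypothetical helper: given the count of executions reaching the
    // 'if' and an instrumented counter for the 'then' arm, the 'else'
    // arm needs no counter of its own.
    uint64_t elseCount(uint64_t parentCount, uint64_t thenCount) {
      return parentCount - thenCount;
    }

Applied recursively over the AST, this keeps the number of inserted
counters well below one per branch target.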
>
> >
> > On Wed, Mar 25, 2015 at 10:47 PM, Xinliang David Li <davidxl at google.com> wrote:
> >> Bob,
> >>
> >>> Which workload is better? I don’t at all trust users to get this
> >>> right, at least for real, non-benchmark code.
> >>
> >> We do have a lot of users (real world apps) who can get this right --
> >> I am not joking ;)
> >>
> >>>
> >>>
> >>> Without the rule, the two workloads at least produce consistent
> >>> profile data.  With the Laplace rule, you get 50 in one case and 66
> >>> in the other.
> >>>
> >>>
> >>> Yes, but you’ve got more information in one case than the other. This
> >>> is a feature IMO, not a bug. It’s entirely possible that with workload
> >>> 2, the loop may have executed for a drastically different number of
> >>> iterations. The fact that it did not, i.e., that it was consistent
> >>> with workload 1, is more information that you did not have before. It
> >>> makes sense for the compiler to be more aggressive when it has more
> >>> data.
> >>
> >> But the decision made by the compiler is arbitrary and not necessarily
> >> correct.  For instance, the single run used in the training may have
> >> actually executed far fewer iterations than average. With the
> >> Laplace rule, the iteration count becomes even smaller. My point is
> >> that there is no way for the compiler to tell how good the data is,
> >> nor is the compiler in a good position to make that judgement.  By
> >> doing so, the users who carefully prune their workload to reduce
> >> runtime get punished for no reason.
> >>
> >>
> >>>
> >>>
> >>> Having some technology to improve confidence in the profile data is
> >>> fine, but I don't see
> >>> 1) how the Laplace rule is good for it
> >>>
> >>> What do you not understand about it? As the counts get larger,
> >>> Laplace’s rule fades into the noise. It only makes a difference for
> >>> cases where some of the counts are *very* small, and in those cases,
> >>> it very simply adjusts the weights to make optimizations less
> >>> aggressive.
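
To make the effect concrete (my own arithmetic, assuming the rule simply
adds one to each branch count before forming the weights): counts {1, 0}
become weights {2, 1}, turning a branch that was never taken into a 2:1
guess, while {1000, 0} become {1001, 1}, essentially unchanged. For a
loop whose back edge was taken 100 times against 1 exit, the smoothed
back-edge probability is 101/103, giving an expected trip count of about
103/2 ≈ 51 instead of 101; running the same workload twice gives
201/204 and a trip count of about 68. Those appear to be the "50 in one
case and 66 in the other" from earlier in the thread.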
> >>
> >> Strictly speaking, in a loop context, it just makes optimizations
> >> assume shorter trip counts.
> >>
> >>>
> >>> 2) why this cannot be done on the consumer side (i.e., faithfully
> >>> record the profile data).
> >>>
> >>> What does this have to do with how faithfully the profile is recorded?
> >>> We’ve got fully accurate data, but if the profiling inputs are too
> >>> small or not representative, you may still get poor optimization
> >>> choices.
> >>
> >> The point is that there is no need to adjust the weights. It is very
> >> easy to check the loop header's profile count to determine how much
> >> confidence you want to give it (possibly controlled with a flag).
> >> Control at that point is more fine-grained than blindly changing the
> >> weights right after reading the profile data.
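
A sketch of what such a consumer-side check might look like (purely
hypothetical code; trustLoopProfile and its threshold are my invention,
not an existing LLVM hook):

    #include <cstdint>

    // Decide how much to trust a loop's profile from the raw header
    // count; the threshold could be exposed as a flag.
    bool trustLoopProfile(uint64_t loopHeaderCount,
                          uint64_t minTrustedCount = 100) {
      return loopHeaderCount >= minTrustedCount;
    }

The raw count is still available at that point, whereas weights that
were smoothed at profile-read time cannot be un-smoothed later.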
> >>
> >>>
> >>>
> >>>
> >>>
> >>> 2) result in bad inlining decisions. For instance:
> >>>  for (...)
> >>>      bar();  // (1)
> >>>
> >>> where (1) is the only callsite to bar().  Using the rule, the count
> >>> of the BB enclosing the call to bar() can be as low as half of the
> >>> entry count of bar().  The inliner will get confused, think there are
> >>> other hot callsites to 'bar', and make suboptimal decisions ...
> >>>
> >>> Also, if bar has calls to other functions, those callsites will look
> >>> hotter than the call to 'bar' ...
> >>>
> >>>
> >>> Your own proposal for recording entry counts is to record “relative
> >>> hotness”, not absolute profile counts.
> >>>
> >>> The proposal is to record 'global hotness' that can be used to compare
> >>> relative hotness across procedure boundaries (e.g. callsites in
> >>> different callers). Profile counts satisfy this condition.
> >>>
> >>> On the caller’s side, we’ve got a branch weight influenced by
> >>> Laplace’s rule that is then used to compute BlockFrequency, and you’re
> >>> concerned about a mismatch between that and the “relative hotness”
> >>> recorded for the callee??
> >>>
> >>>
> >>> Basically, say the caller is test():
> >>>
> >>> bar() {
> >>>   // ENTRY count = 100 (from profile data)
> >>>   // ENTRY freq = 1
> >>>
> >>>   // BB2: Freq(BB2) = 1, count = 100
> >>>   foo();            // (2)
> >>> }
> >>>
> >>> test() {
> >>>   // ENTRY count = 1 (from profile data)
> >>>   // ENTRY freq = 1
> >>>   for (i = 0; i < 100; i++) {
> >>>     // BB1: Freq(BB1) = 50 due to the Laplace rule
> >>>     bar();  // Freq = 50, count = 50    (1)
> >>>   }
> >>> }
> >>>
> >>> With the Laplace rule, the block freq computed for the BB enclosing
> >>> the call to bar will be wrong -- as a result, that BB's count will be
> >>> wrong too: 50*1/1 = 50.
> >>>
> >>> The global hotness of call sites (1) & (2) should be the same, but it
> >>> is distorted when the Laplace rule is used.
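
Spelling out the arithmetic in this example: without the Laplace rule,
Freq(BB1) would be 100, so count(BB1) = entry_count(test) * Freq(BB1) /
Freq(entry) = 1 * 100 / 1 = 100, matching bar's recorded entry count of
100. With the rule, Freq(BB1) is 50, so call site (1) looks only half as
hot as call site (2) even though they execute the same number of times.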
> >>>
> >>> Yes, we care about using PGO across routine boundaries for IPO.
> >>>
> >>>
> >>> I understand the issue, but my point was that you should simply not
> >>> do that. You’re objecting to Laplace’s rule based on a hypothetical
> >>> comparison of block frequencies vs. entry counts. There is nothing in
> >>> LLVM that does that now. We don’t even have entry counts.
> >>
> >> I am not sure what you mean by 'hypothetical comparison of block
> >> frequencies vs entry counts', but it does not seem to be what I mean.
> >> What I mean is:
> >>
> >> 1) We need a metric to represent global hotness. Profile (execution)
> >> count fits the bill.
> >> 2) There are two ways to compute profile counts for BBs:
> >>   a) Directly compute them from the edge counts recorded in the
> >> profile data (BB frequency can be directly scaled from them), but this
> >> requires slightly changing MD_prof's meaning or introducing MD_count
> >> to record edge counts without capping/scaling.
> >>
> >>   b) Just record the entry profile count (a minimal change), but do
> >> not change MD_prof. This approach reuses block frequency
> >> propagation, but the latter relies on unaltered branch
> >> probabilities/weights in order to precisely recompute the counts
> >> (combined with the entry count).
> >>
> >> Since people have concerns about a), we chose b). For b), I merely
> >> pointed out in the above example that with the Laplace rule, the
> >> recomputed profile count at the only callsite of 'bar' can be greatly
> >> different from the recorded entry profile count of 'bar'.  The
> >> incoming callsites' profile distribution can be a good signal for
> >> inlining decisions, so such a difference will be bad.
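
A minimal sketch of the recomputation in b), as I understand it
(blockCount is a hypothetical helper, not an existing LLVM API):

    #include <cstdint>

    // Recompute a block's profile count from the recorded function
    // entry count and the propagated relative block frequencies.
    uint64_t blockCount(uint64_t entryCount, uint64_t blockFreq,
                        uint64_t entryFreq) {
      return entryCount * blockFreq / entryFreq;
    }

In the test()/bar() example above, blockCount(1, 50, 1) == 50 at the
call site, against a recorded entry count of 100 for bar() -- exactly
the mismatch being described.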
> >>
> >>>
> >>> I don’t see how you can argue that Laplace’s rule is bad because it
> >>> could affect an apples vs. oranges comparison of something that does
> >>> not even exist yet.
> >>>
> >>
> >> Of course, PGO support for IPA is exactly the missing (and very
> >> important) piece we plan to add -- if it already existed, there would
> >> be no problems.
> >>
> >> thanks,
> >>
> >> David
> >>
> >>
> >>>
> >>>
> >>>
> >>> Attached are two cases, as well as the frequency graphs computed
> >>> today (with the Laplace rule) and the correct expected frequencies.
> >>>
> >>>
> >>> I’d be a lot more interested to see a real-world example.
> >>>
> >>>
> >>> See my reply above. On the other hand, I'd like to see examples where
> >>> the Laplace rule actually helps improve profile data quality.
> >>>
> >>>
> >>> It’s not about improving the results -- it’s about preventing clang
> >>> from being overly aggressive about optimizing based on limited
> >>> profile data.