[LLVMdev] RFC - Improvements to PGO profile support

Xinliang David Li davidxl at google.com
Thu May 7 10:02:15 PDT 2015


On Thu, May 7, 2015 at 9:57 AM, Hayden Livingston
<halivingston at gmail.com> wrote:
> Thanks Bob and David.
>
> Bob, was GCOV skipped because of any specific reason?
>
> David, when you say you have a late instrumentation pass, is that done
> at the IR level that is already optimized and therefore changed?

Yes, it is at the IR level, but only after some early optimizations.

> How are you mapping back to the source code?

It uses CFG-based matching. It is not used for code coverage
purposes, so source mapping is less important.

David


>
> On Thu, May 7, 2015 at 9:08 AM, Xinliang David Li <davidxl at google.com> wrote:
>> On Thu, May 7, 2015 at 8:43 AM, Bob Wilson <bob.wilson at apple.com> wrote:
>>>
>>>> On May 7, 2015, at 12:55 AM, Hayden Livingston <halivingston at gmail.com> wrote:
>>>>
>>>> Can you tell us if you're continuing to use the same approach as
>>>> described in one of the LLVM meetings, i.e. instrument at the clang
>>>> AST level?
>>>
>>> Yes, that is the approach we’re using.
>>>
>>
>> We are still collecting more data, but since the question is asked, I
>> will share our plan now.
>>
>> Instrumentation in the Clang AST is too slow for large C++ applications
>> (the data and explanation will be sent out in a different thread). To
>> solve that problem, we have introduced a 'late' instrumentation pass
>> that is done in LLVM. The late instrumentation uses the exact same
>> runtime support as the current frontend-based instrumentation, but
>> assigns edge counters based on minimal spanning trees. We have seen
>> very promising results. Note that this is not intended to replace the
>> current front-end based instrumentation, but to serve as an alternative
>> under an option.
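For readers unfamiliar with spanning-tree counter placement: the idea (in the style of Knuth/Ball-Larus optimal profiling) is to instrument only the edges that are *not* on a spanning tree of the CFG and recover the remaining counts by flow conservation. A minimal sketch on a toy diamond CFG (a hypothetical illustration with made-up edge names, not the actual LLVM pass):

```python
# Recover uninstrumented edge counts by flow conservation on a small
# diamond CFG: entry->A, A->B, A->C, B->D, C->D, D->exit.
# Only two edges carry counters; the other four are derived.

edges = [("entry", "A"), ("A", "B"), ("A", "C"),
         ("B", "D"), ("C", "D"), ("D", "exit")]
measured = {("entry", "A"): 100, ("A", "C"): 30}

def recover(edges, measured):
    counts = dict(measured)
    changed = True
    while changed:
        changed = False
        # Kirchhoff-style rule: at every internal node, the sum of
        # incoming counts equals the sum of outgoing counts, so a node
        # with exactly one unknown incident edge determines it.
        for node in {n for e in edges for n in e} - {"entry", "exit"}:
            ins = [e for e in edges if e[1] == node]
            outs = [e for e in edges if e[0] == node]
            for known, unknown in ((ins, outs), (outs, ins)):
                missing = [e for e in unknown if e not in counts]
                if (len(missing) == 1
                        and all(e in counts for e in known)):
                    counts[missing[0]] = (sum(counts[e] for e in known)
                        - sum(counts[e] for e in unknown if e in counts))
                    changed = True
    return counts

counts = recover(edges, measured)
print(counts[("A", "B")], counts[("B", "D")],
      counts[("C", "D")], counts[("D", "exit")])
# A->B = 70, B->D = 70, C->D = 30, D->exit = 100
```

With E edges and N nodes, only E - N + 1 counters (plus the function entry count) are needed, which is where the instrumentation-overhead savings come from.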
>>
>> thanks,
>>
>> david
>>
>>
>>>>
>>>> Also, do you generate GCOV files, some yaml, or is this a separate format?
>>>
>>> It is a separate format. The code for reading/writing profile data is in compiler-rt/lib/profile/.
>>>
>>> There is also a separate format for mapping the profile data back to source locations for code coverage testing. Details here: http://www.llvm.org/docs/CoverageMappingFormat.html
>>>
>>>>
>>>> And finally in the meeting you had given how you assign counters to
>>>> the blocks, an algorithm to minimize the number of insertions. Is that
>>>> algorithm a well-known one or a custom one? Is that described
>>>> somewhere?
>>>
>>> It is a custom approach. I don’t think we have a written description but the code is pretty straightforward. Look at ComputeRegionCounts in clang’s lib/CodeGen/CodeGenPGO.cpp source file.
>>>
>>>>
>>>> On Wed, Mar 25, 2015 at 10:47 PM, Xinliang David Li <davidxl at google.com> wrote:
>>>>> Bob,
>>>>>
>>>>>> Which workload is better? I don’t at all trust users to get this right, at
>>>>>> least for real, non-benchmark code.
>>>>>
>>>>> We do have a lot of users (real world apps) who can get this right --
>>>>> I am not joking ;)
>>>>>
>>>>>>
>>>>>>
>>>>>> Without the rule, the two workloads at least produce consistent
>>>>>> profile data.  With the Laplace rule, you get 50 in one case and 66
>>>>>> in the other.
>>>>>>
>>>>>>
>>>>>> Yes, but you’ve got more information in one case than the other. This is a
>>>>>> feature IMO, not a bug. It’s entirely possible that with workload 2, the
>>>>>> loop may have executed for a drastically different number of iterations. The
>>>>>> fact that it did not, i.e., that it was consistent with workload 1, is more
>>>>>> information that you did not have before. It makes sense for the compiler to
>>>>>> be more aggressive when it has more data.
>>>>>
>>>>> But the decision by the compiler is arbitrary and not necessarily
>>>>> correct.  For instance, the single run used in training may have
>>>>> actually executed far fewer iterations than average. With the
>>>>> Laplace rule, the iteration count becomes even smaller. My point is
>>>>> that there is no way for the compiler to tell how good the data is,
>>>>> nor is the compiler in a good position to make that judgement. By
>>>>> doing so, users who carefully prune their workloads to reduce
>>>>> runtime get punished for no reason.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Having some technology to improve confidence of the profile data is
>>>>>> fine, but I don't see
>>>>>> 1) how the Laplace rule is good for it
>>>>>>
>>>>>> What do you not understand about it? As the counts get larger, LaPlace’s
>>>>>> rule fades into the noise. It only makes a difference for cases where some
>>>>>> of the counts are *very* small, and in those cases it very simply adjusts
>>>>>> the weights to make optimizations less aggressive.
>>>>>
>>>>> Strictly speaking, in a loop context, it just causes optimizations to
>>>>> assume shorter trip counts.
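A small numeric sketch of the effect being debated here: adding 1 to each branch count (the Laplace adjustment) shortens the inferred trip count when the data is sparse, and fades away as counts grow. This is my own illustration of the arithmetic, not clang's exact code:

```python
def est_trip_count(backedge, exits, laplace=True):
    """Expected loop trip count = 1 / P(exit), from raw branch counts."""
    if laplace:
        # Laplace smoothing: add one phantom execution to each branch.
        backedge, exits = backedge + 1, exits + 1
    return (backedge + exits) / exits

# One training run, 100 iterations: raw counts say ~101 trips,
# the smoothed estimate is roughly half of that.
print(est_trip_count(100, 1, laplace=False))  # 101.0
print(est_trip_count(100, 1))                 # 51.5
# With large counts the +1 adjustment is negligible.
print(round(est_trip_count(100000, 1000)))    # 101
```

This makes both positions concrete: the adjustment only matters when counts are tiny, but when it does matter it biases loops toward shorter assumed trip counts.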
>>>>>
>>>>>>
>>>>>> 2) why this cannot be done on the consumer side (i.e., faithfully
>>>>>> record the profile data).
>>>>>>
>>>>>>
>>>>>> What does this have to do with how faithfully the profile is recorded? We’ve
>>>>>> got fully accurate data, but if the profiling inputs are too small or not
>>>>>> representative, you may still get poor optimization choices.
>>>>>
>>>>> The point is that there is no need to adjust the weights. It is very
>>>>> easy to check the loop header's profile count to determine how much
>>>>> confidence to give it (possibly controlled with a flag). Control at
>>>>> this level is more fine-grained than blindly changing the weights
>>>>> right after reading the profile data.
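The consumer-side check described here might look roughly like the following. The threshold, names, and fallback behavior are all hypothetical, purely to illustrate the idea of gating confidence on the header count instead of pre-adjusting the weights:

```python
# Hypothetical flag-controlled threshold for trusting profile data.
MIN_HEADER_COUNT = 100

def trusted_trip_count(header_count, preheader_count):
    """Trust the profiled trip count only when the loop header was
    executed often enough; otherwise report no estimate so the
    consumer can fall back to static heuristics."""
    if header_count < MIN_HEADER_COUNT:
        return None  # too little data to be confident
    return header_count / max(preheader_count, 1)

print(trusted_trip_count(100000, 1000))  # 100.0 -- enough data, use it
print(trusted_trip_count(10, 1))         # None  -- too sparse, ignore
```

The recorded counts stay untouched; each consumer decides how aggressive to be, which is the fine-grained control the paragraph above argues for.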
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2) result in bad inlining decisions. For instance:
>>>>>>  for (...)
>>>>>>      bar();  // (1)
>>>>>>
>>>>>> where (1) is the only callsite to bar().  Using the rule, the count of
>>>>>> the BB enclosing the call to bar() can be as low as half of the entry
>>>>>> count of bar().  The inliner will get confused, think there are other
>>>>>> hot callsites to 'bar', and make suboptimal decisions ..
>>>>>>
>>>>>> Also if bar has calls to other functions, those callsites will look
>>>>>> hotter than the call to 'bar' …
>>>>>>
>>>>>>
>>>>>> Your own proposal for recording entry counts is to record “relative
>>>>>> hotness”, not absolute profile counts.
>>>>>>
>>>>>>
>>>>>> The proposal is to record 'global hotness' that can be used to compare
>>>>>> relative hotness across procedure boundaries (e.g., callsites in
>>>>>> different callers). Profile counts satisfy this condition.
>>>>>>
>>>>>> On the caller’s side, we’ve got a branch weight influenced by LaPlace’s rule
>>>>>> that is then used to compute BlockFrequency and you’re concerned about a
>>>>>> mismatch between that the “relative hotness” recorded for the callee??
>>>>>>
>>>>>>
>>>>>> Basically, say the caller is test()
>>>>>>
>>>>>> bar(){
>>>>>> // ENTRY count =  100 (from profile data)
>>>>>> // ENTRY freq = 1
>>>>>>
>>>>>> // BB2: Freq(BB2) = 1, count = 100
>>>>>> foo ();              (2)
>>>>>> }
>>>>>>
>>>>>>
>>>>>> test() {
>>>>>>  // ENTRY count = 1 (from profile data)
>>>>>>  // Entry Freq = 1
>>>>>>  for (i = 0; i < 100; i++) {
>>>>>>      // BB1: Freq(BB1) = 50 due to Laplace rule
>>>>>>      bar();  // Freq = 50, count = 50    (1)
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> With the Laplace rule, the block frequency computed for bar's enclosing
>>>>>> BB will be wrong -- and as a result, that BB's count will be wrong
>>>>>> too: 50*1/1 = 50.
>>>>>>
>>>>>> The global hotness of callsites (1) & (2) should be the same, but is
>>>>>> distorted when the Laplace rule is used.
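The arithmetic in the example above follows from the scaling formula count(BB) = entry_count * freq(BB) / freq(entry). A small sketch of that computation (my own illustration of the numbers in the example, not LLVM code):

```python
def bb_count(entry_count, bb_freq, entry_freq=1.0):
    """Recompute a block's profile count by scaling the function's
    recorded entry count by the block's relative frequency."""
    return entry_count * bb_freq / entry_freq

# Caller test(): entry count 1; the Laplace-smoothed frequency of the
# loop body is ~50, so callsite (1) is recomputed as count 50.
callsite_1 = bb_count(entry_count=1, bb_freq=50)
# Callee bar(): recorded entry count is 100, so callsite (2) inside
# bar() gets count 100 -- twice as hot as bar()'s only caller at (1).
callsite_2 = bb_count(entry_count=100, bb_freq=1)
print(callsite_1, callsite_2)  # 50.0 100.0
```

With unaltered branch weights the loop-body frequency would be 100 and the two callsites would agree, which is the consistency being argued for.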
>>>>>>
>>>>>> Yes, we care about using PGO across routine boundaries for IPO.
>>>>>>
>>>>>>
>>>>>> I understand the issue, but my point was that you should simply not do that.
>>>>>> You’re objecting to LaPlace’s rule based on a hypothetical comparison of
>>>>>> block frequencies vs. entry counts. There is nothing in LLVM that does that
>>>>>> now. We don’t even have entry counts.
>>>>>
>>>>> I am not sure what you mean by 'hypothetical comparison of block
>>>>> frequencies vs entry counts', but it does not seem to be what I mean.
>>>>> What I mean is that
>>>>>
>>>>> 1) We need a metric to represent global hotness. Profile (execution)
>>>>> count fits the bill
>>>>> 2) There are two ways to compute profile count for BBs
>>>>>   a) directly compute it from the edge counts recorded in the profile
>>>>> data (BB frequency can then be directly scaled from it), but this change
>>>>> requires slightly changing MD_prof's meaning or introducing MD_count
>>>>> to record edge counts without capping/scaling.
>>>>>
>>>>>   b) Just record the entry profile count (a minimal change), but do
>>>>> not change MD_prof. This approach reuses block frequency
>>>>> propagation, but the latter relies on unaltered branch
>>>>> probabilities/weights in order to recompute the count precisely
>>>>> (combined with the entry count).
>>>>>
>>>>> Since people have concerns about a), we chose b). For b), I merely
>>>>> pointed out in the above example that with the Laplace rule, the
>>>>> recomputed profile count at the only callsite of 'bar' can differ
>>>>> greatly from the recorded entry profile count of 'bar'.  The incoming
>>>>> callsites' profile distribution can be a good signal for inlining
>>>>> decisions, so such a difference will be bad.
>>>>>
>>>>>>
>>>>>> I don’t see how you can argue that LaPlace’s rule is bad because it could
>>>>>> affect an apples vs. oranges comparison of something that does not even
>>>>>> exist yet.
>>>>>>
>>>>>
>>>>> Of course, PGO support for IPA is exactly the missing (and very
>>>>> important) piece we plan to add -- if it already existed, there would
>>>>> be no problem.
>>>>>
>>>>> thanks,
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Attached are two cases, as well as the frequency graphs computed
>>>>>> today (with the Laplace rule) and the correct expected frequencies.
>>>>>>
>>>>>>
>>>>>> I’d be a lot more interested to see a real-world example.
>>>>>>
>>>>>>
>>>>>> See my reply above. On the other hand, I'd like to see examples where
>>>>>> LaPlace Rule can actually help improve the profile data quality.
>>>>>>
>>>>>>
>>>>>> It’s not about improving the results — it’s about preventing clang from
>>>>>> being overly aggressive about optimizing based on limited profile data.
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>



