[LLVMdev] RFC - Improvements to PGO profile support
Bob Wilson
bob.wilson at apple.com
Thu May 7 08:43:35 PDT 2015
> On May 7, 2015, at 12:55 AM, Hayden Livingston <halivingston at gmail.com> wrote:
>
> Can you tell us if you're continuing to use the same approach as
> described in one of the LLVM meetings, i.e. instrument at the clang
> AST level?
Yes, that is the approach we’re using.
>
> Also, do you generate GCOV files, some yaml, or is this a separate format?
It is a separate format. The code for reading/writing profile data is in compiler-rt/lib/profile/.
There is also a separate format for mapping the profile data back to source locations for code coverage testing. Details here: http://www.llvm.org/docs/CoverageMappingFormat.html
>
> And finally in the meeting you had given how you assign counters to
> the blocks, an algorithm to minimize the number of insertions. Is that
> algorithm a well-known one or a custom one? Is that described
> somewhere?
It is a custom approach. I don’t think we have a written description, but the code is pretty straightforward. Look at ComputeRegionCounts in clang’s lib/CodeGen/CodeGenPGO.cpp source file.
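For a rough feel of the kind of trick involved (this is just a sketch of the general idea using flow conservation, not the actual ComputeRegionCounts logic), note that many counts can be derived instead of instrumented:

  // Sketch only, not Clang's algorithm: with one counter on the entry
  // region and one on the "then" region, the "else" count needs no
  // counter of its own, because counts obey flow conservation.
  #include <cstdint>
  #include <cstdio>

  uint64_t Counters[2] = {0, 0};

  int abs_val(int x) {
    ++Counters[0];            // counter for the function entry region
    if (x < 0) {
      ++Counters[1];          // counter for the "then" region
      return -x;
    }
    return x;                 // "else" count = Counters[0] - Counters[1]
  }

  int main() {
    abs_val(-3); abs_val(5); abs_val(7);
    std::printf("entry=%llu then=%llu else(derived)=%llu\n",
                (unsigned long long)Counters[0],
                (unsigned long long)Counters[1],
                (unsigned long long)(Counters[0] - Counters[1]));
  }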
>
> On Wed, Mar 25, 2015 at 10:47 PM, Xinliang David Li <davidxl at google.com> wrote:
>> Bob,
>>
>>> Which workload is better? I don’t at all trust users to get this right, at
>>> least for real, non-benchmark code.
>>
>> We do have a lot of users (real-world apps) who can get this right --
>> I am not joking ;)
>>
>>>
>>>
>>> Without the rule, the two workloads at least produce consistent
>>> profile data. With the Laplace rule, you get 50 in one case and 66
>>> in the other.
>>>
>>>
>>> Yes, but you’ve got more information in one case than the other. This is a
>>> feature IMO, not a bug. It’s entirely possible that with workload 2, the
>>> loop may have executed for a drastically different number of iterations. The
>>> fact that it did not, i.e., that it was consistent with workload 1, is more
>>> information that you did not have before. It makes sense for the compiler to
>>> be more aggressive when it has more data.
>>
>> But the decision by the compiler is arbitrary and not necessarily
>> correct. For instance, the single run used in the training may have
>> actually executed far fewer iterations than average. With the
>> Laplace rule, the iteration count becomes even smaller. My point is
>> that there is no way for the compiler to tell how good the data is,
>> nor is the compiler in a good position to make that judgement. By
>> doing so, the users who carefully prune their workloads to reduce
>> runtime get punished for no reason.
>>
>>
>>>
>>>
>>> Having some technology to improve the confidence of the profile data is
>>> fine, but I don't see
>>> 1) how the Laplace rule is good for it
>>> What do you not understand about it? As the counts get larger, Laplace’s
>>> rule fades into the noise. It only makes a difference in cases where some
>>> of the counts are *very* small, and in those cases, it simply adjusts
>>> the weights to make optimizations less aggressive.
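To illustrate the smoothing numerically (a quick sketch; the (s+1)/(n+2) form assumes a two-way branch where one is added to each arm's weight):

  #include <cstdio>

  // Probability of the taken edge of a two-way branch after adding one
  // to each arm's count (Laplace's rule of succession): (s+1) / (n+2).
  double smoothed(unsigned taken, unsigned total) {
    return double(taken + 1) / double(total + 2);
  }

  int main() {
    // Small counts: the estimate is pulled noticeably toward 1/2.
    std::printf("3/4       -> raw %.4f, smoothed %.4f\n",
                3.0 / 4, smoothed(3, 4));
    // Large counts: the smoothing fades into the noise.
    std::printf("3000/4000 -> raw %.4f, smoothed %.4f\n",
                3000.0 / 4000, smoothed(3000, 4000));
  }

With 3 of 4, the estimate moves from 0.75 to about 0.67; with 3000 of 4000 it stays at about 0.75.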
>>
>> Strictly speaking, in a loop context, it just makes optimizations
>> assume shorter trip counts.
>>
>>>
>>> 2) why this cannot be done on the consumer side (i.e., faithfully
>>> record the profile data).
>>>
>>>
>>> What does this have to do with how faithfully the profile is recorded? We’ve
>>> got fully accurate data, but if the profiling inputs are too small or not
>>> representative, you may still get poor optimization choices.
>>
>> The point is that there is no need to adjust the weights. It is very
>> easy to check the loop header's profile count to determine how much
>> confidence to give it (possibly controlled by a flag). Control at
>> that level is more fine-grained than blindly changing the
>> weights right after reading the profile data.
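(For illustration, a consumer-side check along these lines might look like the sketch below; the names and the threshold are hypothetical, not an existing LLVM API.)

  #include <cstdint>
  #include <optional>

  constexpr uint64_t MinTrustedHeaderCount = 100; // could be a flag

  // Hypothetical sketch: only trust a profile-derived trip-count
  // estimate when the loop header executed often enough for the
  // estimate to be meaningful.
  std::optional<uint64_t>
  trustedTripCount(uint64_t headerCount, uint64_t preheaderCount) {
    if (headerCount < MinTrustedHeaderCount || preheaderCount == 0)
      return std::nullopt; // too little data: use static heuristics
    return headerCount / preheaderCount; // average iterations per entry
  }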
>>
>>>
>>>
>>>
>>>
>>> 2) result in bad inlining decisions. For instance:
>>>   for (...)
>>>     bar();  // (1)
>>>
>>> where (1) is the only callsite to bar(). Using the rule, the count of the
>>> BB enclosing the call to bar() can be as low as half of the entry count
>>> of bar(). The inliner will get confused, think there are more hot
>>> callsites to 'bar', and make suboptimal decisions ...
>>>
>>> Also, if bar has calls to other functions, those callsites will look
>>> hotter than the call to 'bar' …
>>>
>>>
>>> Your own proposal for recording entry counts is to record “relative
>>> hotness”, not absolute profile counts.
>>>
>>>
>>> The proposal is to record 'global hotness' that can be used to compare
>>> relative hotness across procedural boundaries (e.g. callsites in
>>> different callers). Profile counts satisfy this condition.
>>>
>>> On the caller’s side, we’ve got a branch weight influenced by Laplace’s rule
>>> that is then used to compute BlockFrequency, and you’re concerned about a
>>> mismatch between that and the “relative hotness” recorded for the callee?
>>>
>>>
>>> Basically, say the caller is test()
>>>
>>> bar() {
>>>   // ENTRY count = 100 (from profile data)
>>>   // ENTRY freq = 1
>>>
>>>   // BB2: Freq(BB2) = 1, count = 100
>>>   foo();  // (2)
>>> }
>>>
>>>
>>> test() {
>>>   // ENTRY count = 1 (from profile data)
>>>   // ENTRY freq = 1
>>>   for (i = 0; i < 100; i++) {
>>>     // BB1: Freq(BB1) = 50 due to the Laplace rule
>>>     bar();  // Freq = 50, count = 50  (1)
>>>   }
>>> }
>>>
>>> With the Laplace rule, the block freq computed for bar's enclosing BB will
>>> be wrong -- as a result, the count of bar's enclosing BB will be wrong
>>> too: 50 * 1/1 = 50.
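Spelling out the arithmetic behind the 50 (my reconstruction of the numbers in this example): per entry of test(), the latch's raw edge counts are 99 (backedge) and 1 (exit), giving a raw expected trip count of 100; after one is added to each arm the weights become 100:2 and the estimate drops to 51:

  #include <cstdio>

  // Expected iterations per loop entry for a latch branch with the
  // given backedge/exit weights: 1 / (1 - p_backedge) = (back+exit)/exit.
  double expectedTrips(double back, double exit) {
    return (back + exit) / exit;
  }

  int main() {
    std::printf("raw:      %.0f iterations\n", expectedTrips(99, 1));  // 100
    std::printf("smoothed: %.0f iterations\n", expectedTrips(100, 2)); //  51
  }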
>>>
>>> The global hotness of callsites (1) & (2) should be the same, but it is
>>> distorted when the Laplace rule is used.
>>>
>>> Yes, we care about using PGO across routine boundaries for IPO.
>>>
>>>
>>> I understand the issue, but my point was that you should simply not do that.
>>> You’re objecting to Laplace’s rule based on a hypothetical comparison of
>>> block frequencies vs. entry counts. There is nothing in LLVM that does that
>>> now. We don’t even have entry counts.
>>
>> I am not sure what you mean by 'hypothetical comparison of block
>> frequencies vs entry counts', but that does not seem to be what I mean.
>> What I mean is:
>>
>> 1) We need a metric to represent global hotness. Profile (execution)
>> count fits the bill
>> 2) There are two ways to compute profile counts for BBs:
>> a) Directly compute them from the edge counts recorded in the profile
>> data (BB frequency can then be directly scaled from them), but this
>> change requires slightly changing MD_prof's meaning or introducing
>> MD_count to record edge counts without capping/scaling.
>>
>> b) Just record the entry profile count (a minimal change) and do
>> not change MD_prof. This approach reuses block frequency
>> propagation, but the latter relies on unaltered branch
>> probabilities/weights in order to precisely recompute the counts
>> (combined with the entry count).
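Concretely, the relation b) relies on is the standard scaling (a sketch of the relation, not actual LLVM code):

  #include <cstdint>

  // Under approach b), a block's profile count is recovered by scaling
  // its block frequency by the recorded entry count:
  //   count(BB) = entry_count * freq(BB) / freq(entry)
  // Any distortion of the branch weights (e.g. from Laplace smoothing)
  // therefore propagates directly into the recomputed counts.
  uint64_t recomputedCount(uint64_t entryCount, uint64_t bbFreq,
                           uint64_t entryFreq) {
    return entryCount * bbFreq / entryFreq;
  }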
>>
>> Since people have concerns about a), we chose b). For b), I merely
>> pointed out in the above example that with the Laplace rule, the
>> recomputed profile count at the only callsite of 'bar' can be greatly
>> different from bar's recorded entry profile count. The profile
>> distribution across incoming callsites can be a good signal for
>> inlining decisions, so such a difference would be bad.
>>
>>>
>>> I don’t see how you can argue that Laplace’s rule is bad because it could
>>> affect an apples vs. oranges comparison of something that does not even
>>> exist yet.
>>>
>>
>> Of course, PGO support for IPA is exactly the missing (and very
>> important) piece we plan to add -- if it already existed, there would
>> be no problem.
>>
>> thanks,
>>
>> David
>>
>>
>>>
>>>
>>>
>>> Attached are two cases, as well as the frequency graphs computed
>>> today (with the Laplace rule) and the correct frequencies expected.
>>>
>>>
>>> I’d be a lot more interested to see a real-world example.
>>>
>>>
>>> See my reply above. On the other hand, I'd like to see examples where
>>> the Laplace rule can actually help improve the profile data quality.
>>>
>>>
>>> It’s not about improving the results — it’s about preventing clang from
>>> being overly aggressive about optimizing based on limited profile data.
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev