[LLVMdev] RFC - Improvements to PGO profile support
Bob Wilson
bob.wilson at apple.com
Thu May 7 08:43:35 PDT 2015
> On May 7, 2015, at 12:55 AM, Hayden Livingston <halivingston at gmail.com> wrote:
>
> Can you tell us if you're continuing to use the same approach as
> described in one of the LLVM meetings, i.e. instrument at the clang
> AST level?
Yes, that is the approach we’re using.
>
> Also, do you generate GCOV files, some yaml, or is this a separate format?
It is a separate format. The code for reading/writing profile data is in compiler-rt/lib/profile/.
There is also a separate format for mapping the profile data back to source locations for code coverage testing. Details here: http://www.llvm.org/docs/CoverageMappingFormat.html
>
> And finally in the meeting you had given how you assign counters to
> the blocks, an algorithm to minimize the number of insertions. Is that
> algorithm a well-known one or a custom one? Is that described
> somewhere?
It is a custom approach. I don’t think we have a written description, but the code is pretty straightforward. Look at ComputeRegionCounts in clang’s lib/CodeGen/CodeGenPGO.cpp source file.
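For a rough feel of the kind of trick involved (this is just a sketch of the general idea using flow conservation, not the actual ComputeRegionCounts logic), note that many counts can be derived instead of instrumented:

  // Sketch only, not Clang's algorithm: with one counter on the entry
  // region and one on the "then" region, the "else" count needs no
  // counter of its own, because counts obey flow conservation.
  #include <cstdint>
  #include <cstdio>

  uint64_t Counters[2] = {0, 0};

  int abs_val(int x) {
    ++Counters[0];            // counter for the function entry region
    if (x < 0) {
      ++Counters[1];          // counter for the "then" region
      return -x;
    }
    return x;                 // "else" count = Counters[0] - Counters[1]
  }

  int main() {
    abs_val(-3); abs_val(5); abs_val(7);
    std::printf("entry=%llu then=%llu else(derived)=%llu\n",
                (unsigned long long)Counters[0],
                (unsigned long long)Counters[1],
                (unsigned long long)(Counters[0] - Counters[1]));
  }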
>
> On Wed, Mar 25, 2015 at 10:47 PM, Xinliang David Li <davidxl at google.com> wrote:
>> Bob,
>>
>>> Which workload is better? I don’t at all trust users to get this right, at
>>> least for real, non-benchmark code.
>>
>> We do have a lot of users (real-world apps) who can get this right --
>> I am not joking ;)
>>
>>>
>>>
>>> Without the rule, the two workloads at least produce consistent
>>> profile data. With the Laplace rule, you get 50 in one case and 66
>>> in the other.
>>>
>>>
>>> Yes, but you’ve got more information in one case than the other. This is a
>>> feature IMO, not a bug. It’s entirely possible that with workload 2, the
>>> loop may have executed for a drastically different number of iterations. The
>>> fact that it did not, i.e., that it was consistent with workload 1, is more
>>> information that you did not have before. It makes sense for the compiler to
>>> be more aggressive when it has more data.
>>
>> But the decision by the compiler is arbitrary and not necessarily
>> correct. For instance, the single run used in the training may have
>> actually executed far fewer iterations than average. With the
>> Laplace rule, the iteration count becomes even smaller. My point is
>> that there is no way for the compiler to tell how good the data is,
>> nor is the compiler in a good position to make that judgement. By
>> doing so, the users who carefully prune their workloads to reduce
>> runtime get punished for no reason.
>>
>>
>>>
>>>
>>> Having some technology to improve the confidence of the profile data is
>>> fine, but I don't see
>>> 1) how the Laplace rule is good for it
>>> What do you not understand about it? As the counts get larger, Laplace’s
>>> rule fades into the noise. It only makes a difference in cases where some
>>> of the counts are *very* small, and in those cases, it simply adjusts
>>> the weights to make optimizations less aggressive.
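To illustrate the smoothing numerically (a quick sketch; the (s+1)/(n+2) form assumes a two-way branch where one is added to each arm's weight):

  #include <cstdio>

  // Probability of the taken edge of a two-way branch after adding one
  // to each arm's count (Laplace's rule of succession): (s+1) / (n+2).
  double smoothed(unsigned taken, unsigned total) {
    return double(taken + 1) / double(total + 2);
  }

  int main() {
    // Small counts: the estimate is pulled noticeably toward 1/2.
    std::printf("3/4       -> raw %.4f, smoothed %.4f\n",
                3.0 / 4, smoothed(3, 4));
    // Large counts: the smoothing fades into the noise.
    std::printf("3000/4000 -> raw %.4f, smoothed %.4f\n",
                3000.0 / 4000, smoothed(3000, 4000));
  }

With 3 of 4, the estimate moves from 0.75 to about 0.67; with 3000 of 4000 it stays at about 0.75.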
>>
>> Strictly speaking, in a loop context, it just makes optimizations
>> assume shorter trip counts.
>>
>>>
>>> 2) why this cannot be done on the consumer side (i.e., faithfully
>>> record the profile data).
>>>
>>>
>>> What does this have to do with how faithfully the profile is recorded? We’ve
>>> got fully accurate data, but if the profiling inputs are too small or not
>>> representative, you may still get poor optimization choices.
>>
>> The point is that there is no need to adjust the weights. It is very
>> easy to check the loop header's profile count to determine how much
>> confidence to give it (possibly controlled by a flag). Control at
>> that level is more fine-grained than blindly changing the
>> weights right after reading the profile data.
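(For illustration, a consumer-side check along these lines might look like the sketch below; the names and the threshold are hypothetical, not an existing LLVM API.)

  #include <cstdint>
  #include <optional>

  constexpr uint64_t MinTrustedHeaderCount = 100; // could be a flag

  // Hypothetical sketch: only trust a profile-derived trip-count
  // estimate when the loop header executed often enough for the
  // estimate to be meaningful.
  std::optional<uint64_t>
  trustedTripCount(uint64_t headerCount, uint64_t preheaderCount) {
    if (headerCount < MinTrustedHeaderCount || preheaderCount == 0)
      return std::nullopt; // too little data: use static heuristics
    return headerCount / preheaderCount; // average iterations per entry
  }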
>>
>>>
>>>
>>>
>>>
>>> 2) result in bad inlining decisions. For instance:
>>>   for (...)
>>>     bar();  // (1)
>>>
>>> where (1) is the only callsite to bar(). Using the rule, the count of the
>>> BB enclosing the call to bar() can be as low as half of the entry count
>>> of bar(). The inliner will get confused, think there are more hot
>>> callsites to 'bar', and make suboptimal decisions ...
>>>
>>> Also, if bar has calls to other functions, those callsites will look
>>> hotter than the call to 'bar' …
>>>
>>>
>>> Your own proposal for recording entry counts is to record “relative
>>> hotness”, not absolute profile counts.
>>>
>>>
>>> The proposal is to record 'global hotness' that can be used to compare
>>> relative hotness across procedural boundaries (e.g. callsites in
>>> different callers). Profile counts satisfy this condition.
>>>
>>> On the caller’s side, we’ve got a branch weight influenced by Laplace’s rule
>>> that is then used to compute BlockFrequency, and you’re concerned about a
>>> mismatch between that and the “relative hotness” recorded for the callee?
>>>
>>>
>>> Basically, say the caller is test()
>>>
>>> bar() {
>>>   // ENTRY count = 100 (from profile data)
>>>   // ENTRY freq = 1
>>>
>>>   // BB2: Freq(BB2) = 1, count = 100
>>>   foo();  // (2)
>>> }
>>>
>>>
>>> test() {
>>>   // ENTRY count = 1 (from profile data)
>>>   // ENTRY freq = 1
>>>   for (i = 0; i < 100; i++) {
>>>     // BB1: Freq(BB1) = 50 due to the Laplace rule
>>>     bar();  // Freq = 50, count = 50  (1)
>>>   }
>>> }
>>>
>>> With the Laplace rule, the block freq computed for bar's enclosing BB will
>>> be wrong -- as a result, the count of bar's enclosing BB will be wrong
>>> too: 50 * 1/1 = 50.
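Spelling out the arithmetic behind the 50 (my reconstruction of the numbers in this example): per entry of test(), the latch's raw edge counts are 99 (backedge) and 1 (exit), giving a raw expected trip count of 100; after one is added to each arm the weights become 100:2 and the estimate drops to 51:

  #include <cstdio>

  // Expected iterations per loop entry for a latch branch with the
  // given backedge/exit weights: 1 / (1 - p_backedge) = (back+exit)/exit.
  double expectedTrips(double back, double exit) {
    return (back + exit) / exit;
  }

  int main() {
    std::printf("raw:      %.0f iterations\n", expectedTrips(99, 1));  // 100
    std::printf("smoothed: %.0f iterations\n", expectedTrips(100, 2)); //  51
  }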
>>>
>>> The global hotness of callsites (1) & (2) should be the same, but it is
>>> distorted when the Laplace rule is used.
>>>
>>> Yes, we care about using PGO across routine boundaries for IPO.
>>>
>>>
>>> I understand the issue, but my point was that you should simply not do that.
>>> You’re objecting to Laplace’s rule based on a hypothetical comparison of
>>> block frequencies vs. entry counts. There is nothing in LLVM that does that
>>> now. We don’t even have entry counts.
>>
>> I am not sure what you mean by 'hypothetical comparison of block
>> frequencies vs entry counts', but that does not seem to be what I mean.
>> What I mean is:
>>
>> 1) We need a metric to represent global hotness. Profile (execution)
>> count fits the bill
>> 2) There are two ways to compute profile counts for BBs:
>> a) Directly compute them from the edge counts recorded in the profile
>> data (BB frequency can then be directly scaled from them), but this
>> change requires slightly changing MD_prof's meaning or introducing
>> MD_count to record edge counts without capping/scaling.
>>
>> b) Just record the entry profile count (a minimal change) and do
>> not change MD_prof. This approach reuses block frequency
>> propagation, but the latter relies on unaltered branch
>> probabilities/weights in order to precisely recompute the counts
>> (combined with the entry count).
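Concretely, the relation b) relies on is the standard scaling (a sketch of the relation, not actual LLVM code):

  #include <cstdint>

  // Under approach b), a block's profile count is recovered by scaling
  // its block frequency by the recorded entry count:
  //   count(BB) = entry_count * freq(BB) / freq(entry)
  // Any distortion of the branch weights (e.g. from Laplace smoothing)
  // therefore propagates directly into the recomputed counts.
  uint64_t recomputedCount(uint64_t entryCount, uint64_t bbFreq,
                           uint64_t entryFreq) {
    return entryCount * bbFreq / entryFreq;
  }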
>>
>> Since people have concerns about a), we chose b). For b), I merely
>> pointed out in the above example that with the Laplace rule, the
>> recomputed profile count at the only callsite of 'bar' can be greatly
>> different from bar's recorded entry profile count. The profile
>> distribution across incoming callsites can be a good signal for
>> inlining decisions, so such a difference would be bad.
>>
>>>
>>> I don’t see how you can argue that Laplace’s rule is bad because it could
>>> affect an apples vs. oranges comparison of something that does not even
>>> exist yet.
>>>
>>
>> Of course, PGO support for IPA is exactly the missing (and very
>> important) piece we plan to add -- if it already existed, there would
>> be no problem.
>>
>> thanks,
>>
>> David
>>
>>
>>>
>>>
>>>
>>> Attached are two cases, as well as the frequency graphs computed
>>> today (with the Laplace rule) and the correct frequencies expected.
>>>
>>>
>>> I’d be a lot more interested to see a real-world example.
>>>
>>>
>>> See my reply above. On the other hand, I'd like to see examples where
>>> the Laplace rule can actually help improve the profile data quality.
>>>
>>>
>>> It’s not about improving the results — it’s about preventing clang from
>>> being overly aggressive about optimizing based on limited profile data.
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev