<div dir="ltr">Hi all,<div><br></div><div>I am a bit confused about the documentation of the format of the profile data file.</div><div><br></div><div>The Clang user guide here describes it as an ASCII text file:</div><div><a href="http://clang.llvm.org/docs/UsersManual.html#sample-profile-format">http://clang.llvm.org/docs/UsersManual.html#sample-profile-format</a><br></div><div><br></div><div>Whereas the posts above and the referenced link describe it as a stream of bytes containing LEB128s:</div><div><a href="http://www.llvm.org/docs/CoverageMappingFormat.html">http://www.llvm.org/docs/CoverageMappingFormat.html</a><br></div><div><br></div><div>From experimenting with the latest trunk I can see the latter is correct (well, at least the file I get is not ASCII text).</div><div>Should we update the Clang user guide documentation?</div><div>Or am I just getting confused? Are there two formats, one used for coverage and one used for PGO?</div><div><br></div><div>Cheers,</div><div>    Dario Domizioli</div><div>    SN Systems - Sony Computer Entertainment Group</div><div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 7 May 2015 at 16:43, Bob Wilson <span dir="ltr"><<a href="mailto:bob.wilson@apple.com" target="_blank">bob.wilson@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>

> On May 7, 2015, at 12:55 AM, Hayden Livingston <<a href="mailto:halivingston@gmail.com">halivingston@gmail.com</a>> wrote:<br>

><br>

> Can you tell us if you're continuing to use the same approach as<br>

> described in one of the LLVM meetings, i.e. instrument at the clang<br>

> AST level?<br>

<br>

</span>Yes, that is the approach we’re using.<br>

<span class=""><br>

><br>

> Also, do you generate GCOV files, some yaml, or is this a separate format?<br>

<br>

</span>It is a separate format. The code for reading/writing profile data is in compiler-rt/lib/profile/.<br>

<br>

There is also a separate format for mapping the profile data back to source locations for code coverage testing. Details here: <a href="http://www.llvm.org/docs/CoverageMappingFormat.html" target="_blank">http://www.llvm.org/docs/CoverageMappingFormat.html</a><br>
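For readers unfamiliar with the encoding: the coverage mapping document linked above describes its integers as LEB128 values. A minimal sketch of decoding one unsigned LEB128 value follows. The function name decode_uleb128 and the sample bytes are illustrative assumptions, not code taken from compiler-rt or the LLVM sources.<br>

```python
def decode_uleb128(data, offset=0):
    """Decode one unsigned LEB128 value from `data` starting at `offset`.

    Returns (value, next_offset). Hypothetical helper for illustration only.
    """
    value = 0
    shift = 0
    while True:
        if offset >= len(data):
            raise ValueError("truncated LEB128 value")
        byte = data[offset]
        offset += 1
        # The low 7 bits carry payload, least significant group first.
        value |= (byte & 0x7F) << shift
        # A clear high bit marks the final byte of the value.
        if byte & 0x80 == 0:
            return value, offset
        shift += 7
```

For example, the three bytes 0xE5 0x8E 0x26 decode to 624485, and the returned offset points at the byte after the value.<br>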

<span class=""><br>

><br>

> And finally in the meeting you had given how you assign counters to<br>

> the blocks, an algorithm to minimize the number of insertions. Is that<br>

> algorithm a well-known one or a custom one? Is that described<br>

> somewhere?<br>

<br>

</span>It is a custom approach. I don’t think we have a written description but the code is pretty straightforward. Look at ComputeRegionCounts in clang’s lib/CodeGen/CodeGenPGO.cpp source file.<br>

<div class="HOEnZb"><div class="h5"><br>

><br>

> On Wed, Mar 25, 2015 at 10:47 PM, Xinliang David Li <<a href="mailto:davidxl@google.com">davidxl@google.com</a>> wrote:<br>

>> Bob,<br>

>><br>

>>> Which workload is better? I don’t at all trust users to get this right, at<br>

>>> least for real, non-benchmark code.<br>

>><br>

>> We do have a lot of users (real world apps) who can get this right --<br>

>> I am not joking ;)<br>

>><br>

>>><br>

>>><br>

>>> Without the rule, the two workloads at least produce consistent<br>

>>> profile data.  With the Laplace rule, you get 50 in one case, and 66<br>

>>> in the other.<br>
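The 50 and 66 figures can be reproduced with a small calculation. This is a sketch under assumptions: each branch weight is taken to be its raw count plus one, a 100-iteration loop is assumed to take the back edge 99 times and the exit edge once per run, and workload 2 is assumed to be two such runs. The function name is hypothetical.<br>

```python
def estimated_trip_ratio(back_edge_count, exit_count, laplace=True):
    """Ratio of back-edge weight to exit weight, truncated to an integer.

    With laplace=True each raw count is incremented by one before the
    ratio is taken (the assumed form of the rule discussed in this thread).
    """
    bump = 1 if laplace else 0
    return (back_edge_count + bump) // (exit_count + bump)


# Assumed raw (back edge, exit) counts for the two workloads.
workload1 = (99, 1)    # one run of the 100-iteration loop
workload2 = (198, 2)   # two identical runs
```

Under these assumptions the adjusted ratios are 50 and 66, while the unadjusted ratio is 99 for both workloads.<br>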

>>><br>

>>><br>

>>> Yes, but you’ve got more information in one case than the other. This is a<br>

>>> feature IMO, not a bug. It’s entirely possible that with workload 2, the<br>

>>> loop may have executed for a drastically different number of iterations. The<br>

>>> fact that it did not, i.e., that it was consistent with workload 1, is more<br>

>>> information that you did not have before. It makes sense for the compiler to<br>

>>> be more aggressive when it has more data.<br>

>><br>

>> But the decision by the compiler is arbitrary and not necessarily<br>

>> correct.  For instance, the single run used in the training may have<br>

>> actually executed far fewer iterations than average. With<br>

>> the Laplace rule, the iteration count becomes even smaller. My point is<br>

>> that there is no way for the compiler to tell how good the data is, nor is<br>

>> the compiler in a good position to make that judgement.  By so doing,<br>

>> the users who carefully prune their workload to reduce runtime get<br>

>> punished for no reason.<br>

>><br>

>><br>

>>><br>

>>><br>

>>> Having some technology to improve confidence of the profile data is<br>

>>> fine, but I don't see<br>

>>> 1) how the Laplace rule is good for it<br>

>>>><br>

>>> What do you not understand about it? As the counts get larger, Laplace’s<br>

>>> rule fades into the noise. It only makes a difference for cases where some<br>

>>> of the counts are *very* small, and in those cases, it very simply adjusts<br>

>>> the weights to make optimizations less aggressive.<br>
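A short sketch of how the adjustment fades as counts grow. It assumes the rule adds one to each of the two outcome counts of a branch before the probability is computed; the function name is hypothetical and the counts are made-up inputs.<br>

```python
def taken_probability(taken, not_taken, laplace=True):
    """Probability that a two-way branch is taken.

    With laplace=True, one is added to each outcome count first
    (assumed form of the rule; requires at least one raw count otherwise).
    """
    bump = 1 if laplace else 0
    return (taken + bump) / (taken + not_taken + 2 * bump)


# Compare raw and adjusted estimates for a branch that was always taken.
for n in (1, 10, 1000):
    raw = taken_probability(n, 0, laplace=False)
    adjusted = taken_probability(n, 0)
    print(n, raw, round(adjusted, 4))
```

With a single observation the adjusted probability is 2/3 rather than 1, while with 1000 observations the two estimates differ by less than 0.001.<br>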

>><br>

>> Strictly speaking, in a loop context, it just makes optimizations<br>

>> assume shorter trip counts.<br>

>><br>

>>><br>

>>> 2) why this cannot be done on the consumer side (i.e., faithfully<br>

>>> record the profile data).<br>

>>><br>

>>><br>

>>> What does this have to do with how faithfully the profile is recorded? We’ve<br>

>>> got fully accurate data, but if the profiling inputs are too small or not<br>

>>> representative, you may still get poor optimization choices.<br>

>><br>

>> The point is that there is no need to adjust the weights. It is very<br>

>> easy to check the loop header's profile count to determine how much<br>

>> confidence you want to give (possibly controlled with a flag). The<br>

>> control in this way is finer-grained than blindly changing the<br>

>> weight right after reading the profile data.<br>
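A consumer-side check of the kind described here might look like the following sketch. Everything in it is hypothetical: the function name, the default threshold of 100, and the convention of returning None to mean "fall back to static heuristics" are assumptions for illustration, not existing LLVM behaviour.<br>

```python
def average_trip_count(header_count, loop_entry_count, min_header_count=100):
    """Return the raw average trip count, or None when data is too sparse.

    header_count: times the loop header executed (from the profile).
    loop_entry_count: times the loop was entered from outside.
    min_header_count: hypothetical confidence threshold (a tunable "flag").
    """
    if loop_entry_count <= 0 or header_count < min_header_count:
        # Too little data: signal the caller to fall back to static heuristics.
        return None
    return header_count / loop_entry_count
```

The recorded weights stay untouched, and each optimization can pick its own threshold.<br>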

>><br>

>>><br>

>>><br>

>>><br>

>>><br>

>>> 2) result in bad inlining decisions. For instance:<br>

>>>  for (...)<br>

>>>      bar();  // (1)<br>

>>><br>

>>> where (1) is the only callsite to bar(). Using the rule, the count of the BB<br>

>>> enclosing the call to bar() can be as low as half of the entry count<br>

>>> of bar(). The inliner will get confused, think there are more hot<br>

>>> callsites to 'bar', and make suboptimal decisions.<br>

>>><br>

>>> Also if bar has calls to other functions, those callsites will look<br>

>>> hotter than the call to 'bar' …<br>

>>><br>

>>><br>

>>> Your own proposal for recording entry counts is to record “relative<br>

>>> hotness”, not absolute profile counts.<br>

>>><br>

>>><br>

>>> The proposal is to record 'global hotness' that can be used to compare<br>

>>> relative hotness across procedural boundaries (e.g. callsites in<br>

>>> different callers). Profile counts satisfy this condition.<br>

>>><br>

>>> On the caller’s side, we’ve got a branch weight influenced by Laplace’s rule<br>

>>> that is then used to compute BlockFrequency and you’re concerned about a<br>

>>> mismatch between that and the “relative hotness” recorded for the callee?<br>

>>><br>

>>><br>

>>> Basically, say the caller is test()<br>

>>><br>

>>> bar(){<br>

>>> // ENTRY count =  100 (from profile data)<br>

>>> // ENTRY freq = 1<br>

>>><br>

>>> // BB2: Freq(BB2) = 1, count = 100<br>

>>> foo ();              (2)<br>

>>> }<br>

>>><br>

>>><br>

>>> test() {<br>

>>>  // ENTRY count = 1 (from profile data)<br>

>>>  // Entry Freq = 1<br>

>>>  for (i = 0; i < 100; i++) {<br>

>>>      // BB1: Freq(BB1) = 50 due to Laplace rule<br>

>>>      bar();  // Freq = 50, count = 50    (1)<br>

>>>   }<br>

>>> }<br>

>>><br>

>>> With the Laplace rule, the block frequency computed for the BB enclosing the<br>

>>> call to bar() will be wrong, and as a result that BB's count will be wrong<br>

>>> too: 50*1/1 = 50.<br>

>>><br>

>>> The global hotness of call sites (1) and (2) should be the same, but it is<br>

>>> distorted when the Laplace rule is used.<br>
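The arithmetic in this example can be written out as a short sketch. It assumes a block's count is recomputed as block frequency times entry count divided by entry frequency, which is the 50*1/1 calculation above. The function name is hypothetical, and the value 100 for the unadjusted frequency of BB1 is an assumption matching the 100 loop iterations.<br>

```python
def block_count(block_freq, entry_freq, entry_count):
    """Recompute a block's execution count from its relative frequency.

    Scales the function's recorded entry count by the ratio of the
    block's frequency to the entry block's frequency.
    """
    return block_freq * entry_count // entry_freq


# Call site (1) in test(): entry count 1, entry frequency 1.
count_with_rule = block_count(50, 1, 1)       # frequency skewed to 50
count_without_rule = block_count(100, 1, 1)   # assumed unadjusted frequency

# Call site (2) in bar(): entry count 100, frequency 1.
count_in_bar = block_count(1, 1, 100)
```

Only the unadjusted frequency gives call site (1) the same count as the recorded entry count of bar(), which is also the count at call site (2).<br>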

>>><br>

>>> Yes, we care about using PGO across routine boundaries for IPO.<br>

>>><br>

>>><br>

>>> I understand the issue, but my point was that you should simply not do that.<br>

>>> You’re objecting to Laplace’s rule based on a hypothetical comparison of<br>

>>> block frequencies vs. entry counts. There is nothing in LLVM that does that<br>

>>> now. We don’t even have entry counts.<br>

>><br>

>> I am not sure what you mean by 'hypothetical comparison of block<br>

>> frequencies vs entry counts', but it does not seem to be what I mean.<br>

>> What I mean is that<br>

>><br>

>> 1) We need a metric to represent global hotness. Profile (execution)<br>

>> count fits the bill<br>

>> 2) There are two ways to compute profile count for BBs<br>

>>   a) directly compute it from the edge count recorded in profile data<br>

>> (and BB Frequency can be directly scaled from it), but this change<br>

>> requires slightly changing MD_prof's meaning or introducing MD_count<br>

>> to record edge count without capping/scaling.<br>

>><br>

>>   b) Just recording the entry profile count (minimal change), but do<br>

>> not change MD_prof. This approach will reuse block frequency<br>

>> propagation, but the latter relies on unaltered branch<br>

>> probability/weight in order to recompute the count precisely (combined<br>

>> with the entry count).<br>

>><br>

>> Since people have concerns about a), we chose b). For b), I merely<br>

>> pointed out in the above example that with the Laplace rule, the<br>

>> recomputed profile count at the only callsite of 'bar' can be greatly<br>

>> different from the recorded entry profile count of 'bar'. The incoming<br>

>> callsites' profile distribution can be a good signal for inlining<br>

>> decisions, so such a difference would be harmful.<br>

>><br>

>>><br>

>>> I don’t see how you can argue that Laplace’s rule is bad because it could<br>

>>> affect an apples vs. oranges comparison of something that does not even<br>

>>> exist yet.<br>

>>><br>

>><br>

>> Of course, PGO for IPA support is exactly the missing (and very<br>

>> important) piece we plan to add -- if it already existed, there would<br>

>> be no problems.<br>

>><br>

>> thanks,<br>

>><br>

>> David<br>

>><br>

>><br>

>>><br>

>>><br>

>>><br>

>>> The attached are two cases as well as the frequency graph computed<br>

>>> today (with the Laplace rule) and the correct frequency expected.<br>

>>><br>

>>><br>

>>> I’d be a lot more interested to see a real-world example.<br>

>>><br>

>>><br>

>>> See my reply above. On the other hand, I'd like to see examples where<br>

>>> the Laplace rule can actually help improve the profile data quality.<br>

>>><br>

>>><br>

>>> It’s not about improving the results — it’s about preventing clang from<br>

>>> being overly aggressive about optimizing based on limited profile data.<br>

>><br>

>> _______________________________________________<br>

>> LLVM Developers mailing list<br>

>> <a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

>> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

<br>

<br>


</div></div></blockquote></div><br></div>