<div dir="ltr">Hi all,<div><br></div><div>I am a bit confused about the documentation of the format of the profile data file.</div><div><br></div><div>The Clang user guide here describes it as an ASCII text file:</div><div><a href="http://clang.llvm.org/docs/UsersManual.html#sample-profile-format">http://clang.llvm.org/docs/UsersManual.html#sample-profile-format</a><br></div><div><br></div><div>Whereas the posts above and the referenced link describe it as a stream of bytes containing LEB128s:</div><div><a href="http://www.llvm.org/docs/CoverageMappingFormat.html">http://www.llvm.org/docs/CoverageMappingFormat.html</a><br></div><div><br></div><div>From experimenting with the latest trunk, I can see that the latter is correct (well, at least the file I get is not ASCII text).</div><div>Should we update the Clang user guide documentation?</div><div>Or am I just getting confused? Are there two formats, one used for coverage and one used for PGO?</div><div><br></div><div>Cheers,</div><div> Dario Domizioli</div><div> SN Systems - Sony Computer Entertainment Group</div><div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 7 May 2015 at 16:43, Bob Wilson <span dir="ltr"><<a href="mailto:bob.wilson@apple.com" target="_blank">bob.wilson@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>
> On May 7, 2015, at 12:55 AM, Hayden Livingston <<a href="mailto:halivingston@gmail.com">halivingston@gmail.com</a>> wrote:<br>
><br>
> Can you tell us if you're continuing to use the same approach as<br>
> described in one of the LLVM meetings, i.e. instrument at the clang<br>
> AST level?<br>
<br>
</span>Yes, that is the approach we’re using.<br>
<span class=""><br>
><br>
> Also, do you generate GCOV files, some YAML, or is this a separate format?<br>
<br>
</span>It is a separate format. The code for reading/writing profile data is in compiler-rt/lib/profile/.<br>
<br>
There is also a separate format for mapping the profile data back to source locations for code coverage testing. Details here: <a href="http://www.llvm.org/docs/CoverageMappingFormat.html" target="_blank">http://www.llvm.org/docs/CoverageMappingFormat.html</a><br>
<span class=""><br>
><br>
> And finally in the meeting you had given how you assign counters to<br>
> the blocks, an algorithm to minimize the number of insertions. Is that<br>
> algorithm a well-known one or a custom one? Is that described<br>
> somewhere?<br>
<br>
</span>It is a custom approach. I don’t think we have a written description, but the code is pretty straightforward. Look at ComputeRegionCounts in clang’s lib/CodeGen/CodeGenPGO.cpp source file.<br>
<div class="HOEnZb"><div class="h5"><br>
><br>
> On Wed, Mar 25, 2015 at 10:47 PM, Xinliang David Li <<a href="mailto:davidxl@google.com">davidxl@google.com</a>> wrote:<br>
>> Bob,<br>
>><br>
>>> Which workload is better? I don’t at all trust users to get this right, at<br>
>>> least for real, non-benchmark code.<br>
>><br>
>> We do have a lot of users (real world apps) who can get this right --<br>
>> I am not joking ;)<br>
>><br>
>>><br>
>>><br>
>>> Without the rule, the two workloads at least produce consistent<br>
>>> profile data. With the Laplace rule, you get 50 in one case and 66<br>
>>> in the other.<br>
>>><br>
>>><br>
>>> Yes, but you’ve got more information in one case than the other. This is a<br>
>>> feature IMO, not a bug. It’s entirely possible that with workload 2, the<br>
>>> loop may have executed for a drastically different number of iterations. The<br>
>>> fact that it did not, i.e., that it was consistent with workload 1, is more<br>
>>> information that you did not have before. It makes sense for the compiler to<br>
>>> be more aggressive when it has more data.<br>
>><br>
>> But the decision by the compiler is arbitrary and not necessarily<br>
>> correct. For instance, the single run used in the training may have<br>
>> actually executed far fewer iterations than average. With the<br>
>> Laplace rule, the iteration count becomes even smaller. My point is<br>
>> that there is no way for the compiler to tell how good the data is, nor is<br>
>> the compiler in a good position to make that judgement. By so doing,<br>
>> the users who carefully prune their workloads to reduce runtime get<br>
>> punished for no reason.<br>
>><br>
>><br>
>>><br>
>>><br>
>>> Having some technology to improve confidence of the profile data is<br>
>>> fine, but I don't see<br>
>>> 1) how the Laplace rule is good for it<br>
>>>><br>
>>> What do you not understand about it? As the counts get larger, Laplace’s<br>
>>> rule fades into the noise. It only makes a difference for cases where some<br>
>>> of the counts are *very* small, and in those cases, it very simply adjusts<br>
>>> the weights to make optimizations less aggressive.<br>
>><br>
>> Strictly speaking, in a loop context, it just makes optimizations<br>
>> assume shorter trip counts.<br>
>><br>
>>><br>
>>> 2) why this cannot be done on the consumer side (i.e., faithfully<br>
>>> record the profile data).<br>
>>><br>
>>><br>
>>> What does this have to do with how faithfully the profile is recorded? We’ve<br>
>>> got fully accurate data, but if the profiling inputs are too small or not<br>
>>> representative, you may still get poor optimization choices.<br>
>><br>
>> The point is that there is no need to adjust the weights. It is very<br>
>> easy to check the loop header's profile count to determine how much<br>
>> confidence to give it (possibly controlled with a flag). This<br>
>> control is more fine-grained than blindly changing the<br>
>> weights right after reading the profile data.<br>
>><br>
>>><br>
>>><br>
>>><br>
>>><br>
>>> 2) result in bad inlining decisions. For instance:<br>
>>> for (...)<br>
>>> bar(); // (1)<br>
>>><br>
>>> where (1) is the only callsite to bar(). Using the rule, the count of the BB<br>
>>> enclosing the call to bar() can be as low as half of the entry count<br>
>>> of bar(). The inliner will get confused and think there are other hot<br>
>>> callsites to 'bar' and make suboptimal decisions ..<br>
>>><br>
>>> Also if bar has calls to other functions, those callsites will look<br>
>>> hotter than the call to 'bar' …<br>
>>><br>
>>><br>
>>> Your own proposal for recording entry counts is to record “relative<br>
>>> hotness”, not absolute profile counts.<br>
>>><br>
>>><br>
>>> The proposal is to record 'global hotness' that can be used to compare<br>
>>> relative hotness across procedural boundaries (e.g. callsites in<br>
>>> different callers). Profile counts satisfy this condition.<br>
>>><br>
>>> On the caller’s side, we’ve got a branch weight influenced by Laplace’s rule<br>
>>> that is then used to compute BlockFrequency, and you’re concerned about a<br>
>>> mismatch between that and the “relative hotness” recorded for the callee??<br>
>>><br>
>>><br>
>>> Basically, say the caller is test()<br>
>>><br>
>>> bar(){<br>
>>> // ENTRY count = 100 (from profile data)<br>
>>> // ENTRY freq = 1<br>
>>><br>
>>> // BB2: Freq(BB2) = 1, count = 100<br>
>>> foo (); (2)<br>
>>> }<br>
>>><br>
>>><br>
>>> test() {<br>
>>> // ENTRY count = 1 (from profile data)<br>
>>> // Entry Freq = 1<br>
>>> for (i = 0; i < 100; i++) {<br>
>>> // BB1: Freq(BB1) = 50 due to Laplace rule<br>
>>> bar(); // Freq = 50, count = 50 (1)<br>
>>> }<br>
>>> }<br>
>>><br>
>>> With the Laplace rule, the block freq computed for bar's enclosing BB will<br>
>>> be wrong -- and as a result, that BB's count will be wrong<br>
>>> too: 50 * 1/1 = 50, instead of the 100 recorded as bar's entry count.<br>
>>><br>
>>> The global hotness of call site (1) & (2) should be the same, but<br>
>>> distorted when Laplace rule is used.<br>
>>><br>
>>> Yes, we care about using PGO across routine boundaries for IPO.<br>
>>><br>
>>><br>
>>> I understand the issue, but my point was that you should simply not do that.<br>
>>> You’re objecting to Laplace’s rule based on a hypothetical comparison of<br>
>>> block frequencies vs. entry counts. There is nothing in LLVM that does that<br>
>>> now. We don’t even have entry counts.<br>
>><br>
>> I am not sure what you mean by 'hypothetical comparison of block<br>
>> frequencies vs entry counts', but it does not seem to be what I mean.<br>
>> What I mean is that<br>
>><br>
>> 1) We need a metric to represent global hotness. Profile (execution)<br>
>> count fits the bill<br>
>> 2) There are two ways to compute profile count for BBs<br>
>> a) directly compute it from the edge count recorded in profile data<br>
>> (and BB Frequency can be directly scaled from it), but this change<br>
>> requires slightly changing MD_prof's meaning or introducing MD_count<br>
>> to record edge count without capping/scaling.<br>
>><br>
>> b) Just record the entry profile count (minimal change), but do<br>
>> not change MD_prof. This approach will reuse block frequency<br>
>> propagation, but the latter relies on unaltered branch<br>
>> probabilities/weights in order to recompute the count precisely (combined<br>
>> with the entry count).<br>
>><br>
>> Since people have concerns about a), we chose b). For b), I merely<br>
>> pointed out in the above example that with the Laplace rule, the<br>
>> recomputed profile count at the only callsite of 'bar' can be greatly<br>
>> different from the recorded entry profile count of 'bar'. The incoming<br>
>> callsites' profile distribution can be a good signal for inlining<br>
>> decisions, so such a difference will be bad.<br>
>><br>
>>><br>
>>> I don’t see how you can argue that LaPlace’s rule is bad because it could<br>
>>> affect an apples vs. oranges comparison of something that does not even<br>
>>> exist yet.<br>
>>><br>
>><br>
>> Of course, PGO support for IPA is exactly the missing (and very<br>
>> important) piece we plan to add -- if it already existed, there would<br>
>> be no problems.<br>
>><br>
>> thanks,<br>
>><br>
>> David<br>
>><br>
>><br>
>>><br>
>>><br>
>>><br>
>>> Attached are two cases, as well as the frequency graphs computed<br>
>>> today (with the Laplace rule) and the correct frequencies expected.<br>
>>><br>
>>><br>
>>> I’d be a lot more interested to see a real-world example.<br>
>>><br>
>>><br>
>>> See my reply above. On the other hand, I'd like to see examples where<br>
>>> the Laplace rule can actually improve profile data quality.<br>
>>><br>
>>><br>
>>> It’s not about improving the results — it’s about preventing clang from<br>
>>> being overly aggressive about optimizing based on limited profile data.<br>
>><br>
>> _______________________________________________<br>
>> LLVM Developers mailing list<br>
>> <a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a> <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>
>> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>
<br>
<br>
</div></div></blockquote></div><br></div>