<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Apr 23, 2015 at 3:56 PM, Bob Wilson <span dir="ltr"><<a href="mailto:bob.wilson@apple.com" target="_blank">bob.wilson@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><br>

> On Apr 23, 2015, at 10:34 AM, <a href="mailto:betulb@codeaurora.org">betulb@codeaurora.org</a> wrote:<br>

><br>

>><br>

>>> On Apr 14, 2015, at 11:56 AM, <a href="mailto:betulb@codeaurora.org">betulb@codeaurora.org</a> wrote:<br>

>>><br>

>>>><br>

>>>> On 04/10/2015 09:25 AM, <a href="mailto:betulb@codeaurora.org">betulb@codeaurora.org</a> wrote:<br>

>>>>>> On 04/09/2015 11:06 AM, Betul Buyukkurt wrote:<br>

>>>>>>> In <a href="http://reviews.llvm.org/D8908#153838" target="_blank">http://reviews.llvm.org/D8908#153838</a>, @reames wrote:<br>

>>>>>>><br>

>>>>>>>> Have the IR level construct patches made it up for review?  If so,<br>

>>>>>>>> can<br>

>>>>>>> So far I've posted two patches. These two patches should apply<br>

>>>>>>> cleanly<br>

>>>>>>> to the tip, working with the present profile infrastructure. The<br>

>>>>>>> next<br>

>>>>>>> set of patches will be the enabler ones: i.e. three more patches one<br>

>>>>>>> for<br>

>>>>>>> each of clang, llvm and compiler-rt. Clang patch will be up for<br>

>>>>>>> review<br>

>>>>>>> later today.<br>

>>>>>>><br>

>>>>>>>> you send me a link?  I managed to miss them.<br>

>>>>>>> So far there is this patch and the instrinsic instruction<br>

>>>>>>> definitions:<br>

>>>>>>> <a href="http://reviews.llvm.org/D8877" target="_blank">http://reviews.llvm.org/D8877</a>. All patches are necessary for getting<br>

>>>>>>> the<br>

>>>>>>> IC targets and having them displayed by the llvm-profdata.<br>

>>>>>> Ok, I'm really not convinced that the instrumentation code needs to<br>

>>>>>> be<br>

>>>>>> or should be an intrinsic.  This seems like something which should be<br>

>>>>>> emitted by the frontend and optimized like any other code.  To say<br>

>>>>>> this<br>

>>>>>> a different way, my instrumentation is going to be entirely different<br>

>>>>>> than your instrumentation.<br>

>>>>>><br>

>>>>>> Having said that, I really don't care about this part of the proposed<br>

>>>>>> changes since they aren't going to impact me at all.  I'm am<br>

>>>>>> specifically not objecting to the changes, just commenting.  :)<br>

>>>>>>>> I'm assuming this will be some type of per call site metadata?<br>

>>>>>>> We do assign metadata at the indirect call sites. Format looks like<br>

>>>>>>> as<br>

>>>>>>> follows:<br>

>>>>>>><br>

>>>>>>> !33 = metadata !{metadata !"indirect_call_targets", i64<br>

>>>>>>> <total_exec_count>, metadata !"target_fn1Ã¢Â€Â , i64<br>

>>>>>>> <target_fn1_count>,<br>

>>>>>>> metadata !"target_fn2Ã¢Â€Â , i64 <target_fn2_count>, Ã¢Â€Â¦.}<br>

>>>>>>><br>

>>>>>>> Currently, we're recording only the top most called five function<br>

>>>>>>> names<br>

>>>>>>> at each indirect call site. Following the string literal<br>

>>>>>>> Ã¢Â€Âœindirect_call_targetsÃ¢Â€Â  are the fields  <total_exec_count><br>

>>>>>>> i.e. a<br>

>>>>>>> 64<br>

>>>>>>> bit value for the total number of times the indirect call is<br>

>>>>>>> executed<br>

>>>>>>> followed by the function names and execution counts of each target.<br>

>>>>>> This was the part I was trying to ask about.  I really want to see<br>

>>>>>> where<br>

>>>>>> you're going with this optimization wise.  My naive guess is that<br>

>>>>>> this<br>

>>>>>> is going to be slightly off for what you actually want.<br>

>>>>>><br>

>>>>>> Assuming you're going for profile guided devirtualization (and thus<br>

>>>>>> inlining), being able to check the type of the receiver (as opposed<br>

>>>>>> to<br>

>>>>>> the result of the virtual lookup) might be advantageous.  (Or, to say<br>

>>>>>> it<br>

>>>>>> differently, that's what I'm used to seeing.  Your approach might be<br>

>>>>>> completely reasonable, it's just not what I'm used to seeing.)  Have<br>

>>>>>> you<br>

>>>>>> thought about the tradeoffs here?<br>

>>>>> Not sure if I understood the problem here,<br>

>>>> First, I am not trying to say there is a problem with your approach; I<br>

>>>> am only saying that it's not what I would have expected based on past<br>

>>>> experience.  You may be entirely correct in your approach, you just<br>

>>>> need<br>

>>>> to convince me of that.  :)<br>

>>>>> however, we're recording both<br>

>>>>> the target address and the addresses/names of the instrumented<br>

>>>>> functions<br>

>>>>> during the execution of the instrumented binary. During profile<br>

>>>>> reading<br>

>>>>> these addresses are used to match the target addresses to<br>

>>>>> corresponding<br>

>>>>> functions.<br>

>>>> Ok, let's start from the basics.  For profile guided devirtualization,<br>

>>>> you're constructing a cache from (something) to function pointer and<br>

>>>> using that cache lookup to enable inlining of the hot target.  You have<br>

>>>> two standard choices on what to use as your cache key: the result of<br>

>>>> the<br>

>>>> virtual lookup and the inputs to the virtual lookup.<br>

>>>><br>

>>>> Option 1 - Inputs to virtual lookup<br>

>>>> if ((receiver, vtable index) == what I predicted)<br>

>>>>  tartget_I_predicted(); // inline me!!<br>

>>>> else {<br>

>>>>  target = full virtual dispatch();<br>

>>>>  target();<br>

>>>> }<br>

>>>><br>

>>>> Option 2 - result of virtual lookup<br>

>>>> target = full virtual dispatch();<br>

>>>> if ('target' == what I predicted)<br>

>>>>  tartget_I_predicted(); // inline me!!<br>

>>>> else {<br>

>>>>  target();<br>

>>>> }<br>

>>>><br>

>>>> You seem to be proposing option 2.  I'm saying that I'm used to seeing<br>

>>>> option 1 used.  Both approaches have their appeal, I'm just asking you<br>

>>>> to explain *why* you've chosen the one you apparently have.<br>

>>><br>

>>> Not all indirect calls occur from C++ like codes. We're profiling and<br>

>>> optimizing out indirect calls from C codes as well. We're seeing up to<br>

>>> 8%<br>

>>> gains on individual benchmarks in spec. This was measured on our<br>

>>> platform.<br>

>><br>

>> We could also consider a hybrid that uses option 1 for vtable calls and<br>

>> option 2 for general function pointer calls. For cases where the code<br>

>> calls several virtual functions on the same object, profiling for option 1<br>

>> could be more efficient if we only record the type of the object once. I<br>

>> have no idea if that is worthwhile but itâ€™s another possibility.<br>

>><br>

>> SPEC results are interesting, but Iâ€™d be much more interested to hear<br>

>> how it works for clang. If you build clang with this profiling enabled,<br>

>> what is the time and memory overhead? How much bigger are the profile data<br>

>> files?<br>

><br>

> We've instrumented clang using clang-tip. We collected profile data from<br>

> clang using one of the spec benchmarks under -O3. The benchmark was<br>

> composed of 16 files.<br>

><br>

> Total size of raw profile files:<br>

>               Original       IC-profiling        Increase<br>

>       4031666760       5063229256           %25.6<br>

><br>

> Total size of merged profile file:<br>

>       Original       IC-profiling        Increase<br>

>       65681320         65973768            %.44<br>

<br>

</div></div>Do you know why the raw files increase by 25% but the merged file doesn’t increase at all? That is surprising to me.<br>

<span class=""><br>

><br>

> Average of three runs(using time):<br>

>       Original       IC-profiling        Increase<br>

>        47.38           55.73              %17.6<br>

<br>

</span>What about the “plain” profiling? The relevant number is how much the indirect call info improves performance beyond what you get with just the execution counts.<br></blockquote><div><br></div><div>Clang, unfortunately does not benefit from indirect call promotion transformation with the testing I have done (using GCC's PGO).   Out of the top 10 hottest indirect callsites, the second hottest has 39 targets. Among them there are also ones with 15, 29, and 9 targets with no clear dominating one.  The hottest callsite has only one target, its function is a wrapper to another large function which is probably not a candidate for inlining.</div><div><br></div><div>Though not helping clang, IC promotion does help real world apps other than SPEC. We have seen improvements in the range of 2-4%.</div><div><br></div><div>David</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5"><br>

><br>

> The above numbers are collected from compiling the whole benchmark. We've<br>

> the IC profile data collected from clang. If there is interest we can<br>

> share the data w/ the community.<br>

><br>

> -Betul<br>

><br>

>> If you then rebuild clang with PGO, how much does it speed things<br>

>> up?<br>

>><br>

>><br>

><br>

><br>

<br>

<br>

_______________________________________________<br>

llvm-commits mailing list<br>

<a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br>

</div></div></blockquote></div><br></div></div>