<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Apr 15, 2015 at 10:13 AM, Bob Wilson <span dir="ltr"><<a href="mailto:bob.wilson@apple.com" target="_blank">bob.wilson@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><br>

> On Apr 14, 2015, at 11:56 AM, <a href="mailto:betulb@codeaurora.org">betulb@codeaurora.org</a> wrote:<br>

><br>

>><br>

>> On 04/10/2015 09:25 AM, <a href="mailto:betulb@codeaurora.org">betulb@codeaurora.org</a> wrote:<br>

>>>> On 04/09/2015 11:06 AM, Betul Buyukkurt wrote:<br>

>>>>> In <a href="http://reviews.llvm.org/D8908#153838" target="_blank">http://reviews.llvm.org/D8908#153838</a>, @reames wrote:<br>

>>>>><br>

>>>>>> Have the IR level construct patches made it up for review?  If so,<br>

>>>>>> can<br>

>>>>> So far I've posted two patches. These two patches should apply cleanly<br>

>>>>> to the tip, working with the present profile infrastructure. The next<br>

>>>>> set of patches will be the enabler ones: i.e. three more patches one<br>

>>>>> for<br>

>>>>> each of clang, llvm and compiler-rt. Clang patch will be up for review<br>

>>>>> later today.<br>

>>>>><br>

>>>>>> you send me a link?  I managed to miss them.<br>

>>>>> So far there is this patch and the instrinsic instruction definitions:<br>

>>>>> <a href="http://reviews.llvm.org/D8877" target="_blank">http://reviews.llvm.org/D8877</a>. All patches are necessary for getting<br>

>>>>> the<br>

>>>>> IC targets and having them displayed by the llvm-profdata.<br>

>>>> Ok, I'm really not convinced that the instrumentation code needs to be<br>

>>>> or should be an intrinsic.  This seems like something which should be<br>

>>>> emitted by the frontend and optimized like any other code.  To say this<br>

>>>> a different way, my instrumentation is going to be entirely different<br>

>>>> than your instrumentation.<br>

>>>><br>

>>>> Having said that, I really don't care about this part of the proposed<br>

>>>> changes since they aren't going to impact me at all.  I'm am<br>

>>>> specifically not objecting to the changes, just commenting.  :)<br>

>>>>>> I'm assuming this will be some type of per call site metadata?<br>

>>>>> We do assign metadata at the indirect call sites. Format looks like as<br>

>>>>> follows:<br>

>>>>><br>

>>>>> !33 = metadata !{metadata !"indirect_call_targets", i64<br>

>>>>> <total_exec_count>, metadata !"target_fn1â€ , i64 <target_fn1_count>,<br>

>>>>> metadata !"target_fn2â€ , i64 <target_fn2_count>, â€¦.}<br>

>>>>><br>

>>>>> Currently, we're recording only the top most called five function<br>

>>>>> names<br>

>>>>> at each indirect call site. Following the string literal<br>

>>>>> â€œindirect_call_targetsâ€  are the fields  <total_exec_count> i.e. a<br>

>>>>> 64<br>

>>>>> bit value for the total number of times the indirect call is executed<br>

>>>>> followed by the function names and execution counts of each target.<br>

>>>> This was the part I was trying to ask about.  I really want to see<br>

>>>> where<br>

>>>> you're going with this optimization wise.  My naive guess is that this<br>

>>>> is going to be slightly off for what you actually want.<br>

>>>><br>

>>>> Assuming you're going for profile guided devirtualization (and thus<br>

>>>> inlining), being able to check the type of the receiver (as opposed to<br>

>>>> the result of the virtual lookup) might be advantageous.  (Or, to say<br>

>>>> it<br>

>>>> differently, that's what I'm used to seeing.  Your approach might be<br>

>>>> completely reasonable, it's just not what I'm used to seeing.)  Have<br>

>>>> you<br>

>>>> thought about the tradeoffs here?<br>

>>> Not sure if I understood the problem here,<br>

>> First, I am not trying to say there is a problem with your approach; I<br>

>> am only saying that it's not what I would have expected based on past<br>

>> experience.  You may be entirely correct in your approach, you just need<br>

>> to convince me of that.  :)<br>

>>> however, we're recording both<br>

>>> the target address and the addresses/names of the instrumented functions<br>

>>> during the execution of the instrumented binary. During profile reading<br>

>>> these addresses are used to match the target addresses to corresponding<br>

>>> functions.<br>

>> Ok, let's start from the basics.  For profile guided devirtualization,<br>

>> you're constructing a cache from (something) to function pointer and<br>

>> using that cache lookup to enable inlining of the hot target.  You have<br>

>> two standard choices on what to use as your cache key: the result of the<br>

>> virtual lookup and the inputs to the virtual lookup.<br>

>><br>

>> Option 1 - Inputs to virtual lookup<br>

>> if ((receiver, vtable index) == what I predicted)<br>

>>   tartget_I_predicted(); // inline me!!<br>

>> else {<br>

>>   target = full virtual dispatch();<br>

>>   target();<br>

>> }<br>

>><br>

>> Option 2 - result of virtual lookup<br>

>> target = full virtual dispatch();<br>

>> if ('target' == what I predicted)<br>

>>   tartget_I_predicted(); // inline me!!<br>

>> else {<br>

>>   target();<br>

>> }<br>

>><br>

>> You seem to be proposing option 2.  I'm saying that I'm used to seeing<br>

>> option 1 used.  Both approaches have their appeal, I'm just asking you<br>

>> to explain *why* you've chosen the one you apparently have.<br>

><br>

> Not all indirect calls occur from C++ like codes. We're profiling and<br>

> optimizing out indirect calls from C codes as well. We're seeing up to 8%<br>

> gains on individual benchmarks in spec. This was measured on our platform.<br>

<br>

</div></div>We could also consider a hybrid that uses option 1 for vtable calls and option 2 for general function pointer calls. For cases where the code calls several virtual functions on the same object, profiling for option 1 could be more efficient if we only record the type of the object once. I have no idea if that is worthwhile but it’s another possibility.<br></blockquote><div><br></div><div>option 1 certainly has lower instrumentation overhead as well as more efficient value profiling transformations but I consider it a separate feature from function target profiling which can be more widely used.  Options can be introduced to pick the flavor for C++ programs.</div><div><br></div><div>David </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

SPEC results are interesting, but I’d be much more interested to hear how it works for clang. If you build clang with this profiling enabled, what is the time and memory overhead? How much bigger are the profile data files? If you then rebuild clang with PGO, how much does it speed things up?<br>

<div class="HOEnZb"><div class="h5"><br>

<br>

_______________________________________________<br>

llvm-commits mailing list<br>

<a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br>

</div></div></blockquote></div><br></div></div>