[PATCH] Indirect call target profiling related profile reader/writer changes

Thu Apr 23 15:56:19 PDT 2015

> On Apr 23, 2015, at 10:34 AM, betulb at codeaurora.org wrote:
> 
>> 
>>> On Apr 14, 2015, at 11:56 AM, betulb at codeaurora.org wrote:
>>> 
>>>> 
>>>> On 04/10/2015 09:25 AM, betulb at codeaurora.org wrote:
>>>>>> On 04/09/2015 11:06 AM, Betul Buyukkurt wrote:
>>>>>>> In http://reviews.llvm.org/D8908#153838, @reames wrote:
>>>>>>> 
>>>>>>>> Have the IR level construct patches made it up for review?  If so,
>>>>>>>> can
>>>>>>> So far I've posted two patches. These two patches should apply
>>>>>>> cleanly
>>>>>>> to the tip, working with the present profile infrastructure. The
>>>>>>> next
>>>>>>> set of patches will be the enabler ones: i.e. three more patches one
>>>>>>> for
>>>>>>> each of clang, llvm and compiler-rt. Clang patch will be up for
>>>>>>> review
>>>>>>> later today.
>>>>>>> 
>>>>>>>> you send me a link?  I managed to miss them.
>>>>>>> So far there is this patch and the instrinsic instruction
>>>>>>> definitions:
>>>>>>> http://reviews.llvm.org/D8877. All patches are necessary for getting
>>>>>>> the
>>>>>>> IC targets and having them displayed by the llvm-profdata.
>>>>>> Ok, I'm really not convinced that the instrumentation code needs to
>>>>>> be
>>>>>> or should be an intrinsic.  This seems like something which should be
>>>>>> emitted by the frontend and optimized like any other code.  To say
>>>>>> this
>>>>>> a different way, my instrumentation is going to be entirely different
>>>>>> than your instrumentation.
>>>>>> 
>>>>>> Having said that, I really don't care about this part of the proposed
>>>>>> changes since they aren't going to impact me at all.  I'm am
>>>>>> specifically not objecting to the changes, just commenting.  :)
>>>>>>>> I'm assuming this will be some type of per call site metadata?
>>>>>>> We do assign metadata at the indirect call sites. Format looks like
>>>>>>> as
>>>>>>> follows:
>>>>>>> 
>>>>>>> !33 = metadata !{metadata !"indirect_call_targets", i64
>>>>>>> <total_exec_count>, metadata !"target_fn1Ã¢ÂÂ, i64
>>>>>>> <target_fn1_count>,
>>>>>>> metadata !"target_fn2Ã¢ÂÂ, i64 <target_fn2_count>, Ã¢ÂÂ¦.}
>>>>>>> 
>>>>>>> Currently, we're recording only the top most called five function
>>>>>>> names
>>>>>>> at each indirect call site. Following the string literal
>>>>>>> Ã¢ÂÂindirect_call_targetsÃ¢ÂÂ are the fields  <total_exec_count>
>>>>>>> i.e. a
>>>>>>> 64
>>>>>>> bit value for the total number of times the indirect call is
>>>>>>> executed
>>>>>>> followed by the function names and execution counts of each target.
>>>>>> This was the part I was trying to ask about.  I really want to see
>>>>>> where
>>>>>> you're going with this optimization wise.  My naive guess is that
>>>>>> this
>>>>>> is going to be slightly off for what you actually want.
>>>>>> 
>>>>>> Assuming you're going for profile guided devirtualization (and thus
>>>>>> inlining), being able to check the type of the receiver (as opposed
>>>>>> to
>>>>>> the result of the virtual lookup) might be advantageous.  (Or, to say
>>>>>> it
>>>>>> differently, that's what I'm used to seeing.  Your approach might be
>>>>>> completely reasonable, it's just not what I'm used to seeing.)  Have
>>>>>> you
>>>>>> thought about the tradeoffs here?
>>>>> Not sure if I understood the problem here,
>>>> First, I am not trying to say there is a problem with your approach; I
>>>> am only saying that it's not what I would have expected based on past
>>>> experience.  You may be entirely correct in your approach, you just
>>>> need
>>>> to convince me of that.  :)
>>>>> however, we're recording both
>>>>> the target address and the addresses/names of the instrumented
>>>>> functions
>>>>> during the execution of the instrumented binary. During profile
>>>>> reading
>>>>> these addresses are used to match the target addresses to
>>>>> corresponding
>>>>> functions.
>>>> Ok, let's start from the basics.  For profile guided devirtualization,
>>>> you're constructing a cache from (something) to function pointer and
>>>> using that cache lookup to enable inlining of the hot target.  You have
>>>> two standard choices on what to use as your cache key: the result of
>>>> the
>>>> virtual lookup and the inputs to the virtual lookup.
>>>> 
>>>> Option 1 - Inputs to virtual lookup
>>>> if ((receiver, vtable index) == what I predicted)
>>>>  tartget_I_predicted(); // inline me!!
>>>> else {
>>>>  target = full virtual dispatch();
>>>>  target();
>>>> }
>>>> 
>>>> Option 2 - result of virtual lookup
>>>> target = full virtual dispatch();
>>>> if ('target' == what I predicted)
>>>>  tartget_I_predicted(); // inline me!!
>>>> else {
>>>>  target();
>>>> }
>>>> 
>>>> You seem to be proposing option 2.  I'm saying that I'm used to seeing
>>>> option 1 used.  Both approaches have their appeal, I'm just asking you
>>>> to explain *why* you've chosen the one you apparently have.
>>> 
>>> Not all indirect calls occur from C++ like codes. We're profiling and
>>> optimizing out indirect calls from C codes as well. We're seeing up to
>>> 8%
>>> gains on individual benchmarks in spec. This was measured on our
>>> platform.
>> 
>> We could also consider a hybrid that uses option 1 for vtable calls and
>> option 2 for general function pointer calls. For cases where the code
>> calls several virtual functions on the same object, profiling for option 1
>> could be more efficient if we only record the type of the object once. I
>> have no idea if that is worthwhile but itâs another possibility.
>> 
>> SPEC results are interesting, but Iâd be much more interested to hear
>> how it works for clang. If you build clang with this profiling enabled,
>> what is the time and memory overhead? How much bigger are the profile data
>> files?
> 
> We've instrumented clang using clang-tip. We collected profile data from
> clang using one of the spec benchmarks under -O3. The benchmark was
> composed of 16 files.
> 
> Total size of raw profile files:
>       	Original       IC-profiling        Increase
>       4031666760       5063229256           %25.6
> 
> Total size of merged profile file:
>       Original       IC-profiling        Increase
>       65681320         65973768            %.44

Do you know why the raw files increase by 25% but the merged file doesn’t increase at all? That is surprising to me.

> 
> Average of three runs(using time):
>       Original       IC-profiling        Increase
>        47.38           55.73              %17.6

What about the “plain” profiling? The relevant number is how much the indirect call info improves performance beyond what you get with just the execution counts.

> 
> The above numbers are collected from compiling the whole benchmark. We've
> the IC profile data collected from clang. If there is interest we can
> share the data w/ the community.
> 
> -Betul
> 
>> If you then rebuild clang with PGO, how much does it speed things
>> up?
>> 
>> 
> 
>