[PATCH] Indirect call target profiling related profile reader/writer changes

betulb at codeaurora.org betulb at codeaurora.org
Thu Apr 23 16:22:57 PDT 2015


>
>> On Apr 23, 2015, at 10:34 AM, betulb at codeaurora.org wrote:
>>
>>>
>>>> On Apr 14, 2015, at 11:56 AM, betulb at codeaurora.org wrote:
>>>>
>>>>>
>>>>> On 04/10/2015 09:25 AM, betulb at codeaurora.org wrote:
>>>>>>> On 04/09/2015 11:06 AM, Betul Buyukkurt wrote:
>>>>>>>> In http://reviews.llvm.org/D8908#153838, @reames wrote:
>>>>>>>>
>>>>>>>>> Have the IR level construct patches made it up for review?  If
>>>>>>>>> so,
>>>>>>>>> can
>>>>>>>> So far I've posted two patches. These two patches should apply
>>>>>>>> cleanly to the tip and work with the present profile
>>>>>>>> infrastructure. The next set of patches will be the enabling
>>>>>>>> ones: three more patches, one each for clang, llvm and
>>>>>>>> compiler-rt. The clang patch will be up for review later today.
>>>>>>>>
>>>>>>>>> you send me a link?  I managed to miss them.
>>>>>>>> So far there is this patch and the intrinsic instruction
>>>>>>>> definitions: http://reviews.llvm.org/D8877. All patches are
>>>>>>>> necessary for getting the IC targets and having them displayed
>>>>>>>> by llvm-profdata.
>>>>>>> Ok, I'm really not convinced that the instrumentation code needs to
>>>>>>> be
>>>>>>> or should be an intrinsic.  This seems like something which should
>>>>>>> be
>>>>>>> emitted by the frontend and optimized like any other code.  To say
>>>>>>> this
>>>>>>> a different way, my instrumentation is going to be entirely
>>>>>>> different
>>>>>>> than your instrumentation.
>>>>>>>
>>>>>>> Having said that, I really don't care about this part of the
>>>>>>> proposed changes since they aren't going to impact me at all.  I am
>>>>>>> specifically not objecting to the changes, just commenting.  :)
>>>>>>>>> I'm assuming this will be some type of per call site metadata?
>>>>>>>> We do assign metadata at the indirect call sites. The format
>>>>>>>> looks as follows:
>>>>>>>>
>>>>>>>> !33 = metadata !{metadata !"indirect_call_targets",
>>>>>>>>                  i64 <total_exec_count>,
>>>>>>>>                  metadata !"target_fn1", i64 <target_fn1_count>,
>>>>>>>>                  metadata !"target_fn2", i64 <target_fn2_count>,
>>>>>>>>                  ...}
>>>>>>>>
>>>>>>>> Currently, we're recording only the five most frequently called
>>>>>>>> target function names at each indirect call site. Following the
>>>>>>>> string literal "indirect_call_targets" comes <total_exec_count>,
>>>>>>>> a 64-bit value holding the total number of times the indirect
>>>>>>>> call is executed, followed by the function names and execution
>>>>>>>> counts of each target.
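[For illustration, here is a rough sketch of how a later pass might walk
such a node. This is not part of the posted patches; it assumes the current
in-tree C++ metadata accessors, and the helper name is made up.]

  #include "llvm/ADT/StringRef.h"
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/Metadata.h"
  #include <cstdint>
  using namespace llvm;

  // Operand 0 is the "indirect_call_targets" tag, operand 1 the total
  // execution count; (target name, target count) pairs follow.
  static void walkICTargets(CallInst &CI) {
    MDNode *MD = CI.getMetadata("indirect_call_targets");
    if (!MD || MD->getNumOperands() < 2)
      return;
    uint64_t Total =
        mdconst::extract<ConstantInt>(MD->getOperand(1))->getZExtValue();
    for (unsigned I = 2, E = MD->getNumOperands(); I + 1 < E; I += 2) {
      StringRef Name = cast<MDString>(MD->getOperand(I))->getString();
      uint64_t Count =
          mdconst::extract<ConstantInt>(MD->getOperand(I + 1))->getZExtValue();
      // A target whose Count dominates Total is a promotion candidate.
      (void)Name; (void)Count; (void)Total;
    }
  }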
>>>>>>> This was the part I was trying to ask about.  I really want to see
>>>>>>> where
>>>>>>> you're going with this optimization wise.  My naive guess is that
>>>>>>> this
>>>>>>> is going to be slightly off for what you actually want.
>>>>>>>
>>>>>>> Assuming you're going for profile guided devirtualization (and thus
>>>>>>> inlining), being able to check the type of the receiver (as opposed
>>>>>>> to
>>>>>>> the result of the virtual lookup) might be advantageous.  (Or, to
>>>>>>> say
>>>>>>> it
>>>>>>> differently, that's what I'm used to seeing.  Your approach might
>>>>>>> be
>>>>>>> completely reasonable, it's just not what I'm used to seeing.)
>>>>>>> Have
>>>>>>> you
>>>>>>> thought about the tradeoffs here?
>>>>>> Not sure if I understood the problem here,
>>>>> First, I am not trying to say there is a problem with your approach;
>>>>> I
>>>>> am only saying that it's not what I would have expected based on past
>>>>> experience.  You may be entirely correct in your approach, you just
>>>>> need
>>>>> to convince me of that.  :)
>>>>>> however, we're recording both
>>>>>> the target address and the addresses/names of the instrumented
>>>>>> functions
>>>>>> during the execution of the instrumented binary. During profile
>>>>>> reading
>>>>>> these addresses are used to match the target addresses to
>>>>>> corresponding
>>>>>> functions.
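[As an aside, the matching step described above amounts to a simple
address-to-name lookup; a minimal illustration with made-up names:]

  #include <cstdint>
  #include <map>
  #include <string>

  // Resolve a recorded call-target address back to a function name, given
  // the (address -> name) table built from the instrumented functions.
  std::string resolveTarget(const std::map<uint64_t, std::string> &AddrToName,
                            uint64_t TargetAddr) {
    auto It = AddrToName.find(TargetAddr);
    return It != AddrToName.end() ? It->second : "<unknown target>";
  }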
>>>>> Ok, let's start from the basics.  For profile guided
>>>>> devirtualization,
>>>>> you're constructing a cache from (something) to function pointer and
>>>>> using that cache lookup to enable inlining of the hot target.  You
>>>>> have
>>>>> two standard choices on what to use as your cache key: the result of
>>>>> the
>>>>> virtual lookup and the inputs to the virtual lookup.
>>>>>
>>>>> Option 1 - Inputs to virtual lookup
>>>>> if ((receiver, vtable index) == what I predicted)
>>>>>  target_I_predicted(); // inline me!!
>>>>> else {
>>>>>  target = full virtual dispatch();
>>>>>  target();
>>>>> }
>>>>>
>>>>> Option 2 - result of virtual lookup
>>>>> target = full virtual dispatch();
>>>>> if ('target' == what I predicted)
>>>>>  target_I_predicted(); // inline me!!
>>>>> else {
>>>>>  target();
>>>>> }
>>>>>
>>>>> You seem to be proposing option 2.  I'm saying that I'm used to
>>>>> seeing
>>>>> option 1 used.  Both approaches have their appeal, I'm just asking
>>>>> you
>>>>> to explain *why* you've chosen the one you apparently have.
>>>>
>>>> Not all indirect calls occur in C++-like code. We're profiling and
>>>> optimizing indirect calls from C code as well. We're seeing up to 8%
>>>> gains on individual SPEC benchmarks, measured on our platform.
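[For a plain C function-pointer call, the promoted form being aimed at would
look roughly like the sketch below; the names are made up for illustration.]

  typedef int (*handler_t)(int);

  static int hot_handler(int x) { return x + 1; }   // stand-in hot target

  static int dispatch(handler_t fp, int arg) {
    if (fp == &hot_handler)
      return hot_handler(arg);   // direct call; the inliner can now see it
    return fp(arg);              // cold targets keep the indirect call
  }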
>>>
>>> We could also consider a hybrid that uses option 1 for vtable calls and
>>> option 2 for general function pointer calls. For cases where the code
>>> calls several virtual functions on the same object, profiling for
>>> option 1
>>> could be more efficient if we only record the type of the object once.
>>> I
>>> have no idea if that is worthwhile but it’s another possibility.
>>>
>>> SPEC results are interesting, but I’d be much more interested to
>>> hear
>>> how it works for clang. If you build clang with this profiling enabled,
>>> what is the time and memory overhead? How much bigger are the profile
>>> data
>>> files?
>>
>> We've instrumented clang built from tip of tree. We collected profile
>> data from clang compiling one of the SPEC benchmarks at -O3. The
>> benchmark is composed of 16 files.
>>
>> Total size of raw profile files (bytes):
>>       Original       IC-profiling        Increase
>>       4031666760     5063229256          25.6%
>>
>> Total size of merged profile file (bytes):
>>       Original       IC-profiling        Increase
>>       65681320       65973768            0.44%
>
> Do you know why the raw files increase by 25% but the merged file
> doesn’t increase at all? That is surprising to me.

The raw file is generated by appending the raw profile data from clang,
which runs 16 times for .o generation and once for linking. It therefore
contains repeated sections for the counter and name arrays and for the
indirect call profile data entries; the merged file folds these repetitions
into a single entry per function, which is why its growth is much smaller.
We had also extended the per-function data record (stored in the raw
profile file) by adding the function pointer, the number of indirect call
sites found in the function, and a pointer that is later filled in at
runtime with the indirect call target entries. These new fields also
contribute to the size difference.
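[Roughly, the shape of the extended per-function record is as sketched
below; the field names are illustrative, not the actual compiler-rt
definitions.]

  #include <stdint.h>

  // Illustrative layout only, not the real __llvm_profile_data definition.
  struct PerFunctionData {
    const char *NamePtr;             // existing: function name
    uint32_t    NameSize;
    uint64_t   *CounterPtr;          // existing: counter array
    uint32_t    NumCounters;
    // New fields added by the IC-profiling patches:
    const void *FunctionPointer;     // address used to match call targets
    uint32_t    NumIndirectCallSites;
    void       *ICTargetData;        // filled at runtime with target entries
  };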

>
>>
>> Average of three runs, in seconds (measured with time):
>>       Original       IC-profiling        Increase
>>        47.38           55.73              17.6%
>
> What about the “plain” profiling? The relevant number is how much the
> indirect call info improves performance beyond what you get with just the
> execution counts.

Sorry if I was not clear. We have not yet provided any numbers on the
performance improvement from the IC optimizations. The data above compares
the execution time of an instrumented clang (compiling the test benchmark
at -O3) built with the current LLVM instrumentation against the execution
time of an instrumented clang built with the submitted IC profiling patches
applied.

>
>>
>> The above numbers were collected from compiling the whole benchmark. We
>> have the IC profile data collected from clang, and if there is interest
>> we can share it with the community.
>>
>> -Betul
>>
>>> If you then rebuild clang with PGO, how much does it speed things
>>> up?