[PATCH] Indirect call target profiling related profile reader/writer changes
betulb at codeaurora.org
betulb at codeaurora.org
Thu Apr 23 10:34:09 PDT 2015
>
>> On Apr 14, 2015, at 11:56 AM, betulb at codeaurora.org wrote:
>>
>>>
>>> On 04/10/2015 09:25 AM, betulb at codeaurora.org wrote:
>>>>> On 04/09/2015 11:06 AM, Betul Buyukkurt wrote:
>>>>>> In http://reviews.llvm.org/D8908#153838, @reames wrote:
>>>>>>
>>>>>>> Have the IR level construct patches made it up for review? If so,
>>>>>>> can
>>>>>> So far I've posted two patches. These two patches should apply
>>>>>> cleanly
>>>>>> to the tip, working with the present profile infrastructure. The
>>>>>> next
>>>>>> set of patches will be the enabler ones: i.e. three more patches one
>>>>>> for
>>>>>> each of clang, llvm and compiler-rt. Clang patch will be up for
>>>>>> review
>>>>>> later today.
>>>>>>
>>>>>>> you send me a link? I managed to miss them.
>>>>>> So far there is this patch and the instrinsic instruction
>>>>>> definitions:
>>>>>> http://reviews.llvm.org/D8877. All patches are necessary for getting
>>>>>> the
>>>>>> IC targets and having them displayed by the llvm-profdata.
>>>>> Ok, I'm really not convinced that the instrumentation code needs to
>>>>> be
>>>>> or should be an intrinsic. This seems like something which should be
>>>>> emitted by the frontend and optimized like any other code. To say
>>>>> this
>>>>> a different way, my instrumentation is going to be entirely different
>>>>> than your instrumentation.
>>>>>
>>>>> Having said that, I really don't care about this part of the proposed
>>>>> changes since they aren't going to impact me at all. I'm am
>>>>> specifically not objecting to the changes, just commenting. :)
>>>>>>> I'm assuming this will be some type of per call site metadata?
>>>>>> We do assign metadata at the indirect call sites. Format looks like
>>>>>> as
>>>>>> follows:
>>>>>>
>>>>>> !33 = metadata !{metadata !"indirect_call_targets", i64
>>>>>> <total_exec_count>, metadata !"target_fn1âÂÂ, i64
>>>>>> <target_fn1_count>,
>>>>>> metadata !"target_fn2âÂÂ, i64 <target_fn2_count>, â¦.}
>>>>>>
>>>>>> Currently, we're recording only the top most called five function
>>>>>> names
>>>>>> at each indirect call site. Following the string literal
>>>>>> âÂÂindirect_call_targetsâ are the fields <total_exec_count>
>>>>>> i.e. a
>>>>>> 64
>>>>>> bit value for the total number of times the indirect call is
>>>>>> executed
>>>>>> followed by the function names and execution counts of each target.
>>>>> This was the part I was trying to ask about. I really want to see
>>>>> where
>>>>> you're going with this optimization wise. My naive guess is that
>>>>> this
>>>>> is going to be slightly off for what you actually want.
>>>>>
>>>>> Assuming you're going for profile guided devirtualization (and thus
>>>>> inlining), being able to check the type of the receiver (as opposed
>>>>> to
>>>>> the result of the virtual lookup) might be advantageous. (Or, to say
>>>>> it
>>>>> differently, that's what I'm used to seeing. Your approach might be
>>>>> completely reasonable, it's just not what I'm used to seeing.) Have
>>>>> you
>>>>> thought about the tradeoffs here?
>>>> Not sure if I understood the problem here,
>>> First, I am not trying to say there is a problem with your approach; I
>>> am only saying that it's not what I would have expected based on past
>>> experience. You may be entirely correct in your approach, you just
>>> need
>>> to convince me of that. :)
>>>> however, we're recording both
>>>> the target address and the addresses/names of the instrumented
>>>> functions
>>>> during the execution of the instrumented binary. During profile
>>>> reading
>>>> these addresses are used to match the target addresses to
>>>> corresponding
>>>> functions.
>>> Ok, let's start from the basics. For profile guided devirtualization,
>>> you're constructing a cache from (something) to function pointer and
>>> using that cache lookup to enable inlining of the hot target. You have
>>> two standard choices on what to use as your cache key: the result of
>>> the
>>> virtual lookup and the inputs to the virtual lookup.
>>>
>>> Option 1 - Inputs to virtual lookup
>>> if ((receiver, vtable index) == what I predicted)
>>> tartget_I_predicted(); // inline me!!
>>> else {
>>> target = full virtual dispatch();
>>> target();
>>> }
>>>
>>> Option 2 - result of virtual lookup
>>> target = full virtual dispatch();
>>> if ('target' == what I predicted)
>>> tartget_I_predicted(); // inline me!!
>>> else {
>>> target();
>>> }
>>>
>>> You seem to be proposing option 2. I'm saying that I'm used to seeing
>>> option 1 used. Both approaches have their appeal, I'm just asking you
>>> to explain *why* you've chosen the one you apparently have.
>>
>> Not all indirect calls occur from C++ like codes. We're profiling and
>> optimizing out indirect calls from C codes as well. We're seeing up to
>> 8%
>> gains on individual benchmarks in spec. This was measured on our
>> platform.
>
> We could also consider a hybrid that uses option 1 for vtable calls and
> option 2 for general function pointer calls. For cases where the code
> calls several virtual functions on the same object, profiling for option 1
> could be more efficient if we only record the type of the object once. I
> have no idea if that is worthwhile but itâs another possibility.
>
> SPEC results are interesting, but Iâd be much more interested to hear
> how it works for clang. If you build clang with this profiling enabled,
> what is the time and memory overhead? How much bigger are the profile data
> files?
We've instrumented clang using clang-tip. We collected profile data from
clang using one of the spec benchmarks under -O3. The benchmark was
composed of 16 files.
Total size of raw profile files:
Original IC-profiling Increase
4031666760 5063229256 %25.6
Total size of merged profile file:
Original IC-profiling Increase
65681320 65973768 %.44
Average of three runs(using time):
Original IC-profiling Increase
47.38 55.73 %17.6
The above numbers are collected from compiling the whole benchmark. We've
the IC profile data collected from clang. If there is interest we can
share the data w/ the community.
-Betul
> If you then rebuild clang with PGO, how much does it speed things
> up?
>
>
More information about the llvm-commits
mailing list