[PATCH] Indirect call target profiling related profile reader/writer changes

Xinliang David Li xinliangli at gmail.com
Thu Apr 23 17:20:27 PDT 2015


On Thu, Apr 23, 2015 at 3:56 PM, Bob Wilson <bob.wilson at apple.com> wrote:

>
> > On Apr 23, 2015, at 10:34 AM, betulb at codeaurora.org wrote:
> >
> >>
> >>> On Apr 14, 2015, at 11:56 AM, betulb at codeaurora.org wrote:
> >>>
> >>>>
> >>>> On 04/10/2015 09:25 AM, betulb at codeaurora.org wrote:
> >>>>>> On 04/09/2015 11:06 AM, Betul Buyukkurt wrote:
> >>>>>>> In http://reviews.llvm.org/D8908#153838, @reames wrote:
> >>>>>>>
> >>>>>>>> Have the IR level construct patches made it up for review?  If so,
> >>>>>>>> can
> >>>>>>> So far I've posted two patches. These two patches should apply
> >>>>>>> cleanly
> >>>>>>> to the tip, working with the present profile infrastructure. The
> >>>>>>> next
> >>>>>>> set of patches will be the enabler ones: i.e. three more patches,
> >>>>>>> one for
> >>>>>>> each of clang, llvm and compiler-rt. Clang patch will be up for
> >>>>>>> review
> >>>>>>> later today.
> >>>>>>>
> >>>>>>>> you send me a link?  I managed to miss them.
> >>>>>>> So far there is this patch and the intrinsic instruction
> >>>>>>> definitions:
> >>>>>>> http://reviews.llvm.org/D8877. All patches are necessary for getting
> >>>>>>> the IC targets and having them displayed by llvm-profdata.
> >>>>>> Ok, I'm really not convinced that the instrumentation code needs to
> >>>>>> be
> >>>>>> or should be an intrinsic.  This seems like something which should be
> >>>>>> emitted by the frontend and optimized like any other code.  To say
> >>>>>> this
> >>>>>> a different way, my instrumentation is going to be entirely different
> >>>>>> than your instrumentation.
> >>>>>>
> >>>>>> Having said that, I really don't care about this part of the proposed
> >>>>>> changes since they aren't going to impact me at all.  I am
> >>>>>> specifically not objecting to the changes, just commenting.  :)
> >>>>>>>> I'm assuming this will be some type of per call site metadata?
> >>>>>>> We do assign metadata at the indirect call sites. The format looks
> >>>>>>> as follows:
> >>>>>>>
> >>>>>>> !33 = metadata !{metadata !"indirect_call_targets",
> >>>>>>>                  i64 <total_exec_count>,
> >>>>>>>                  metadata !"target_fn1", i64 <target_fn1_count>,
> >>>>>>>                  metadata !"target_fn2", i64 <target_fn2_count>, ...}
> >>>>>>>
> >>>>>>> Currently, we're recording only the five most frequently called
> >>>>>>> target function names at each indirect call site. Following the
> >>>>>>> string literal "indirect_call_targets" is the field
> >>>>>>> <total_exec_count>, i.e. a 64-bit value for the total number of
> >>>>>>> times the indirect call is executed, followed by the function names
> >>>>>>> and execution counts of each target.
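
(For concreteness, per-call-site recording of the five hottest targets could
look roughly like the following sketch; the struct and names are hypothetical,
not the actual compiler-rt runtime.)

#include <cstdint>

constexpr int kNumTrackedTargets = 5;        // top five targets, as described above

struct CallSiteProfile {
  uint64_t TotalCount = 0;                   // <total_exec_count>
  void *Targets[kNumTrackedTargets] = {};    // observed target addresses
  uint64_t Counts[kNumTrackedTargets] = {};  // per-target execution counts
};

// Conceptually invoked before each instrumented indirect call.
inline void recordTarget(CallSiteProfile &P, void *Target) {
  ++P.TotalCount;
  for (int I = 0; I < kNumTrackedTargets; ++I) {
    if (P.Targets[I] == Target) { ++P.Counts[I]; return; }
    if (P.Targets[I] == nullptr) {           // unused slot: remember a new target
      P.Targets[I] = Target;
      P.Counts[I] = 1;
      return;
    }
  }
  // Table full: a real runtime would evict a cold entry or spill elsewhere.
}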
> >>>>>> This was the part I was trying to ask about.  I really want to see
> >>>>>> where
> >>>>>> you're going with this optimization-wise.  My naive guess is that
> >>>>>> this
> >>>>>> is going to be slightly off for what you actually want.
> >>>>>>
> >>>>>> Assuming you're going for profile guided devirtualization (and thus
> >>>>>> inlining), being able to check the type of the receiver (as opposed
> >>>>>> to
> >>>>>> the result of the virtual lookup) might be advantageous.  (Or, to say
> >>>>>> it
> >>>>>> differently, that's what I'm used to seeing.  Your approach might be
> >>>>>> completely reasonable, it's just not what I'm used to seeing.)  Have
> >>>>>> you
> >>>>>> thought about the tradeoffs here?
> >>>>> Not sure if I understood the problem here,
> >>>> First, I am not trying to say there is a problem with your approach; I
> >>>> am only saying that it's not what I would have expected based on past
> >>>> experience.  You may be entirely correct in your approach, you just
> >>>> need
> >>>> to convince me of that.  :)
> >>>>> however, we're recording both
> >>>>> the target address and the addresses/names of the instrumented
> >>>>> functions
> >>>>> during the execution of the instrumented binary. During profile
> >>>>> reading
> >>>>> these addresses are used to match the target addresses to
> >>>>> corresponding
> >>>>> functions.
> >>>> Ok, let's start from the basics.  For profile guided devirtualization,
> >>>> you're constructing a cache from (something) to function pointer and
> >>>> using that cache lookup to enable inlining of the hot target.  You have
> >>>> two standard choices on what to use as your cache key: the result of
> >>>> the
> >>>> virtual lookup and the inputs to the virtual lookup.
> >>>>
> >>>> Option 1 - Inputs to virtual lookup
> >>>> if ((receiver, vtable index) == what I predicted)
> >>>>  target_I_predicted(); // inline me!!
> >>>> else {
> >>>>  target = full virtual dispatch();
> >>>>  target();
> >>>> }
> >>>>
> >>>> Option 2 - result of virtual lookup
> >>>> target = full virtual dispatch();
> >>>> if ('target' == what I predicted)
> >>>>  target_I_predicted(); // inline me!!
> >>>> else {
> >>>>  target();
> >>>> }
> >>>>
> >>>> You seem to be proposing option 2.  I'm saying that I'm used to seeing
> >>>> option 1 used.  Both approaches have their appeal, I'm just asking you
> >>>> to explain *why* you've chosen the one you apparently have.
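
(A rough C++ analogue of option 1, using a hypothetical class hierarchy; the
guard keys on the receiver's dynamic type rather than on the looked-up
function pointer:)

#include <typeinfo>

struct Shape {
  virtual double area() const = 0;
  virtual ~Shape() = default;
};
struct Circle : Shape {
  double r = 1.0;
  double area() const override { return 3.14159 * r * r; }
};

double getArea(const Shape &S) {
  // Guard on the input to the virtual lookup (the receiver's dynamic type)
  // and call the predicted target directly so the inliner can see it.
  if (typeid(S) == typeid(Circle))
    return static_cast<const Circle &>(S).Circle::area();  // devirtualized
  return S.area();                                          // full virtual dispatch
}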
> >>>
> >>> Not all indirect calls occur in C++-like code. We're profiling and
> >>> optimizing out indirect calls in C code as well. We're seeing up to 8%
> >>> gains on individual benchmarks in SPEC. This was measured on our
> >>> platform.
> >>
> >> We could also consider a hybrid that uses option 1 for vtable calls and
> >> option 2 for general function pointer calls. For cases where the code
> >> calls several virtual functions on the same object, profiling for option 1
> >> could be more efficient if we only record the type of the object once. I
> >> have no idea if that is worthwhile but it’s another possibility.
> >>
> >> SPEC results are interesting, but I’d be much more interested to hear
> >> how it works for clang. If you build clang with this profiling enabled,
> >> what is the time and memory overhead? How much bigger are the profile
> >> data files?
> >
> > We've instrumented clang at tip. We collected profile data from clang by
> > compiling one of the SPEC benchmarks under -O3. The benchmark was
> > composed of 16 files.
> >
> > Total size of raw profile files:
> >       Original         IC-profiling      Increase
> >       4031666760       5063229256        25.6%
> >
> > Total size of merged profile file:
> >       Original         IC-profiling      Increase
> >       65681320         65973768          0.44%
>
> Do you know why the raw files increase by 25% but the merged file doesn’t
> increase at all? That is surprising to me.
>
> >
> > Average of three runs (using time):
> >       Original         IC-profiling      Increase
> >       47.38            55.73             17.6%
>
> What about the “plain” profiling? The relevant number is how much the
> indirect call info improves performance beyond what you get with just the
> execution counts.
>

Clang, unfortunately, does not benefit from the indirect call promotion
transformation in the testing I have done (using GCC's PGO). Of the top 10
hottest indirect call sites, the second hottest has 39 targets; among the rest
there are also sites with 15, 29, and 9 targets and no clearly dominating one.
The hottest call site has only one target, but that target is a wrapper around
another large function which is probably not a candidate for inlining.

Though it does not help clang, IC promotion does help real-world apps other
than SPEC. We have seen improvements in the range of 2-4%.
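
The promoted form of a hot call site looks roughly like the sketch below
(names are hypothetical); the guard only pays off when one target clearly
dominates, which is why the many-target clang call sites above do not benefit.

typedef void (*handler_t)(int);

// Hypothetical hottest target identified by the profile.
void hot_target(int) { /* hot path */ }

void dispatch(handler_t h, int arg) {
  if (h == &hot_target)
    hot_target(arg);      // direct call: now visible to the inliner
  else
    h(arg);               // cold targets keep the original indirect call
}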

David



>
> >
> > The above numbers were collected from compiling the whole benchmark. We
> > have the IC profile data collected from clang. If there is interest, we
> > can share the data with the community.
> >
> > -Betul
> >
> >> If you then rebuild clang with PGO, how much does it speed things
> >> up?
> >>
> >>
> >
> >
>
>


More information about the llvm-commits mailing list