[PATCH] Indirect call target profiling related profile reader/writer changes

Xinliang David Li xinliangli at gmail.com
Thu Apr 23 15:05:56 PDT 2015


I had some concerns about the profiler, so I did quite a bit of comparison
of the performance of this implementation against a couple of alternatives.

The good news is that from the data I collected so far, this implementation
looks promising and competitive (and there are ways to further reduce the
overhead).  I still have a couple more experiments to do, after which I
will share my results and explanation.

Thanks for working on this.

David


On Thu, Apr 23, 2015 at 10:34 AM, <betulb at codeaurora.org> wrote:

> >
> >> On Apr 14, 2015, at 11:56 AM, betulb at codeaurora.org wrote:
> >>
> >>>
> >>> On 04/10/2015 09:25 AM, betulb at codeaurora.org wrote:
> >>>>> On 04/09/2015 11:06 AM, Betul Buyukkurt wrote:
> >>>>>> In http://reviews.llvm.org/D8908#153838, @reames wrote:
> >>>>>>
> >>>>>>> Have the IR level construct patches made it up for review?  If so,
> >>>>>>> can
> >>>>>> So far I've posted two patches. These two patches should apply
> >>>>>> cleanly
> >>>>>> to the tip, working with the present profile infrastructure. The
> >>>>>> next
> >>>>>> set of patches will be the enabler ones, i.e. three more patches,
> >>>>>> one for each of clang, llvm and compiler-rt. The clang patch will
> >>>>>> be up for review later today.
> >>>>>>
> >>>>>>> you send me a link?  I managed to miss them.
> >>>>>> So far there is this patch and the intrinsic instruction
> >>>>>> definitions:
> >>>>>> http://reviews.llvm.org/D8877. All patches are necessary for
> >>>>>> getting the IC targets and having them displayed by llvm-profdata.
> >>>>> Ok, I'm really not convinced that the instrumentation code needs to
> >>>>> be
> >>>>> or should be an intrinsic.  This seems like something which should be
> >>>>> emitted by the frontend and optimized like any other code.  To say
> >>>>> this
> >>>>> a different way, my instrumentation is going to be entirely different
> >>>>> than your instrumentation.
> >>>>>
> >>>>> Having said that, I really don't care about this part of the proposed
> >>>>> changes since they aren't going to impact me at all.  I am
> >>>>> specifically not objecting to the changes, just commenting.  :)
> >>>>>>> I'm assuming this will be some type of per call site metadata?
> >>>>>> We do assign metadata at the indirect call sites. The format looks
> >>>>>> as follows:
> >>>>>>
> >>>>>> !33 = metadata !{metadata !"indirect_call_targets", i64
> >>>>>> <total_exec_count>, metadata !"target_fn1", i64 <target_fn1_count>,
> >>>>>> metadata !"target_fn2", i64 <target_fn2_count>, ...}
> >>>>>>
> >>>>>> Currently, we're recording only the five most frequently called
> >>>>>> function names at each indirect call site. Following the string
> >>>>>> literal "indirect_call_targets" comes <total_exec_count>, i.e. a
> >>>>>> 64-bit value for the total number of times the indirect call is
> >>>>>> executed, followed by the function names and execution counts of
> >>>>>> each target.
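> >>>>>>
> >>>>>> For illustration (the names, counts, and the !prof attachment point
> >>>>>> below are hypothetical), an annotated call site might look like:
> >>>>>>
> >>>>>>   %call = call i32 %fptr(i32 %arg), !prof !33
> >>>>>>   !33 = metadata !{metadata !"indirect_call_targets", i64 1000,
> >>>>>>                    metadata !"foo", i64 600,
> >>>>>>                    metadata !"bar", i64 400}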
> >>>>> This was the part I was trying to ask about.  I really want to see
> >>>>> where you're going with this optimization-wise.  My naive guess is
> >>>>> that this is going to be slightly off from what you actually want.
> >>>>>
> >>>>> Assuming you're going for profile guided devirtualization (and thus
> >>>>> inlining), being able to check the type of the receiver (as opposed
> >>>>> to
> >>>>> the result of the virtual lookup) might be advantageous.  (Or, to say
> >>>>> it
> >>>>> differently, that's what I'm used to seeing.  Your approach might be
> >>>>> completely reasonable, it's just not what I'm used to seeing.)  Have
> >>>>> you
> >>>>> thought about the tradeoffs here?
> >>>> Not sure if I understood the problem here,
> >>> First, I am not trying to say there is a problem with your approach; I
> >>> am only saying that it's not what I would have expected based on past
> >>> experience.  You may be entirely correct in your approach, you just
> >>> need
> >>> to convince me of that.  :)
> >>>> however, we're recording both
> >>>> the target address and the addresses/names of the instrumented
> >>>> functions
> >>>> during the execution of the instrumented binary. During profile
> >>>> reading
> >>>> these addresses are used to match the target addresses to
> >>>> corresponding
> >>>> functions.
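> >>>>
> >>>> As a minimal sketch of that matching step (hypothetical names, not
> >>>> the actual reader code):
> >>>>
> >>>>   // Map built from the instrumented binary's function records:
> >>>>   // instrumented function address -> function name.
> >>>>   std::map<uint64_t, std::string> AddrToName;
> >>>>
> >>>>   // Resolve each recorded target address at an indirect call site.
> >>>>   for (const auto &Rec : Records) {        // {TargetAddr, Count}
> >>>>     auto It = AddrToName.find(Rec.TargetAddr);
> >>>>     if (It != AddrToName.end())
> >>>>       addTarget(It->second, Rec.Count);    // hypothetical helper
> >>>>   }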
> >>> Ok, let's start from the basics.  For profile guided devirtualization,
> >>> you're constructing a cache from (something) to function pointer and
> >>> using that cache lookup to enable inlining of the hot target.  You have
> >>> two standard choices on what to use as your cache key: the result of
> >>> the
> >>> virtual lookup and the inputs to the virtual lookup.
> >>>
> >>> Option 1 - Inputs to virtual lookup
> >>> if ((receiver, vtable index) == what I predicted)
> >>>   target_I_predicted(); // inline me!!
> >>> else {
> >>>   target = full virtual dispatch();
> >>>   target();
> >>> }
> >>>
> >>> Option 2 - result of virtual lookup
> >>> target = full virtual dispatch();
> >>> if ('target' == what I predicted)
> >>>   target_I_predicted(); // inline me!!
> >>> else {
> >>>   target();
> >>> }
> >>>
> >>> You seem to be proposing option 2.  I'm saying that I'm used to seeing
> >>> option 1 used.  Both approaches have their appeal, I'm just asking you
> >>> to explain *why* you've chosen the one you apparently have.
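> >>>
> >>> Concretely, option 1 is usually realized by comparing the receiver's
> >>> vtable pointer. A minimal sketch, assuming Itanium-style vtables and
> >>> made-up names:
> >>>
> >>>   if (*reinterpret_cast<void **>(obj) == predicted_vtable)
> >>>     static_cast<Predicted *>(obj)->Predicted::foo(); // direct, inlinable
> >>>   else
> >>>     obj->foo();                                      // full dispatch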
> >>
> >> Not all indirect calls occur in C++-like code. We're profiling and
> >> optimizing indirect calls in C code as well. We're seeing up to 8%
> >> gains on individual benchmarks in SPEC, measured on our platform.
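> >>
> >> As a minimal sketch of the plain-C case (made-up names), option 2 style
> >> promotion of a function pointer call looks like:
> >>
> >>   /* before: */
> >>   result = fptr(x);
> >>
> >>   /* after, when the profile says hot_target dominates: */
> >>   if (fptr == &hot_target)
> >>     result = hot_target(x);  /* direct call, now inlinable */
> >>   else
> >>     result = fptr(x);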
> >
> > We could also consider a hybrid that uses option 1 for vtable calls and
> > option 2 for general function pointer calls. For cases where the code
> > calls several virtual functions on the same object, profiling for
> > option 1 could be more efficient if we only record the type of the
> > object once. I have no idea if that is worthwhile but it’s another
> > possibility.
> >
> > SPEC results are interesting, but I’d be much more interested to hear
> > how it works for clang. If you build clang with this profiling enabled,
> > what is the time and memory overhead? How much bigger are the profile
> > data files?
>
> We've instrumented clang at tip of trunk. We collected profile data by
> compiling one of the SPEC benchmarks under -O3. The benchmark was
> composed of 16 files.
>
> Total size of raw profile files (bytes):
>         Original       IC-profiling        Increase
>        4031666760       5063229256           25.6%
>
> Total size of merged profile file (bytes):
>        Original       IC-profiling        Increase
>        65681320         65973768            0.44%
>
> Average of three runs, using time (seconds):
>        Original       IC-profiling        Increase
>         47.38           55.73              17.6%
>
> The above numbers were collected from compiling the whole benchmark. We
> have the IC profile data collected from clang; if there is interest, we
> can share the data with the community.
>
> -Betul
>
> > If you then rebuild clang with PGO, how much does it speed things
> > up?
> >
> >
>
>
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>