[LLVMdev] IC profiling infrastructure

Wed Apr 29 10:26:24 PDT 2015

On Wed, Apr 29, 2015 at 10:19 AM,  <betulb at codeaurora.org> wrote:
>>> From: <betulb at codeaurora.org>
>>> Date: Tue, Apr 7, 2015 at 12:44 PM
>>> Subject: [LLVMdev] IC profiling infrastructure
>>> To: llvmdev at cs.uiuc.edu
>>>
>>>
>>>
>>> Hi All,
>>>
>>> We had sent out an RFC in October on indirect call target profiling. The
>>> proposal was about profiling target addresses seen at indirect call
>>> sites.
>>> Using the profile data we're seeing up to 8% performance improvements on
>>> individual spec benchmarks where indirect call sites are present. We've
>>> already started uploading our patches to the phabricator. I'm looking
>>> forward to your reviews and comments on the code and ready to respond to
>>> your design related queries.
>>>
>>> There were a few questions posted on the RFC that went unanswered. Here
>>> are the much-delayed responses.
>>>
>>
>> Hi Betul, thank you for your patience.  I have completed initial
>> comparison with a few alternative value profile designs. My conclusion
>> is that your proposed approach should work well in practice. The study can
>> be found here:
>> https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub
>
> Hi David,
>
> Thanks for the detailed report and working on this. We really appreciate
> the feedback. We're looking forward to the comments and to upstreaming the
> changes.
>
>>
>>> 1) Added dependencies: Our implementation adds a dependency on calloc/free
>>> as we’re generating/maintaining a linked list at run time.
>>
>> If it becomes a problem for some, there is a way to handle that -- but
>> at the cost of more memory (to be conservative). One of the
>> good features of using dynamic memory is that it allows counter array
>> allocation on the fly which eliminates the need to allocate memory for
>> lots of cold/unexecuted functions.
>>
>>> We also added
>>> a dependency on mutexes to prevent memory leaks when
>>> multiple threads try to insert a new target address for the same IC
>>> site into the linked list. To minimize the performance impact we only added
>>> mutexes around the pointer assignment and kept all dynamic memory
>>> allocation/free operations outside of the mutexed code.
>>
>> This (using mutexes) should be and can be avoided -- see the above report.
>
> I did read your report carefully. You suggested updating the linked list
> links atomically to avoid mutexes. We have a runtime written in C, so I was
> not sure if introducing C++11 features like std::atomic was OK or not. Also
> some operations can be performed atomically on x86 platforms (based on
> data being aligned at various bit length/cache line boundaries) but ARM or
> other platforms would not support that.

The suggestion is to use the atomic builtins -- see the review comments.
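
As a rough illustration only, lock-free insertion into a per-site target list
could look something like the sketch below. It assumes the GCC/Clang
`__atomic` builtins, assumes nodes are only ever pushed at the head and never
removed while profiling runs, and uses hypothetical names (`TargetNode`,
`record_target`) that are not taken from the patches under review.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node: one entry per distinct target seen at a call site. */
typedef struct TargetNode {
  uint64_t target;          /* callee address observed at this site */
  uint64_t count;           /* number of times it was observed */
  struct TargetNode *next;  /* written once, before the node is published */
} TargetNode;

/* Hypothetical helper: record one call to `target` at the site whose list
 * head is `*head`. No mutex is taken; a new node is published with a
 * compare-and-swap on the head pointer. */
static void record_target(TargetNode **head, uint64_t target) {
  TargetNode *fresh = NULL;
  for (;;) {
    TargetNode *first = __atomic_load_n(head, __ATOMIC_ACQUIRE);

    /* Look for an existing entry for this target. */
    for (TargetNode *n = first; n != NULL; n = n->next) {
      if (n->target == target) {
        __atomic_fetch_add(&n->count, 1, __ATOMIC_RELAXED);
        free(fresh); /* drop a node allocated on an earlier failed attempt */
        return;
      }
    }

    /* Not found: allocate outside of any critical section. */
    if (fresh == NULL) {
      fresh = calloc(1, sizeof *fresh);
      if (fresh == NULL)
        return; /* out of memory: skip this sample */
      fresh->target = target;
      fresh->count = 1;
    }
    fresh->next = first;

    /* Publish only if the head is still what we scanned from. */
    if (__atomic_compare_exchange_n(head, &first, fresh, 0,
                                    __ATOMIC_RELEASE, __ATOMIC_RELAXED))
      return;
    /* Another thread pushed first; rescan in case it added this target. */
  }
}
```

Because a failed compare-and-swap triggers a rescan, a thread that loses the
race either bumps the winner's counter and frees its own node, or retries the
push, so no node is leaked or duplicated. On targets without native 64-bit
atomics the counter update may need a narrower type or library support.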

>
>>>
>>> 2) Indirect call data being present in sampling profile output: This is
>>> unfortunately not helping in our case due to perf depending on LBR
>>> support. To our knowledge LBR support is not present on ARM platforms.
>>>
>>
>> yes.
>>
>>> 3) Losing profiling support on targets not supporting malloc/mutexes: The
>>> added dependency on calloc/free/mutexes may perhaps be eliminated
>>> (although our current solution does not handle this) through having a
>>> separate run time library for value profiling purposes. Instrumentation
>>> can link in two run time libraries when value profiling (an instance of
>>> it being indirect call target profiling) is enabled on the command line.
>>
>> See above.
>>
>>>
>>> 4) Performance of the instrumented code: Instrumentation with IC profiling
>>> patches resulted in 7% degradation across spec benchmarks at -O2. For the
>>> benchmarks that did not have any IC sites, no performance degradation was
>>> observed. This data is gathered using the ref data set for spec.
>>>
>>
>> I'd like the runtime part of the change to be shared and used
>> as a general purpose value profiler (not just for indirect call
>> promotion), but this can be done as a follow up.
>
> My understanding of your analysis was that it only covered the run-time
> library performance and did not look into whether instrumentation is
> actually enabled at the right sites.

It mainly focused on the runtime library performance.

>
>> I will start with some reviews. Hopefully others will help with reviews
>> too.

I looked through one patch and sent the comments.

David

>
> Thanks very much. We'll be responding to the reviews diligently.
>
>> thanks,
>>
>> David
>>
>>
>>
>>> Thanks,
>>> -Betul Buyukkurt
>>>
>>> Qualcomm Innovation Center, Inc.
>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
>>> Linux
>>> Foundation Collaborative Project
>>>
>>>
>>>
>>>
>>
>
>