[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Thu Apr 17 11:50:42 PDT 2014

On Thu, Apr 17, 2014 at 11:47 AM, Bob Wilson <bob.wilson at apple.com> wrote:

>
> On Apr 17, 2014, at 11:41 AM, Chandler Carruth <chandlerc at google.com>
> wrote:
>
>
> On Thu, Apr 17, 2014 at 11:22 AM, Bob Wilson <bob.wilson at apple.com> wrote:
>
>> On Apr 17, 2014, at 11:09 AM, Xinliang David Li <xinliangli at gmail.com>
>> wrote:
>>
>>
>> On Thu, Apr 17, 2014 at 10:58 AM, Duncan P. N. Exon Smith <
>> dexonsmith at apple.com> wrote:
>>
>>>
>>> On 2014-Apr-17, at 10:38, Xinliang David Li <xinliangli at gmail.com>
>>> wrote:
>>>
>>> >
>>> > Another idea is to use stack local counters per function -- synced up
>>> > with global counters on entry and exit. The problem with it is that for
>>> > deeply recursive calls, stack pressure can be too high.
>>>
>>> I think they'd need to be synced with global counters before function
>>> calls as well, since any function call can call "exit()".
>>>
>>
>> Right -- but it might be better to handle this in other ways. For
>> instance, a stack of counters for each frame is maintained. At exit, they
>> are flushed in a batch. Or simply ignore it in case of program exit.
>>
>>
>> It seems to me like we’re going to have a hard time getting good
>> multithreaded performance without significant impact on the single-threaded
>> behavior. We might need to add an option to choose between those. There’s a
>> lot of room for improvement in the performance with the current
>> instrumentation, so maybe we can find a way to make things incrementally
>> better in a way that helps both, but avoiding the multithreaded cache
>> conflicts seems like it’s going to be expensive in other ways.
>>
>
> I don't really agree.
>
> First, multithreaded applications are going to be the majority soon, even
> if they aren't already. We should design for them and support them well by
> default. If, once we have that, we find single threaded performance
> dramatically suffers, then maybe we should add a flag. But it doesn't make
> sense to do this before we even have data.
>
>
> If someone wants to revise the instrumentation in a way that works better
> for multithreaded code, that’s great. Before the change is committed, we
> should have performance data comparing it to the current code. If there is
> no regression, then fine. If it significantly hurts single-threaded
> performance, then we will need a flag.
>

If you want to default Darwin into slow multithreaded instrumentation to
make single threaded instrumentation faster, go for it. That is not the
correct default for Clang as a project or the Linux port though.

Anyways, all of this is somewhat moot. We really need the *architecture* of
PGO instrumentation to not be multiple times slower for multithreaded
applications. I think this is a critical limitation of the current design.
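
To make the contention point concrete, here is a minimal sketch (not the
actual compiler-rt runtime) contrasting a shared counter that is bumped on
every iteration with the stack-local variant suggested earlier in the thread,
which syncs with the global counter once on function exit. The names
g_counter, work_shared, work_local, and run are hypothetical, and the sketch
assumes atomic increments so that it stays well-defined under threads.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a single profile counter; this is not the
// layout used by the real instrumentation runtime.
static std::atomic<uint64_t> g_counter{0};

// Direct scheme: every iteration writes to the shared cache line, so
// threads on different cores keep stealing that line from each other.
void work_shared(uint64_t iters) {
  for (uint64_t i = 0; i < iters; ++i)
    g_counter.fetch_add(1, std::memory_order_relaxed);
}

// Stack-local scheme: count in a local variable and sync with the global
// counter once on exit. Counts are lost if the program exits mid-function.
void work_local(uint64_t iters) {
  uint64_t local = 0;
  for (uint64_t i = 0; i < iters; ++i)
    ++local;
  g_counter.fetch_add(local, std::memory_order_relaxed);
}

// Runs fn on nthreads threads and returns the resulting global count.
uint64_t run(void (*fn)(uint64_t), unsigned nthreads, uint64_t iters) {
  assert(nthreads > 0 && "need at least one thread");
  g_counter.store(0);
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < nthreads; ++t)
    pool.emplace_back(fn, iters);
  for (auto &th : pool)
    th.join();
  return g_counter.load();
}
```

Both variants end with the same total; the difference is how often each
thread writes to the shared line. Timing run(work_shared, ...) against
run(work_local, ...) with several threads would be one way to gather the
kind of data asked for above.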