[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Dmitry Vyukov dvyukov at google.com
Fri Apr 18 02:30:01 PDT 2014


On Fri, Apr 18, 2014 at 1:21 PM, Chandler Carruth <chandlerc at google.com> wrote:

> On Fri, Apr 18, 2014 at 2:10 AM, Kostya Serebryany <kcc at google.com> wrote:
>
>> One more proposal: simple per-thread counters allocated with
>> mmap(MAP_NORESERVE), the same trick that works so well for asan/tsan/msan.
>>
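[A minimal sketch of this allocation pattern, assuming Linux mmap semantics; MAX_COUNTERS and the function name are illustrative, not from any actual runtime. The region is only reserved as address space, and a physical page is committed the first time a thread touches a counter on it.]

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define MAX_COUNTERS (10 * 1000 * 1000)  /* assumed upper bound */

    static uint64_t *alloc_counter_array(void) {
        /* MAP_NORESERVE: reserve 80 MB of address space without
         * committing swap/physical memory up front */
        void *p = mmap(NULL, MAX_COUNTERS * sizeof(uint64_t),
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            abort();
        }
        return p;  /* anonymous pages read as zero until first written */
    }
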
>> Chrome has ~3M basic blocks instrumented for coverage,
>> so even the largest applications will hardly have more than, say, 10M
>> basic blocks
>>
>
> I think this is a *gross* underestimation. I work with applications more
> than one order of magnitude larger than Chrome.
>
>
>> (the number can be configured at application start time). This gives us
>> 80 MB for the array of 64-bit counters.
>> That's a lot if multiplied by the number of threads, but the
>> MAP_NORESERVE trick solves the problem --
>> each thread will only touch the pages where it actually increments the
>> counters.
>> On thread exit the whole 80 MB counter array will be merged into a
>> central array of counters and then discarded,
>> but we can also postpone this until a new thread is created -- then
>> we just reuse the counter array.
>>
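[A sketch of what this thread-exit path could look like; merge_and_recycle and central_counters are assumed names, not part of any existing runtime. Atomic adds are used because several threads may exit concurrently.]

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    extern _Atomic uint64_t central_counters[];  /* assumed global array */

    static void merge_and_recycle(uint64_t *local, size_t n) {
        for (size_t i = 0; i < n; i++) {
            if (local[i]) {  /* untouched slots read as 0 from the zero page */
                atomic_fetch_add_explicit(&central_counters[i], local[i],
                                          memory_order_relaxed);
                local[i] = 0;  /* zero it so a new thread can reuse the array */
            }
        }
        /* push 'local' onto a free list here rather than munmap()ing it */
    }
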
>> This brings two challenges.
>>
>> #1. The basic blocks should be numbered sequentially. I see only one way
>> to accomplish this: with the help of the linker (and the dynamic linker
>> for DSOs). The compiler would emit code using offsets that will later be
>> transformed into constants by the linker.
>> Not sure if any existing linker supports this kind of thing. Anyone?
>>
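[Aside: GNU ld already offers one mechanism in this direction. If every counter is placed in a section whose name is a valid C identifier, the linker lays them out contiguously and synthesizes __start_/__stop_ symbols for the section, so a sequential index falls out as an offset from the section start. A sketch; the section and counter names are illustrative, and this is not what the compiler actually emits.]

    #include <stdint.h>
    #include <stdio.h>

    #define COUNTER(name) \
        static uint64_t name __attribute__((section("prf_cnts"), used))

    COUNTER(cnt_foo);
    COUNTER(cnt_bar);

    extern uint64_t __start_prf_cnts[];  /* synthesized by GNU ld */
    extern uint64_t __stop_prf_cnts[];

    int main(void) {
        cnt_foo++;  /* a per-thread scheme would instead increment
                       tls_base[&cnt_foo - __start_prf_cnts] */
        printf("%ld counters; cnt_foo is slot %ld\n",
               (long)(__stop_prf_cnts - __start_prf_cnts),
               (long)(&cnt_foo - __start_prf_cnts));
        return 0;
    }
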
>> #2. How to access the per-thread counter array. If we simply store the
>> pointer to the array in TLS, the instrumentation will be more expensive
>> just because of the need to load and keep this pointer.
>> If the counter array is part of TLS itself, we'll have to intrude into
>> the pthread library (or wrap it) so that this part of TLS is mapped with
>> MAP_NORESERVE.
>>
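[A sketch of the TLS-pointer variant from #2; tls_counters and count_block are illustrative names. Every increment first loads the per-thread base pointer, which is exactly the extra cost in question.]

    #include <stdint.h>

    static __thread uint64_t *tls_counters;  /* set at thread start, e.g. to
                                                the mmap'ed NORESERVE array */

    static inline void count_block(uint32_t id) {
        tls_counters[id]++;  /* one extra load of the TLS base, then the add */
    }
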
>
> #3. It essentially *requires* a complex merge on shutdown rather than a
> simple flush. I'm not even sure how to do the merge without dirtying still
> more pages of the no-reserve memory.
>
>
> It's not at all clear to me that this scales up (either in memory usage,
> memory reservation, or shutdown time) to larger applications. Chrome isn't
> a useful upper bound here.
>

Array processing is fast. Contention is slow. I would expect this to be a
net win.
As for the additional memory consumption during the final merge, we can
process one per-thread array, unmap it, process the second array, unmap it,
and so on.
This will not require bringing all the pages into memory at once.
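[A minimal sketch of this one-array-at-a-time merge; thread_arrays and the other globals are assumed bookkeeping, not existing APIs. Each mapping is unmapped before the next is touched, so peak memory stays bounded by a single per-thread array.]

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    extern uint64_t central_counters[];   /* assumed merged result */
    extern uint64_t *thread_arrays[];     /* assumed registry of mappings */
    extern size_t num_thread_arrays, num_counters;

    static void merge_all_on_shutdown(void) {
        for (size_t t = 0; t < num_thread_arrays; t++) {
            uint64_t *a = thread_arrays[t];
            for (size_t i = 0; i < num_counters; i++)
                central_counters[i] += a[i];  /* reads of untouched pages hit
                                                 the shared zero page */
            munmap(a, num_counters * sizeof(uint64_t));
        }
    }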