[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Fri Apr 18 02:21:13 PDT 2014

On Fri, Apr 18, 2014 at 2:10 AM, Kostya Serebryany <kcc at google.com> wrote:

> One more proposal: simple per-thread counters allocated with
> mmap(MAP_NORESERVE), the same trick that works so well for asan/tsan/msan.
>
> Chrome has ~3M basic blocks instrumented for coverage,
> so even the largest applications will hardly have more than, say, 10M basic
> blocks
>

I think this is a *gross* underestimation. I work with applications more
than one order of magnitude larger than Chrome.

> (number can be configurable at application start time). This gives us 80 MB
> for the array of 64-bit counters.
> That's a lot if multiplied by the number of threads, but the MAP_NORESERVE
> trick solves the problem --
> each thread will only touch the pages where it actually increments the
> counters.
> On thread exit the whole 80 MB counter array will be merged into a
> central array of counters and then discarded,
> but we can also postpone this until another new thread is created -- then
> we just reuse the counter array.
>
> This brings two challenges.
>
> #1. The basic blocks should be numbered sequentially. I see only one way
> to accomplish this: with the help of the linker (and the dynamic linker for
> DSOs). The compiler would emit code using offsets that will later be
> transformed into constants by the linker.
> Not sure if any existing linker supports this kind of thing. Anyone?
>
> #2. How to access the per-thread counter array. If we simply store the
> pointer to the array in TLS, the instrumentation will be more expensive
> just because of the need to load and keep this pointer.
> If the counter array is part of TLS itself, we'll have to intrude into the
> pthread library (or wrap it) so that this part of TLS is mapped with
> MAP_NORESERVE.
>

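For concreteness, here is a rough sketch of what the quoted proposal might
look like on Linux, using the "pointer in TLS" variant from #2. It is only an
illustration under stated assumptions: MAX_COUNTERS, alloc_thread_counters,
free_thread_counters, tls_counters and count_block are hypothetical names, and
the 10M bound is simply the figure from the quoted text, not a measured limit.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical upper bound on instrumented basic blocks (the 10M figure
   from the proposal); 10M 64-bit counters span 80 MB of address space. */
#define MAX_COUNTERS (10u * 1000u * 1000u)

/* Reserve address space for one thread's counters without reserving swap.
   Anonymous pages read as zero until written. Returns NULL on failure. */
static uint64_t *alloc_thread_counters(size_t n) {
    assert(n > 0);
    void *p = mmap(NULL, n * sizeof(uint64_t), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : (uint64_t *)p;
}

/* Release one thread's counter array. */
static void free_thread_counters(uint64_t *counters, size_t n) {
    munmap(counters, n * sizeof(uint64_t));
}

/* Per-thread pointer to the counter array (assumption: TLS-pointer variant). */
static _Thread_local uint64_t *tls_counters;

/* What the instrumentation at basic block `id` would do: load the TLS
   pointer, then increment without any atomics or cross-thread sharing. */
static inline void count_block(size_t id) {
    tls_counters[id]++;
}
```

Each count_block call pays for the extra TLS load that #2 is worried about;
the sketch does nothing about the sequential numbering needed for #1.
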
#3. It essentially *requires* a complex merge on shutdown rather than a
simple flush. I'm not even sure how to do the merge without dirtying still
more pages of the no-reserve memory.
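
To make the merge concern concrete, a naive version might look like the
sketch below. merge_counters and its parameters are hypothetical names, and
this is not a proposal for how the runtime should do it. Note that it has to
walk the entire per-thread array, including every page the thread never
touched, which is exactly the cost in question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Add one thread's counters into the shared central array.
   Zero entries are skipped so the central array is only written where the
   thread actually counted something, but the local array is still read in
   full: the loop is O(n) in the array size, not in the counters used. */
static void merge_counters(_Atomic uint64_t *central, const uint64_t *local,
                           size_t n) {
    assert(central != NULL && local != NULL);
    for (size_t i = 0; i < n; i++) {
        uint64_t v = local[i];
        if (v != 0) {
            /* Relaxed ordering: only the final sums matter here. */
            atomic_fetch_add_explicit(&central[i], v, memory_order_relaxed);
        }
    }
}
```

With an 80 MB array per thread, that is 10M loads per exiting thread even if
the thread only ever incremented a handful of counters.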

It's not at all clear to me that this scales up (either in memory usage,
memory reservation, or shutdown time) to larger applications. Chrome isn't
a useful upper bound here.