[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Thu Apr 17 10:38:27 PDT 2014

On Thu, Apr 17, 2014 at 5:21 AM, Kostya Serebryany <kcc at google.com> wrote:

> Hi,
>
> The current design of -fprofile-instr-generate has the same fundamental
> flaw
> as the old gcc's gcov instrumentation: it has contention on counters.
> A trivial synthetic test case was described here:
> http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html
>
> For the problem to appear we need to have a hot function that is
> simultaneously executed
> by multiple threads -- then we will have high contention on the racy
> profile counters.
>
> Such situation is not necessary very frequent, but when it happens
> -fprofile-instr-generate becomes barely usable due to huge slowdown
> (5x-10x)
>

We have seen even larger slowdowns, but it is uncommon, nor have I heard
many complaints about it.

>
> An example is the multi-threaded vp9 video encoder.
>
> git clone https://chromium.googlesource.com/webm/libvpx
> cd libvpx/
> F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" CXX="clang++
> $F" LD="clang++ $F" ./configure
> make -j32
> # get sample video from from
> https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m
> time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m
>
> When running single-threaded, -fprofile-instr-generate adds reasonable
> ~15% overhead
> (8.5 vs 10 seconds)
> When running with 8 threads, it has 7x overhead (3.5 seconds vs 26
> seconds).
>
> I am not saying that this flaw is a showstopper, but with the continued
> move
> towards multithreading it will be hurting more and more users of coverage
> and PGO.
> AFAICT, most of our PGO users simply can not run their software in
> single-threaded mode,
> and some of them surely have hot functions running in all threads at once.
>
> At the very least we should document this problem, but better try fixing
> it.
>
>
Users can also select smaller (but still representative) training data set
to solve the problem.

> Some ideas:
>
> - per-thread counters. Solves the problem at huge cost in RAM per-thread
>

It is not practical. Especially for TLS counters -- it creates huge
pressure on stack memory.

> - 8-bit per-thread counters, dumping into central counters on overflow.
>

The overflow will happen very quickly with 8bit counter.

> - per-cpu counters (not portable, requires very modern kernel with lots of
> patches)
> - sharded counters: each counter represented as N counters sitting in
> different cache lines. Every thread accesses the counter with index TID%N.
> Solves the problem partially, better with larger values of N, but then
> again it costs RAM.
>

Interesting idea. This may work well.

> - reduce contention on hot counters by not incrementing them if they are
> big enough:
>    {if (counter < 65536) counter++}; This reduces the accuracy though. Is
> that bad for PGO?
>

yes, it will be bad.

> - self-cooling logarithmic counters: if ((fast_random() % (1 << counter))
> == 0) counter++;
>
>
Another idea is to use stack local counters per function -- synced up with
global counters on entry and exit. the problem with it is for deeply
recursive calls, stack pressure can be too high.

In Google GCC, we implemented another technique which proves to be very
effective -- it is called FDO sampling. Basically counters will be updated
every N samples.

David

> Other thoughts?
>
> --kcc
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140417/8b0f6b2b/attachment.html>