[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Thu Apr 17 09:37:34 PDT 2014

How about per-thread if the counter is hot enough?

Jon

On 4/17/14, 7:13 AM, Kostya Serebryany wrote:
>
>
>
> On Thu, Apr 17, 2014 at 6:10 PM, Yaron Keren <yaron.keren at gmail.com
> <mailto:yaron.keren at gmail.com>> wrote:
>
>     If accuracy is not critical, incrementing the counters without any guards
>     might be good enough.
>
>
> No.  Contention on the counters leads to 5x-10x slowdown. This is never good
> enough.
>
> --kcc
>
>     Hot areas will still be hot and cold areas will not be affected.
>
>     Yaron
>
>
>
>     2014-04-17 15:21 GMT+03:00 Kostya Serebryany <kcc at google.com
>     <mailto:kcc at google.com>>:
>
>         Hi,
>
>         The current design of -fprofile-instr-generate has the same fundamental
>         flaw
>         as the old gcc's gcov instrumentation: it has contention on counters.
>         A trivial synthetic test case was described here:
>         http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html
>
>         For the problem to appear we need to have a hot function that is
>         simultaneously executed
>         by multiple threads -- then we will have high contention on the racy
>         profile counters.
>
>         Such situation is not necessary very frequent, but when it happens
>         -fprofile-instr-generate becomes barely usable due to huge slowdown (5x-10x)
>
>         An example is the multi-threaded vp9 video encoder.
>
>         git clone https://chromium.googlesource.com/webm/libvpx
>         cd libvpx/
>         F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F"
>         CXX="clang++ $F" LD="clang++ $F" ./configure
>         make -j32
>         # get sample video from from
>         https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m
>         time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m
>
>         When running single-threaded, -fprofile-instr-generate adds reasonable
>         ~15% overhead
>         (8.5 vs 10 seconds)
>         When running with 8 threads, it has 7x overhead (3.5 seconds vs 26 seconds).
>
>         I am not saying that this flaw is a showstopper, but with the continued move
>         towards multithreading it will be hurting more and more users of
>         coverage and PGO.
>         AFAICT, most of our PGO users simply can not run their software in
>         single-threaded mode,
>         and some of them surely have hot functions running in all threads at once.
>
>         At the very least we should document this problem, but better try fixing
>         it.
>
>         Some ideas:
>
>         - per-thread counters. Solves the problem at huge cost in RAM per-thread
>         - 8-bit per-thread counters, dumping into central counters on overflow.
>         - per-cpu counters (not portable, requires very modern kernel with lots
>         of patches)
>         - sharded counters: each counter represented as N counters sitting in
>         different cache lines. Every thread accesses the counter with index
>         TID%N. Solves the problem partially, better with larger values of N, but
>         then again it costs RAM.
>         - reduce contention on hot counters by not incrementing them if they are
>         big enough:
>             {if (counter < 65536) counter++}; This reduces the accuracy though.
>         Is that bad for PGO?
>         - self-cooling logarithmic counters: if ((fast_random() % (1 <<
>         counter)) == 0) counter++;
>
>         Other thoughts?
>
>         --kcc
>
>
>
>
>         _______________________________________________
>         LLVM Developers mailing list
>         LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu> http://llvm.cs.uiuc.edu
>         http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

-- 
Jon Roelofs
jonathan at codesourcery.com
CodeSourcery / Mentor Embedded