[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
Jonathan Roelofs
jonathan at codesourcery.com
Thu Apr 17 09:37:34 PDT 2014
How about per-thread if the counter is hot enough?
Jon
On 4/17/14, 7:13 AM, Kostya Serebryany wrote:
>
>
>
> On Thu, Apr 17, 2014 at 6:10 PM, Yaron Keren <yaron.keren at gmail.com> wrote:
>
> If accuracy is not critical, incrementing the counters without any guards
> might be good enough.
>
>
> No. Contention on the counters leads to 5x-10x slowdown. This is never good
> enough.
>
> --kcc
>
> Hot areas will still be hot and cold areas will not be affected.
>
> Yaron
>
>
>
> 2014-04-17 15:21 GMT+03:00 Kostya Serebryany <kcc at google.com>:
>
> Hi,
>
> The current design of -fprofile-instr-generate has the same fundamental flaw
> as the old gcc's gcov instrumentation: it has contention on counters.
> A trivial synthetic test case was described here:
> http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html
>
> For the problem to appear we need to have a hot function that is
> simultaneously executed
> by multiple threads -- then we will have high contention on the racy
> profile counters.
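
To make the scenario above concrete, here is a minimal C++ sketch. It is not the synthetic test case from the linked thread. The names run_shared and run_private are hypothetical, and relaxed atomics stand in for the plain racy increments that the instrumentation emits, only so that the sketch itself has no data race. The cache line holding the counter still bounces between cores in the shared version.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for one instrumented hot function: every iteration
// bumps the same shared counter, so all threads fight over one cache line.
std::uint64_t run_shared(unsigned num_threads, std::uint64_t iters) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counter, iters] {
            for (std::uint64_t i = 0; i < iters; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : threads) th.join();
    return counter.load();
}

// Same total count, but each thread counts privately and publishes once.
// Note: an optimizer may collapse the private loop entirely, so treat any
// timing comparison as a rough illustration only.
std::uint64_t run_private(unsigned num_threads, std::uint64_t iters) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&total, iters] {
            std::uint64_t local = 0;
            for (std::uint64_t i = 0; i < iters; ++i) ++local;
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& th : threads) th.join();
    return total.load();
}
```

Both functions return num_threads * iters. Timing them with several threads and a large iteration count should show the gap, though the exact ratio depends on the machine.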
>
> Such a situation is not necessarily very frequent, but when it happens
> -fprofile-instr-generate becomes barely usable due to a huge slowdown (5x-10x).
>
> An example is the multi-threaded vp9 video encoder.
>
> git clone https://chromium.googlesource.com/webm/libvpx
> cd libvpx/
> F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" CXX="clang++ $F" LD="clang++ $F" ./configure
> make -j32
> # get a sample video from
> https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m
> time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m
>
> When running single-threaded, -fprofile-instr-generate adds a reasonable
> ~15% overhead (8.5 vs 10 seconds).
> When running with 8 threads, it has 7x overhead (3.5 seconds vs 26 seconds).
>
> I am not saying that this flaw is a showstopper, but with the continued move
> towards multithreading it will be hurting more and more users of
> coverage and PGO.
> AFAICT, most of our PGO users simply cannot run their software in
> single-threaded mode,
> and some of them surely have hot functions running in all threads at once.
>
> At the very least we should document this problem, but it would be better
> to try fixing it.
>
> Some ideas:
>
> - per-thread counters. Solves the problem at huge cost in RAM per-thread
> - 8-bit per-thread counters, dumping into central counters on overflow.
> - per-cpu counters (not portable, requires very modern kernel with lots
> of patches)
> - sharded counters: each counter represented as N counters sitting in
> different cache lines. Every thread accesses the counter with index
> TID%N. Solves the problem partially, better with larger values of N, but
> then again it costs RAM.
> - reduce contention on hot counters by not incrementing them if they are
> big enough:
> {if (counter < 65536) counter++}; This reduces the accuracy though.
> Is that bad for PGO?
> - self-cooling logarithmic counters:
> if ((fast_random() % (1 << counter)) == 0) counter++;
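
Two of the ideas in the list above, sharded counters and the "stop incrementing once big enough" counter, can be sketched in C++ as follows. This is only an illustration under stated assumptions, not how the instrumentation is or would be implemented: ShardedCounter, saturating_increment, the shard count of 8 and the 64-byte cache line are all hypothetical choices, and the thread id is passed in by the caller.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kShards = 8;         // "N" from the list; arbitrary assumption
constexpr std::size_t kCacheLine = 64;     // assumed cache-line size in bytes
constexpr std::uint32_t kSaturation = 65536;  // threshold quoted in the list

// Each shard is padded to its own cache line so that threads hitting
// different shards do not invalidate each other's lines.
struct alignas(kCacheLine) Shard {
    std::atomic<std::uint64_t> value{0};
};

class ShardedCounter {
public:
    // Every thread uses the shard with index TID % N.
    void increment(std::size_t tid) {
        shards_[tid % kShards].value.fetch_add(1, std::memory_order_relaxed);
    }
    // The logical counter value is the sum over all shards.
    std::uint64_t read() const {
        std::uint64_t sum = 0;
        for (const auto& s : shards_)
            sum += s.value.load(std::memory_order_relaxed);
        return sum;
    }
private:
    std::array<Shard, kShards> shards_;
};

// Saturating counter: once the value reaches the threshold, threads only
// read it, so the cache line can stay shared instead of bouncing.
// Under concurrency the value may overshoot the threshold slightly.
inline void saturating_increment(std::atomic<std::uint32_t>& counter) {
    if (counter.load(std::memory_order_relaxed) < kSaturation)
        counter.fetch_add(1, std::memory_order_relaxed);
}
```

The RAM cost mentioned in the list is visible here: one logical counter occupies kShards * kCacheLine bytes (512 bytes with the values assumed above) instead of 8.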
>
> Other thoughts?
>
> --kcc
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
>
>
>
--
Jon Roelofs
jonathan at codesourcery.com
CodeSourcery / Mentor Embedded