[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Thu Apr 17 07:13:58 PDT 2014

On Thu, Apr 17, 2014 at 6:10 PM, Yaron Keren <yaron.keren at gmail.com> wrote:

> If accuracy is not critical, incrementing the counters without any guards
> might be good enough.
>

No.  Contention on the counters leads to 5x-10x slowdown. This is never
good enough.

--kcc

Hot areas will still be hot and cold areas will not be affected.
>
> Yaron
>
>
>
> 2014-04-17 15:21 GMT+03:00 Kostya Serebryany <kcc at google.com>:
>
>> Hi,
>>
>> The current design of -fprofile-instr-generate has the same fundamental
>> flaw
>> as the old gcc's gcov instrumentation: it has contention on counters.
>> A trivial synthetic test case was described here:
>> http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html
>>
>> For the problem to appear we need to have a hot function that is
>> simultaneously executed
>> by multiple threads -- then we will have high contention on the racy
>> profile counters.
>>
>> Such situation is not necessary very frequent, but when it happens
>> -fprofile-instr-generate becomes barely usable due to huge slowdown
>> (5x-10x)
>>
>> An example is the multi-threaded vp9 video encoder.
>>
>> git clone https://chromium.googlesource.com/webm/libvpx
>> cd libvpx/
>> F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F"
>> CXX="clang++ $F" LD="clang++ $F" ./configure
>> make -j32
>> # get sample video from from
>> https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m
>> time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m
>>
>> When running single-threaded, -fprofile-instr-generate adds reasonable
>> ~15% overhead
>> (8.5 vs 10 seconds)
>> When running with 8 threads, it has 7x overhead (3.5 seconds vs 26
>> seconds).
>>
>> I am not saying that this flaw is a showstopper, but with the continued
>> move
>> towards multithreading it will be hurting more and more users of coverage
>> and PGO.
>> AFAICT, most of our PGO users simply can not run their software in
>> single-threaded mode,
>> and some of them surely have hot functions running in all threads at
>> once.
>>
>> At the very least we should document this problem, but better try fixing
>> it.
>>
>> Some ideas:
>>
>> - per-thread counters. Solves the problem at huge cost in RAM per-thread
>> - 8-bit per-thread counters, dumping into central counters on overflow.
>> - per-cpu counters (not portable, requires very modern kernel with lots
>> of patches)
>> - sharded counters: each counter represented as N counters sitting in
>> different cache lines. Every thread accesses the counter with index TID%N.
>> Solves the problem partially, better with larger values of N, but then
>> again it costs RAM.
>> - reduce contention on hot counters by not incrementing them if they are
>> big enough:
>>    {if (counter < 65536) counter++}; This reduces the accuracy though. Is
>> that bad for PGO?
>> - self-cooling logarithmic counters: if ((fast_random() % (1 << counter))
>> == 0) counter++;
>>
>> Other thoughts?
>>
>> --kcc
>>
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140417/d112c0ba/attachment.html>