<div dir="ltr">Hi, <div><br></div><div>The current design of -fprofile-instr-generate has the same fundamental flaw </div><div>as the old gcc's gcov instrumentation: it has contention on counters. </div><div>A trivial synthetic test case was described here: </div>

<div><a href="http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html">http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html</a><br></div><div><br></div><div>For the problem to appear we need to have a hot function that is simultaneously executed </div>

<div>by multiple threads -- then we will have high contention on the racy profile counters.</div><div><br></div><div><div>Such situation is not necessary very frequent, but when it happens </div><div>-fprofile-instr-generate becomes barely usable due to huge slowdown (5x-10x)<br>

</div><div><br></div><div>An example is the multi-threaded vp9 video encoder.</div><div><br></div><div>git clone <a href="https://chromium.googlesource.com/webm/libvpx">https://chromium.googlesource.com/webm/libvpx</a><br>

</div><div>cd libvpx/<br></div><div>F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" CXX="clang++ $F" LD="clang++ $F" ./configure<br></div><div>make -j32 </div><div># get sample video from from <a href="https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m">https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m</a><br>

</div><div>time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m  <br></div><div><br></div><div>When running single-threaded, -fprofile-instr-generate adds reasonable ~15% overhead<br></div><div>(8.5 vs 10 seconds)</div><div>When running with 8 threads, it has 7x overhead (3.5 seconds vs 26 seconds).</div>

<div><br></div><div>I am not saying that this flaw is a showstopper, but with the continued move</div><div>towards multithreading it will be hurting more and more users of coverage and PGO.</div><div>AFAICT, most of our PGO users simply can not run their software in single-threaded mode,</div>

<div>and some of them surely have hot functions running in all threads at once. </div><div><br></div><div>At the very least we should document this problem, but better try fixing it. </div><div><br></div><div>Some ideas:</div>

<div><br></div><div>- per-thread counters. Solves the problem at huge cost in RAM per-thread</div><div>- 8-bit per-thread counters, dumping into central counters on overflow. </div><div>- per-cpu counters (not portable, requires very modern kernel with lots of patches)</div>

<div>- sharded counters: each counter represented as N counters sitting in different cache lines. Every thread accesses the counter with index TID%N. Solves the problem partially, better with larger values of N, but then again it costs RAM.</div>

<div>- reduce contention on hot counters by not incrementing them if they are big enough:</div><div>   {if (counter < 65536) counter++}; This reduces the accuracy though. Is that bad for PGO?</div><div>- self-cooling logarithmic counters: if ((fast_random() % (1 << counter)) == 0) counter++;</div>

<div><br></div><div>Other thoughts?<br></div><div><br></div><div>--kcc </div><div><br></div><div><br></div><div><br></div></div></div>