<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Apr 17, 2014 at 6:10 PM, Yaron Keren <span dir="ltr"><<a href="mailto:yaron.keren@gmail.com" target="_blank">yaron.keren@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="rtl"><div dir="ltr">If accuracy is not critical, incrementing the counters without any guards might be good enough.</div>
</div></blockquote><div><br></div><div>No. Contention on the counters leads to 5x-10x slowdown. This is never good enough. </div><div><br></div><div>--kcc </div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="rtl"><div dir="ltr">Hot areas will still be hot and cold areas will not be affected.<br></div><div dir="ltr">
<br></div><div dir="ltr">Yaron</div><div dir="ltr"><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote"><div dir="ltr">2014-04-17 15:21 GMT+03:00 Kostya Serebryany <span dir="ltr"><<a href="mailto:kcc@google.com" target="_blank">kcc@google.com</a>></span>:</div>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5"><div dir="ltr">Hi, <div><br></div><div>The current design of -fprofile-instr-generate has the same fundamental flaw </div>
<div>as the old gcc's gcov instrumentation: it has contention on counters. </div><div>A trivial synthetic test case was described here: </div>
<div><a href="http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html" target="_blank">http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html</a><br></div><div><br></div><div>For the problem to appear we need to have a hot function that is simultaneously executed </div>
<div>by multiple threads -- then we will have high contention on the racy profile counters.</div><div><br></div><div><div>Such situation is not necessary very frequent, but when it happens </div><div>-fprofile-instr-generate becomes barely usable due to huge slowdown (5x-10x)<br>
</div><div><br></div><div>An example is the multi-threaded vp9 video encoder.</div><div><br></div><div>git clone <a href="https://chromium.googlesource.com/webm/libvpx" target="_blank">https://chromium.googlesource.com/webm/libvpx</a><br>
</div><div>cd libvpx/<br></div><div>F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" CXX="clang++ $F" LD="clang++ $F" ./configure<br></div><div>make -j32 </div><div># get sample video from from <a href="https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m" target="_blank">https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m</a><br>
</div><div>time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m <br></div><div><br></div><div>When running single-threaded, -fprofile-instr-generate adds reasonable ~15% overhead<br></div><div>(8.5 vs 10 seconds)</div><div>When running with 8 threads, it has 7x overhead (3.5 seconds vs 26 seconds).</div>
<div><br></div><div>I am not saying that this flaw is a showstopper, but with the continued move</div><div>towards multithreading it will be hurting more and more users of coverage and PGO.</div><div>AFAICT, most of our PGO users simply can not run their software in single-threaded mode,</div>
<div>and some of them surely have hot functions running in all threads at once. </div><div><br></div><div>At the very least we should document this problem, but better try fixing it. </div><div><br></div><div>Some ideas:</div>
<div><br></div><div>- per-thread counters. Solves the problem at huge cost in RAM per-thread</div><div>- 8-bit per-thread counters, dumping into central counters on overflow. </div><div>- per-cpu counters (not portable, requires very modern kernel with lots of patches)</div>
<div>- sharded counters: each counter represented as N counters sitting in different cache lines. Every thread accesses the counter with index TID%N. Solves the problem partially, better with larger values of N, but then again it costs RAM.</div>
<div>- reduce contention on hot counters by not incrementing them if they are big enough:</div><div> {if (counter < 65536) counter++}; This reduces the accuracy though. Is that bad for PGO?</div><div>- self-cooling logarithmic counters: if ((fast_random() % (1 << counter)) == 0) counter++;</div>
<div><br></div><div>Other thoughts?<br></div><div><br></div><div>--kcc </div><div><br></div><div><br></div><div><br></div></div></div>
<br></div></div>_______________________________________________<br>
LLVM Developers mailing list<br>
<a href="mailto:LLVMdev@cs.uiuc.edu" target="_blank">LLVMdev@cs.uiuc.edu</a> <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>
<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>
<br></blockquote></div><br></div>
</blockquote></div><br></div></div>