<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Apr 17, 2014 at 5:21 AM, Kostya Serebryany <span dir="ltr"><<a href="mailto:kcc@google.com" target="_blank">kcc@google.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi, <div><br></div><div>The current design of -fprofile-instr-generate has the same fundamental flaw </div>
<div>as the old gcc's gcov instrumentation: it has contention on counters. </div><div>A trivial synthetic test case was described here: </div>
<div><a href="http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html" target="_blank">http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html</a><br></div><div><br></div><div>For the problem to appear we need to have a hot function that is simultaneously executed </div>
<div>by multiple threads -- then we will have high contention on the racy profile counters.</div><div><br></div><div><div>Such situation is not necessary very frequent, but when it happens </div><div>-fprofile-instr-generate becomes barely usable due to huge slowdown (5x-10x)<br>

We have seen even larger slowdowns, but it is uncommon, and I have not
heard many complaints about it.

<div dir="ltr"><div><div>
</div><div><br></div><div>An example is the multi-threaded vp9 video encoder.</div><div><br></div><div>git clone <a href="https://chromium.googlesource.com/webm/libvpx" target="_blank">https://chromium.googlesource.com/webm/libvpx</a><br>
</div><div>cd libvpx/<br></div><div>F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" CXX="clang++ $F" LD="clang++ $F" ./configure<br></div><div>make -j32 </div><div># get sample video from from <a href="https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m" target="_blank">https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m</a><br>
</div><div>time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m <br></div><div><br></div><div>When running single-threaded, -fprofile-instr-generate adds reasonable ~15% overhead<br></div><div>(8.5 vs 10 seconds)</div><div>When running with 8 threads, it has 7x overhead (3.5 seconds vs 26 seconds).</div>
<div><br></div><div>I am not saying that this flaw is a showstopper, but with the continued move</div><div>towards multithreading it will be hurting more and more users of coverage and PGO.</div><div>AFAICT, most of our PGO users simply can not run their software in single-threaded mode,</div>
<div>and some of them surely have hot functions running in all threads at once. </div><div><br></div><div>At the very least we should document this problem, but better try fixing it. </div><div><br></div></div></div></blockquote>

Users can also select a smaller (but still representative) training data
set to work around the problem.

<div dir="ltr"><div><div></div><div>Some ideas:</div>
<div><br></div><div>- per-thread counters. Solves the problem at huge cost in RAM per-thread</div></div></div></blockquote><div><br></div><div>It is not practical. Especially for TLS counters -- it creates huge pressure on stack memory.</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div>- 8-bit per-thread counters, dumping into central counters on overflow. </div></div>
</div></blockquote><div><br></div><div>The overflow will happen very quickly with 8bit counter.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
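
For concreteness, a minimal sketch of such a scheme (the function name
and flush policy are illustrative, not part of the proposal):

  #include <atomic>
  #include <cstdint>

  // Shared 64-bit counter for one instrumentation point.
  std::atomic<uint64_t> g_counter{0};

  // Cheap per-thread 8-bit shadow counter for the same point.
  thread_local uint8_t t_counter = 0;

  inline void count_hit() {
    // On wrap-around, credit 256 increments to the shared counter
    // with a single atomic add. A real implementation would also
    // flush the residue at thread exit (omitted here).
    if (++t_counter == 0)
      g_counter.fetch_add(256, std::memory_order_relaxed);
  }

Even so, a hot counter reaches the shared cache line once every 256
increments, which is why the overflow comes so quickly.
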
<div dir="ltr"><div><div>- per-cpu counters (not portable, requires very modern kernel with lots of patches)</div>
<div>- sharded counters: each counter represented as N counters sitting in different cache lines. Every thread accesses the counter with index TID%N. Solves the problem partially, better with larger values of N, but then again it costs RAM.</div>

Interesting idea. This may work well.
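
A rough sketch of the sharded variant, assuming a fixed shard count and
the TID%N indexing described above (kShards and the names are my own
illustrative choices):

  #include <atomic>
  #include <cstdint>

  constexpr unsigned kShards = 8;  // an illustrative value of N

  struct ShardedCounter {
    // alignas(64) puts each shard on its own cache line, so threads
    // mapped to different shards neither contend nor false-share.
    struct alignas(64) Shard { std::atomic<uint64_t> value{0}; };
    Shard shards[kShards];

    void increment(unsigned tid) {
      shards[tid % kShards].value.fetch_add(1, std::memory_order_relaxed);
    }

    // The true count is the sum over all shards, taken at dump time.
    uint64_t read() const {
      uint64_t sum = 0;
      for (const Shard &s : shards)
        sum += s.value.load(std::memory_order_relaxed);
      return sum;
    }
  };

The RAM cost is explicit here: each counter grows from 8 bytes to
kShards cache lines.
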
> - reduce contention on hot counters by not incrementing them once they
>   are big enough: if (counter < 65536) counter++;
>   This reduces the accuracy though. Is that bad for PGO?

Yes, it will be bad.

> - self-cooling logarithmic counters:
>   if ((fast_random() % (1 << counter)) == 0) counter++;

Another idea is to use stack-local counters per function, synced up
with the global counters on entry and exit. The problem with it is
that for deeply recursive calls the stack pressure can be too high.
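
A minimal sketch of what instrumented code could look like under that
scheme (hot_function and the counter name are hypothetical):

  #include <atomic>
  #include <cstdint>

  std::atomic<uint64_t> g_loop_counter{0};  // global profile counter

  void hot_function(int n) {
    uint64_t local_count = 0;  // local accumulator, lives on the stack
    for (int i = 0; i < n; ++i) {
      ++local_count;  // cheap, contention-free increment
      // ... loop body ...
    }
    // Sync back up to the global counter on exit.
    g_loop_counter.fetch_add(local_count, std::memory_order_relaxed);
  }

Every recursive activation carries its own copy of the local counters
in its stack frame, which is where the stack pressure for deeply
recursive calls comes from.
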

In Google GCC, we implemented another technique which proved to be
very effective -- it is called FDO sampling. Basically, counters are
updated only once every N samples.
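
Roughly like the following sketch, where the period, the names, and the
scale-by-N policy are illustrative choices rather than the actual GCC
implementation:

  #include <atomic>
  #include <cstdint>

  constexpr uint32_t kSamplePeriod = 100;   // "N": illustrative value

  std::atomic<uint64_t> g_counter{0};       // real profile counter
  thread_local uint32_t t_events = 0;       // per-thread event gate

  inline void count_hit() {
    // Only one event in every kSamplePeriod touches the shared
    // counter; crediting kSamplePeriod at once keeps the total
    // accurate to within one period while cutting contended writes
    // by a factor of N.
    if (++t_events >= kSamplePeriod) {
      t_events = 0;
      g_counter.fetch_add(kSamplePeriod, std::memory_order_relaxed);
    }
  }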
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div></div><div>Other thoughts?<br></div><div><br></div><div>--kcc </div><div><br></div>