<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <span dir="ltr">&lt;<a href="mailto:dvyukov@google.com" target="_blank">dvyukov@google.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi,<br><br>This is a long thread, so I will combine several comments into a single email.<div class=""><br><br>

>> - 8-bit per-thread counters, dumping into central counters on overflow.<br></div><div class="">>The overflow will happen very quickly with 8bit counter.<br>


<br></div>Yes, but it reduces contention by 256x (a thread must execute at least 256 loop iterations between increments). In practice, if you reduce contention below some threshold, it does not represent a problem anymore.<div class="">

<br>

<br>

<br>>> - per-thread counters. Solves the problem at huge cost in RAM per-thread<br></div><div class="">>It is not practical. Especially for TLS counters -- it creates huge pressure on stack memory.<br><br></div>

Do we have any numbers about the number of counters? If there are 100K 1-byte counters, I would consider it practical.<div class=""><br></div></div></blockquote><div><br></div><div>A medium-sized app I looked at has about 10M counters (arcs only).  It is also not uncommon to see such apps running with hundreds of threads.</div>
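<div><br></div><div>To put rough numbers on that, here is a back-of-the-envelope sketch. The 200-thread figure is only an assumed stand-in for "hundreds of threads", and the 1-byte and 8-byte counter widths are illustrative choices, not measurements.</div>

```python
def per_thread_counter_bytes(num_counters, num_threads, bytes_per_counter):
    """Total memory if every thread keeps a private copy of every counter."""
    return num_counters * num_threads * bytes_per_counter

NUM_COUNTERS = 10_000_000  # "about 10M counters (arcs only)"
NUM_THREADS = 200          # assumption: stand-in for "hundreds of threads"

one_byte = per_thread_counter_bytes(NUM_COUNTERS, NUM_THREADS, 1)
eight_byte = per_thread_counter_bytes(NUM_COUNTERS, NUM_THREADS, 8)

print(f"1-byte per-thread counters: {one_byte / 1e9:.1f} GB")
print(f"8-byte per-thread counters: {eight_byte / 1e9:.1f} GB")
```

<div>Even with 1-byte counters, full per-thread replication at this scale lands in the gigabyte range under these assumptions.</div>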

<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class=""><br></div><div class=""><br>


<br><br>> In Google GCC, we implemented another technique which proves to be very effective -- it is called FDO sampling.<br>> Basically counters will be updated every N samples.<br><br></div>How does it work?</div>

</blockquote><div><br></div><div>Similar to how occurrence-based PMU sampling works. Setting the sampling period to 100 can reduce the instrumentation overhead by close to 100x without introducing much precision loss.</div><div>
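<div>A toy simulation of the idea, not the actual Google GCC implementation: one shared countdown gates the counter updates, so only every PERIOD-th instrumented event touches a counter. Scaling each sampled hit by the period is my assumption here, made so that totals stay comparable with exact counts. The class and site names are hypothetical.</div>

```python
from collections import Counter

PERIOD = 100  # sampling period mentioned above

class SampledCounters:
    """Update a site's counter only on every PERIOD-th instrumented event."""

    def __init__(self, period):
        self.period = period
        self.countdown = period
        self.counts = Counter()

    def hit(self, site):
        # Cheap path: just decrement the shared countdown.
        self.countdown -= 1
        if self.countdown == 0:
            # Expensive path, taken once per period: touch the real counter.
            self.countdown = self.period
            self.counts[site] += self.period  # assumption: scale by period

exact = Counter()
sampled = SampledCounters(PERIOD)
for i in range(1_000_000):
    site = "cold" if i % 7 == 0 else "hot"
    exact[site] += 1
    sampled.hit(site)

for site in ("hot", "cold"):
    est = sampled.counts[site]
    err = abs(est - exact[site]) / exact[site]
    print(f"{site}: exact={exact[site]} sampled={est} rel_err={err:.4%}")
```

<div>One caveat of this sketch: a fixed period can alias with program behaviour that repeats at a multiple of the period, which is why the toy workload uses a stride of 7.</div>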

 </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class=""><br><br><br>


>> It seems to me like we’re going to have a hard time getting good multithreaded performance without significant impact on the single-threaded behavior.<br></div><div class="">> I don't really agree.<br><br>

</div><div class=""><br>>We are talking about developers here. Nobody would know the exact thread counts, but developers know the ballpark number<br><br></div>I strongly believe that we must relieve developers of this choice at build time, and do our best to auto-tune (if the final scheme requires tuning).<br>

</div></blockquote><div><br></div><div><br></div><div>That really depends.   If the tuning space is small, it won't be a problem for the developer/builder. </div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div dir="ltr">


First, such questions put an unnecessary burden on developers. We don't ask what register allocation algorithm to use for each function, right?<br></div></blockquote><div><br></div><div>Crazy developers can actually do that via internal options, but this is a totally different case.  People just need one flag to turn sharding on or off. When sharding is on, the compiler can pick the best 'N' according to some heuristics at compile time.</div>
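<div><br></div><div>For concreteness, a rough sketch of what a counter with a fixed shard count 'N' looks like. This is only an illustration: the class name, the round-robin shard assignment, and the use of locks in place of hardware atomic adds are all assumptions of the sketch, not how any compiler emits it.</div>

```python
import itertools
import threading

class ShardedCounter:
    """One logical counter split into N shards to spread out contention."""

    def __init__(self, num_shards):
        self.shards = [0] * num_shards
        # Locks stand in for atomic adds (assumption of this sketch).
        self.locks = [threading.Lock() for _ in range(num_shards)]
        self._local = threading.local()
        self._next = itertools.count()

    def _shard_index(self):
        # Each thread is assigned a shard round-robin on first use.
        idx = getattr(self._local, "idx", None)
        if idx is None:
            idx = next(self._next) % len(self.shards)
            self._local.idx = idx
        return idx

    def increment(self):
        idx = self._shard_index()
        with self.locks[idx]:
            self.shards[idx] += 1

    def value(self):
        # Reading the profile sums all shards.
        return sum(self.shards)

N = 4  # the 'N' that would be fixed at compile time
counter = ShardedCounter(N)

def worker():
    for _ in range(10_000):
        counter.increment()

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value())
```

<div>The only knob is 'N': memory grows by a factor of N rather than by the thread count, and threads that map to different shards do not contend with each other.</div>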

<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Second, there is a significant chance we will get a wrong answer, because e.g. the developer's view of how threaded the app is can differ from reality or from our classification.<br>


Third, the app can be built by a build engineer; or the procedure can be applied to a base with 10'000 apps; or threaded-ness can change; or the app can work in different modes; or the app can have different phases.<div class="">

<br></div></div></blockquote><div><br></div><div>We have forgotten to mention the benefit of implementation simplicity.  If the static/simple solution solves the problem for most of the use cases, designing a fancy dynamic solution sounds like over-engineering to me.  It (the overhead reduction technique) may also get in the way of easier functional enhancement in the future.</div>

<div><br></div><div> David</div></div><br></div></div>