<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Apr 17, 2014 at 2:09 PM, Chandler Carruth <span dir="ltr"><<a href="mailto:chandlerc@google.com" target="_blank">chandlerc@google.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote"><div class="">On Thu, Apr 17, 2014 at 1:51 PM, Xinliang David Li <span dir="ltr"><<a href="mailto:xinliangli@gmail.com" target="_blank">xinliangli@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="gmail_extra">Good thinking, but why do you think runtime selection of shard count is better than compile time selection? For single threaded apps, shard count is always 1, so why paying the penalty to check thread id each time function is entered?</div>
</blockquote><div><br></div></div><div>Because extremely few applications statically decide how many threads to use in the real world (in my experience). This is even more relevant if you consider each &lt;unit of code, maybe post-inlined function&gt; independently, where you might have many threads but near 0 overlapping functions on those threads. The number of cores also changes from machine to machine, and can even change based on the particular OS mode in which your application runs.</div>
</div></div></div></blockquote><div><br></div><div><br></div><div>We are talking about developers here. Nobody knows the exact thread count, but developers do know the ballpark, which should be enough. For example: 1) my program is single-threaded; 2) my program is mostly single-threaded with some lightweight helper threads; 3) my program is heavily threaded without a single hotspot; 4) my program is heavily threaded with hotspot contention, etc. Only 4) is of concern here. Besides, the user can always check whether the instrumented build is too slow and then decide which strategy to use. For apps with distinct phases (e.g. ST->MT->ST), the proposed approach may be useful, but such apps won't be the majority.</div>
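<div>To make the compile-time option concrete, here is a rough sketch of what I have in mind. It is only an illustration: the PROF_SHARDS macro, the ShardedCounter template, and the 64-byte padding are hypothetical names and values chosen for this example, not anything that exists in the current instrumentation. With a shard count of 1, the thread id is never looked up.</div>

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical build-time knob: the developer picks the ballpark,
// e.g. -DPROF_SHARDS=1 for a single-threaded program.
#ifndef PROF_SHARDS
#define PROF_SHARDS 1
#endif

template <std::size_t NumShards>
struct ShardedCounter {
  static_assert(NumShards > 0 && (NumShards & (NumShards - 1)) == 0,
                "shard count must be a power of two");

  // Pad each shard to an assumed 64-byte cache line to limit false sharing.
  struct alignas(64) Slot {
    std::atomic<std::uint64_t> value{0};
  };
  Slot slots[NumShards];

  void increment() {
    if (NumShards == 1) {
      // Constant condition: the thread id lookup is compiled away.
      slots[0].value.fetch_add(1, std::memory_order_relaxed);
    } else {
      // The mask is a compile-time constant, so indexing stays cheap.
      std::size_t h =
          std::hash<std::thread::id>{}(std::this_thread::get_id());
      slots[h & (NumShards - 1)].value.fetch_add(
          1, std::memory_order_relaxed);
    }
  }

  // Sum all shards, e.g. when the profile is written out.
  std::uint64_t total() const {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < NumShards; ++i)
      sum += slots[i].value.load(std::memory_order_relaxed);
    return sum;
  }
};

// The counter type used by the instrumentation in this sketch.
using FunctionCounter = ShardedCounter<PROF_SHARDS>;
```

<div>A developer in scenario 4) would rebuild with a larger PROF_SHARDS; everyone else keeps the default and pays nothing extra per function entry.</div>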
<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div class="">
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="gmail_extra">For multi-threaded apps, I would expect MAX to be smaller than NUM_OF_CORES to avoid excessive memory consumption, then you always end up with N == MAX. If MAX is larger than NUM_OF_CORES, for large MT apps, the # of threads tends to be larger than NUM_OF_CORES, so it also ends up with N == MAX. For rare cases, the shard count may switch between MAX and NUM_OF_CORES, but you also pay the penalty to reallocate/memcpy counter arrays each time it changes.<br>
</div></blockquote><div><br></div></div><div>Sorry, this was just pseudo code, and very rough at that.</div><div><br></div><div>The goal was to allow programs with >1 thread but significantly fewer threads than cores to not pay (in memory) for all of the shards. There are common patterns here such as applications that are essentially single threaded, but with one or two background threads. Also, the hard compile-time max is a compile time constant, but the number of cores isn't (see above) so at least once per execution of the program, we'll need to dynamically take the min of the two.</div>
<div class="">
<div><br></div></div></div></div></div></blockquote><div>See above -- for such cases (scenario 2), the user normally has prior knowledge.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div class=""><div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="gmail_extra"></div>
<div class="gmail_extra"><br></div><div class="gmail_extra">Making N non compile time constant also makes the indexing more expensive. Of course we can ignore thread migration and do CSE on it.</div></blockquote></div></div>
<div class="gmail_extra">
<br></div>Yes, and a certain amount of this is actually fine because the whole point was to minimize contention rather than perfectly eliminate it.<br><br></div></div></blockquote><div><br></div><div>Another danger of dynamically resizing the counters is that every counter access then requires a global or per-function lock. The cost of this can be really high.</div>
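<div>A minimal sketch of why the lock is needed, assuming the shards live in a reallocatable array. The ResizableShards class and its methods are made up for this illustration and are not part of any proposal in this thread. Since resize() can free and replace the storage, increment() has to hold the same lock on every call, which is exactly the hot path.</div>

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <numeric>
#include <vector>

// Hypothetical per-function counter whose shard count can change at run time.
class ResizableShards {
 public:
  // Assumes n > 0.
  explicit ResizableShards(std::size_t n) : shards_(n, 0) {}

  // Hot path: taken on every function entry, yet it must lock because
  // resize() may reallocate shards_ concurrently.
  void increment(std::size_t thread_index) {
    std::lock_guard<std::mutex> guard(mu_);
    shards_[thread_index % shards_.size()] += 1;
  }

  // Cold path: fold existing counts into shard 0 and reallocate.
  // Assumes n > 0.
  void resize(std::size_t n) {
    std::lock_guard<std::mutex> guard(mu_);
    std::uint64_t sum =
        std::accumulate(shards_.begin(), shards_.end(), std::uint64_t{0});
    shards_.assign(n, 0);
    shards_[0] = sum;
  }

  std::uint64_t total() const {
    std::lock_guard<std::mutex> guard(mu_);
    return std::accumulate(shards_.begin(), shards_.end(), std::uint64_t{0});
  }

 private:
  mutable std::mutex mu_;  // one lock per function in this sketch
  std::vector<std::uint64_t> shards_;
};
```

<div>With a fixed, compile-time shard count the array never moves, so a plain or atomic increment with no lock is enough.</div>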
<div><br></div><div>David </div></div><br></div></div>