<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Apr 17, 2014 at 1:51 PM, Xinliang David Li <span dir="ltr">&lt;<a href="mailto:xinliangli@gmail.com" target="_blank">xinliangli@gmail.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="gmail_extra">Good thinking, but why do you think runtime selection of the shard count is better than compile-time selection? For single-threaded apps, the shard count is always 1, so why pay the penalty of checking the thread id each time a function is entered?</div>

</blockquote><div><br></div><div>Because, in my experience, extremely few real-world applications decide statically how many threads to use. This is even more relevant if you consider each unit of code (perhaps each function after inlining) independently: you might have many threads but almost no functions that run on more than one of them. The number of cores also varies from machine to machine, and can even change depending on the OS mode in which your application runs.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="gmail_extra">For multi-threaded apps, I would expect MAX to be smaller than NUM_OF_CORES to avoid excessive memory consumption, so you always end up with N == MAX. If MAX is larger than NUM_OF_CORES, then for large MT apps the number of threads tends to be larger than NUM_OF_CORES, so it also ends up with N == MAX. In rare cases the shard count may switch between MAX and NUM_OF_CORES, but then you also pay the penalty of reallocating and memcpying the counter arrays each time it changes.<br>

</div></blockquote><div><br></div><div>Sorry, this was just pseudocode, and very rough at that.</div><div><br></div><div>The goal was to let programs with more than one thread, but significantly fewer threads than cores, avoid paying (in memory) for all of the shards. This is a common pattern: applications that are essentially single-threaded but have one or two background threads. Also, the hard maximum is a compile-time constant but the number of cores isn't (see above), so at least once per execution of the program we'll need to take the minimum of the two dynamically.</div>
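<div>To make the once-per-execution step concrete, here is a minimal C++17 sketch. It assumes a hypothetical compile-time cap kMaxShards and uses std::thread::hardware_concurrency() as the core count; ShardCount and AllocateCounters are illustrative names, not part of any existing profiling runtime. It only shows taking the minimum of the cap and the core count once per process; it does not show growing the shard count with the number of threads actually observed.</div>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <thread>

// Hypothetical hard cap on the number of shards, fixed at compile time
// (for example by a compiler flag).
constexpr unsigned kMaxShards = 8;

// Shards actually used by this process. The core count is only known at run
// time, so min(kMaxShards, cores) is computed once, on first use. C++11
// guarantees this function-local static is initialized exactly once, even
// if several threads call ShardCount() concurrently.
unsigned ShardCount() {
  static const unsigned count = [] {
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;  // 0 means "not computable": use one shard.
    return std::min(kMaxShards, cores);
  }();
  assert(count >= 1 && count <= kMaxShards);
  return count;
}

// Allocates zero-initialized counter storage for one function with
// `num_counters` counters: one copy of the counters per shard.
std::unique_ptr<std::uint64_t[]> AllocateCounters(std::size_t num_counters) {
  return std::make_unique<std::uint64_t[]>(num_counters * ShardCount());
}
```

<div>On a 4-core machine this allocates 4 copies of each function's counters rather than kMaxShards copies; the compile-time cap still bounds memory on very wide machines.</div>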

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="gmail_extra"></div>

<div class="gmail_extra"><br></div><div class="gmail_extra">Making N a runtime value rather than a compile-time constant also makes the indexing more expensive. Of course, we can ignore thread migration and CSE the index computation.</div></blockquote></div><div class="gmail_extra">

<br></div>Yes, and some contention from a stale shard index after a migration is actually fine, because the whole point was to minimize contention rather than eliminate it perfectly.<br><br></div></div>
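<div>A minimal C++17 sketch of the increment path discussed above, with hypothetical names throughout (ShardedCounters, InstrumentedFunction). It swaps in a hash of std::this_thread::get_id() for the shard index; a per-CPU index would also work but can go stale when a thread migrates, which only costs some extra contention. The index is computed once on function entry and reused for every counter bump in that function, which is the CSE mentioned in the quoted text. Each shard starts on its own cache line (64 bytes is an assumption) so that threads on different shards don't share lines, and a counter's profile value is the sum of its copies across shards.</div>

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <thread>

// Hypothetical sharded counter table for one function. Each shard holds its
// own copy of every counter, and each shard begins on a fresh cache line.
class ShardedCounters {
  static constexpr std::size_t kLineBytes = 64;  // assumed cache-line size
  static constexpr std::size_t kPerLine =
      kLineBytes / sizeof(std::atomic<std::uint64_t>);
  struct alignas(kLineBytes) Line {
    std::atomic<std::uint64_t> c[kPerLine];
  };

 public:
  // make_unique<Line[]> value-initializes, so every counter starts at 0.
  ShardedCounters(std::size_t num_counters, std::size_t shards)
      : num_counters_(num_counters),
        shards_(shards),
        lines_per_shard_((num_counters + kPerLine - 1) / kPerLine),
        lines_(std::make_unique<Line[]>(lines_per_shard_ * shards)) {
    assert(shards > 0);
  }

  // Meant to be called once per function entry; the result is reused for
  // every counter in that function body.
  std::size_t ShardForCurrentThread() const {
    return std::hash<std::thread::id>{}(std::this_thread::get_id()) % shards_;
  }

  // Relaxed ordering is enough: only the final totals matter.
  void Increment(std::size_t shard, std::size_t counter) {
    assert(shard < shards_ && counter < num_counters_);
    Slot(shard, counter).fetch_add(1, std::memory_order_relaxed);
  }

  // The profile value of a counter is the sum of its copies across shards.
  std::uint64_t Total(std::size_t counter) const {
    assert(counter < num_counters_);
    std::uint64_t sum = 0;
    for (std::size_t s = 0; s < shards_; ++s)
      sum += Slot(s, counter).load(std::memory_order_relaxed);
    return sum;
  }

 private:
  std::atomic<std::uint64_t>& Slot(std::size_t shard,
                                   std::size_t counter) const {
    return lines_[shard * lines_per_shard_ + counter / kPerLine]
        .c[counter % kPerLine];
  }

  std::size_t num_counters_;
  std::size_t shards_;
  std::size_t lines_per_shard_;
  std::unique_ptr<Line[]> lines_;
};

// What instrumented code for a function with two counters might look like.
void InstrumentedFunction(ShardedCounters& counters, bool take_branch) {
  const std::size_t shard = counters.ShardForCurrentThread();  // once per entry
  counters.Increment(shard, 0);                    // entry counter
  if (take_branch) counters.Increment(shard, 1);   // branch counter
}
```

<div>Threads that hash to the same shard still share cache lines, so this design reduces contention rather than eliminating it, which matches the goal stated above.</div>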