<div dir="ltr">Hi,<br><br>This is long thread, so I will combine several comments into single email.<br><br>>> - 8-bit per-thread counters, dumping into central counters on overflow.<br>>The overflow will happen very quickly with 8bit counter.<br>


<br>Yes, but it reduces contention by 256x (a thread must execute at least 256 loop iterations between increments). In practice, if you reduce contention below some threshold, it does not represent a problem anymore.<br>

<br>

<br>>> - per-thread counters. Solves the problem at huge cost in RAM per-thread<br>>It is not practical. Especially for TLS counters -- it creates huge pressure on stack memory.<br><br>Do we have any numbers about number of counters? If there are 100K 1-byte counters, I would consider it as practical.<br>


<br> <br>> - self-cooling logarithmic counters: if ((fast_random() % (1 << counter)) == 0) counter++;<br><br>This option did not receive any attention. While it has O(1) memory consumption and reduces contention.<br>


Note that there are several tunables here: first -- how implement fast_rand: it can be a per-thread LCG, or per-thread counter or set of counter, or something else; second -- frequency of increment, e.g. it actually can be additive and/or bounded (because if we reduce frequency of increments by 1000x, this must be enough).<br>


<br><br>> In Google GCC, we implemented another technique which proves to be very effective -- it is called FDO sampling.<br>> Basically counters will be updated every N samples.<br><br>How does it work?<br><br><br>


>> It seems to me like we’re going to have a hard time getting good multithreaded performance without significant impact on the single-threaded behavior.<br>> I don't really agree.<br><br>Disagreement seconded.<br>


For single-threaded case we are talking about constant overhead, while for multithreaded case -- about superlinear overhead.<br>Having said that, if there is a *simple* way to modify the final scheme to significantly reduce single-threaded overheads, then or course it's worth doing. But it's not the cornerstone.<br>


<br><br>> MAX is a fixed cap so even on systems with 100s of cores we don't do something silly.<br><br>Makes not sense to me.<br>Why do you want only systems with up to 100 cores to be fast and scalable? 100+ core system *especially* need good scalability (contention tends to be superlinear).<br>


What you are proposing is basically: If I have 10 engineers in my company, I probably want to give them 10 working desks as well. But let's not go insane. If I have 1000 engineers, 100 desks must be enough for them. This must reduce costs.<br>


The baseline memory consumption for systems (and amount of RAM!) is O(NCORES), not O(1). In some read-mostly cases it's possible to achieve O(1) memory consumption, and that's great. But if it's not the case here, let it be so.<br>


<br><br>> shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES))<br><br>Threads do not produce contention, it's cores that produce contention.<br>The formula must be:  shard_count = k*NCORES<br>

And if you want less memory in single-threaded case, then: shard_count = min(k*NCORES, c*NTHREADS)<br>

<br><br>>We are talking about developers here. Nobody would know the exact thread counts, but developers know the ballpark number<br><br>I strongly believe that we must relief developers from this choice during build time, and do our best to auto-tune (if the final scheme requires tuning).<br>


First, such questions puts unnecessary burden on developers. We don't ask what register allocation algorithm to use for each function, right?<br>Second, there are significant chances we will get a wrong answer, because e.g. developer's view of how threaded is the app can differ from reality or from our classification.<br>


Third, the app can be build by a build engineer; or the procedure can be applied to a base with 10'000 apps; or threaded-ness can change; or the app can work in different modes; or the app can have different phases.<br>


<br><br>>Another danger involved with dynamically resizing the counter is that it requires a global or per function lock to access the counters. The cost of this can be really high.<span class=""></span><span class=""></span><br>


<div><br></div><div>It's possible to do it w/o a mutex, fast-path overhead is only a virtually zero overhead (if implemented properly in the compiler) atomic consume load:</div><div><br></div><div><div>const int maxshard = 4096;</div>


<div>uint64* shard[maxshard];</div><div>atomic<int> shardmask;</div><div><br></div><div>void inline inccounter(int idx)</div><div>{</div><div><span class="" style="white-space:pre">   </span>int shardidx = gettid() & atomic_load(&shardmask, memory_order_consume);</div>


<div><span class="" style="white-space:pre">    </span>shard[shardidx][idx]++;</div><div>}</div><div><br></div><div>int pthread_create(...)</div><div>{</div><div><span class="" style="white-space:pre">     </span>if (updateshardcount()) {</div>


<div><span class="" style="white-space:pre">            </span>shardlock();</div><div><span class="" style="white-space:pre">               </span>if (updateshardcount()) {</div><div><span class="" style="white-space:pre">                  </span>int newcount = computeshardcount();</div>


<div><span class="" style="white-space:pre">                    </span>for (int i = oldcount; i < newcount; i++)</div><div><span class="" style="white-space:pre">                               </span>shard[i] = malloc(ncounter*sizeof(uint64));</div><div><span class="" style="white-space:pre">                        </span>atomic_store(&shardmask, newcount-1, memory_order_release);</div>


<div><span class="" style="white-space:pre">            </span>}</div><div><span class="" style="white-space:pre">          </span>shardunlock();</div><div><span class="" style="white-space:pre">     </span>}</div><div><span class="" style="white-space:pre">  </span>...<span class="" style="white-space:pre">       </span></div>


<div>}</div></div><div><br></div><div><br></div></div>