[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Chandler Carruth chandlerc at google.com
Thu Apr 17 14:09:15 PDT 2014


On Thu, Apr 17, 2014 at 1:51 PM, Xinliang David Li <xinliangli at gmail.com> wrote:

> Good thinking, but why do you think runtime selection of the shard count
> is better than compile-time selection? For single-threaded apps, the shard
> count is always 1, so why pay the penalty of checking the thread id each
> time a function is entered?
>

Because extremely few applications statically decide how many threads to
use in the real world (in my experience). This is even more relevant if you
consider each <unit of code, maybe post-inlined function> independently,
where you might have many threads but near 0 overlapping functions on those
threads. The number of cores also changes from machine to machine, and can
even change based on the particular OS mode in which your application runs.


> For multi-threaded apps, I would expect MAX to be smaller than
> NUM_OF_CORES to avoid excessive memory consumption, so you always end up
> with N == MAX. If MAX is larger than NUM_OF_CORES, then for large MT apps
> the number of threads tends to be larger than NUM_OF_CORES, so it also
> ends up with N == MAX. In rare cases, the shard count may switch between
> MAX and NUM_OF_CORES, but you also pay the penalty of reallocating and
> memcpying the counter arrays each time it changes.
>

Sorry, this was just pseudo code, and very rough at that.

The goal was to allow programs with >1 thread but significantly fewer
threads than cores to not pay (in memory) for all of the shards. There are
common patterns here, such as applications that are essentially
single-threaded but have one or two background threads. Also, the hard
max is a compile-time constant, but the number of cores isn't (see above),
so at least once per execution of the program we'll need to dynamically
take the min of the two.
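
To make the "min of the two" concrete, here is a minimal sketch, not the
actual instrumentation runtime. The names kMaxShards, shardCountFor and
shardCount are hypothetical, and the cap of 8 is an arbitrary assumption.
The only point is that the core count is read once per execution and
clamped against the compile-time maximum.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Hypothetical compile-time cap on the number of shards (assumed value).
constexpr std::size_t kMaxShards = 8;

// Pure helper: clamp a core count into the range [1, kMaxShards].
// std::thread::hardware_concurrency() may return 0 when the value is
// unknown, so 0 is treated as a single core.
std::size_t shardCountFor(std::size_t cores) {
  const std::size_t atLeastOne = std::max<std::size_t>(cores, 1);
  return std::min<std::size_t>(atLeastOne, kMaxShards);
}

// Evaluated once per execution; later calls reuse the cached result.
std::size_t shardCount() {
  static const std::size_t count =
      shardCountFor(std::thread::hardware_concurrency());
  return count;
}
```

A real runtime would presumably also lower this bound when only a few
threads ever exist, which is the memory saving described above; that part
is left out of the sketch.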


> Making N a non-compile-time constant also makes the indexing more
> expensive. Of course we can ignore thread migration and do CSE on it.
>

Yes, and a certain amount of this is actually fine because the whole point
was to minimize contention rather than perfectly eliminate it.
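
As a rough illustration of that trade-off, the sketch below picks a shard
index once per thread and reuses it, so two threads can land on the same
shard and still contend a little. Everything here is an assumption for
illustration: the names kShards, PaddedCounter, threadShard, ShardedCounter
and countWithThreads are hypothetical, the 64-byte padding assumes a typical
cache-line size, and relaxed atomics are used only to keep the example free
of data races. It is not the scheme clang's instrumentation uses.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kShards = 4;  // hypothetical fixed shard count

// Pad each shard to its own (assumed 64-byte) cache line so that
// increments on different shards do not falsely share a line.
struct alignas(64) PaddedCounter {
  std::atomic<std::uint64_t> value{0};
};

// The shard index is chosen once per thread and then reused, which plays
// the role of hoisting/CSE-ing the index computation. Threads may collide
// on a shard when there are more threads than shards; that is tolerated.
inline std::size_t threadShard() {
  static std::atomic<std::size_t> next{0};
  thread_local const std::size_t shard =
      next.fetch_add(1, std::memory_order_relaxed) % kShards;
  return shard;
}

class ShardedCounter {
 public:
  void increment() {
    shards_[threadShard()].value.fetch_add(1, std::memory_order_relaxed);
  }

  // Sum over all shards; exact once all writers have finished.
  std::uint64_t total() const {
    std::uint64_t sum = 0;
    for (const PaddedCounter &s : shards_)
      sum += s.value.load(std::memory_order_relaxed);
    return sum;
  }

 private:
  PaddedCounter shards_[kShards];
};

// Helper for trying it out: numThreads threads each add perThread counts.
std::uint64_t countWithThreads(std::size_t numThreads,
                               std::uint64_t perThread) {
  ShardedCounter counter;
  std::vector<std::thread> threads;
  for (std::size_t t = 0; t < numThreads; ++t)
    threads.emplace_back([&counter, perThread] {
      for (std::uint64_t i = 0; i < perThread; ++i) counter.increment();
    });
  for (std::thread &t : threads) t.join();
  return counter.total();
}
```

With 8 threads and 4 shards, pairs of threads share a shard, so contention
is reduced rather than eliminated, which is the level of precision being
argued for here.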