[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Xinliang David Li xinliangli at gmail.com
Thu Apr 17 14:33:14 PDT 2014


On Thu, Apr 17, 2014 at 2:09 PM, Chandler Carruth <chandlerc at google.com> wrote:

>
> On Thu, Apr 17, 2014 at 1:51 PM, Xinliang David Li <xinliangli at gmail.com> wrote:
>
>> Good thinking, but why do you think runtime selection of the shard count
>> is better than compile-time selection? For single-threaded apps the shard
>> count is always 1, so why pay the penalty of checking the thread id every
>> time a function is entered?
>>
>
> Because extremely few applications statically decide how many threads to
> use in the real world (in my experience). This is even more relevant if you
> consider each <unit of code, maybe post-inlined function> independently,
> where you might have many threads but near 0 overlapping functions on those
> threads. The number of cores also changes from machine to machine, and can
> even change based on the particular OS mode in which your application runs.
>


We are talking about developers here. Nobody knows the exact thread count,
but developers know the ballpark number, which should be enough. E.g.:
1) my program is single-threaded; 2) my program is mostly single-threaded
with some lightweight helper threads; 3) my program is heavily threaded
without a single hotspot; 4) my program is heavily threaded with hotspot
contention; etc. Only 4) is of concern here. Besides, the user can always
find out that the instrumented build is too slow and decide which strategy
to use. For apps with distinct phases (e.g. ST->MT->ST), the proposed
approach may be useful, but such apps won't be the majority.
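The compile-time strategy argued for above could look roughly like the
following sketch. This is illustrative only, not the actual compiler-rt
implementation: the names (SHARD_COUNT, count_edge, read_counter) and the
idea of hashing the thread id into a shard are all assumptions for the
sake of the example.

```c
#include <pthread.h>
#include <stdint.h>

/* Compile-time shard count, chosen from the developer's ballpark
 * knowledge of the app (1 for single-threaded programs). */
#define SHARD_COUNT 8

#define NUM_COUNTERS 64 /* counters for one function, say */

static uint64_t counters[SHARD_COUNT][NUM_COUNTERS];

/* Increment one profile counter in the current thread's shard.
 * With a power-of-two constant SHARD_COUNT, the modulo folds to a
 * cheap mask at compile time. Casting pthread_t to an integer is
 * not portable in general, but works on common platforms. */
static inline void count_edge(unsigned counter_idx) {
    unsigned shard = (unsigned)(uintptr_t)pthread_self() % SHARD_COUNT;
    __atomic_fetch_add(&counters[shard][counter_idx], 1, __ATOMIC_RELAXED);
}

/* At dump time, the shards are summed back into a single value. */
uint64_t read_counter(unsigned counter_idx) {
    uint64_t sum = 0;
    for (unsigned s = 0; s < SHARD_COUNT; ++s)
        sum += counters[s][counter_idx];
    return sum;
}
```

Because SHARD_COUNT is a constant, the fast path is one hash, one mask,
and one relaxed atomic add, with no runtime check of the core count.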



>
>> For multi-threaded apps, I would expect MAX to be smaller than
>> NUM_OF_CORES to avoid excessive memory consumption, so you always end up
>> with N == MAX. If MAX is larger than NUM_OF_CORES, then for large MT apps
>> the number of threads tends to be larger than NUM_OF_CORES, so it also
>> ends up with N == MAX. In rare cases the shard count may switch between
>> MAX and NUM_OF_CORES, but then you also pay the penalty of
>> reallocating/memcpy'ing the counter arrays each time it changes.
>>
>
> Sorry, this was just pseudo code, and very rough at that.
>
> The goal was to allow programs with >1 thread but significantly fewer
> threads than cores to not pay (in memory) for all of the shards. There are
> common patterns here such as applications that are essentially single
> threaded, but with one or two background threads. Also, the hard
> compile-time max is a compile time constant, but the number of cores isn't
> (see above) so at least once per execution of the program, we'll need to
> dynamically take the min of the two.
>
See above -- for such cases (scenario 2), the user normally has prior
knowledge.
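The runtime-selection idea quoted above (a hard compile-time cap, with
the actual shard count taken as the min of that cap and the core count,
computed once per execution) could be sketched as follows. The names are
illustrative assumptions, not the actual proposal's code.

```c
#include <unistd.h>

enum { MAX_SHARDS = 16 }; /* hard compile-time cap on shard memory */

static unsigned shard_count; /* 0 until first use */

/* Compute N = min(MAX_SHARDS, online cores) once, lazily. Since the
 * core count is only known at runtime, this check cannot be folded
 * away at compile time; the benefit is that a program with few
 * threads on a small machine doesn't pay for MAX_SHARDS shards. */
unsigned get_shard_count(void) {
    if (shard_count == 0) {
        long cores = sysconf(_SC_NPROCESSORS_ONLN);
        if (cores < 1)
            cores = 1; /* sysconf can fail; fall back to one shard */
        shard_count = (cores < MAX_SHARDS) ? (unsigned)cores : MAX_SHARDS;
    }
    return shard_count;
}
```

Note that because shard_count is written lazily from potentially many
threads, a real implementation would initialize it with an atomic or a
once-routine; the race here is benign only because every thread computes
the same value.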


>
>> Making N a non-compile-time constant also makes the indexing more
>> expensive. Of course, we can ignore thread migration and CSE the shard
>> computation.
>>
>
> Yes, and a certain amount of this is actually fine because the whole point
> was to minimize contention rather than perfectly eliminate it.
>
>
Another danger of dynamically resizing the counter arrays is that it
requires a global or per-function lock to access the counters. The cost of
this can be really high.

David

