[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Xinliang David Li xinliangli at gmail.com
Thu Apr 17 13:51:52 PDT 2014


Good thinking, but why do you think runtime selection of the shard count is
better than compile-time selection? For single-threaded apps, the shard count
is always 1, so why pay the penalty of checking the thread id each time a
function is entered?

For multi-threaded apps, I would expect MAX to be smaller than NUM_OF_CORES
to avoid excessive memory consumption, so you always end up with N ==
MAX. If MAX is larger than NUM_OF_CORES, then for large MT apps the # of
threads tends to be larger than NUM_OF_CORES, so you also end up with N ==
MAX. In rare cases the shard count may switch between MAX and
NUM_OF_CORES, but then you also pay the penalty of reallocating and copying
the counter arrays each time it changes.

Making N a non-compile-time constant also makes the indexing more expensive.
Of course, we can ignore thread migration and do CSE on it.
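
To make the indexing cost concrete, here is a minimal sketch, not the actual
instrumentation code. The names kShards, g_shard_count, shard_index_static and
shard_index_dynamic are hypothetical, and the shard count of 8 is only an
assumed value.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical names; this is not the real profiling runtime.
constexpr std::size_t kShards = 8;  // compile-time shard count (power of two)
std::size_t g_shard_count = 8;      // runtime shard count, may change later

// With a constant power-of-two N, "tid % N" can be lowered to a single
// AND with N - 1; no load and no divide are needed.
inline std::size_t shard_index_static(std::uint64_t tid) {
  return static_cast<std::size_t>(tid % kShards);
}

// With a runtime N, each evaluation has to load the global and do an
// integer remainder, unless the compiler can reuse an earlier result (CSE).
inline std::size_t shard_index_dynamic(std::uint64_t tid) {
  return static_cast<std::size_t>(tid % g_shard_count);
}
```

Both functions return the same index while g_shard_count equals kShards; the
difference is only in what the compiler can do with the modulo.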

David

On Thu, Apr 17, 2014 at 1:06 PM, Chandler Carruth <chandlerc at google.com> wrote:

> Having thought a bit about the best strategy to solve this, I think we
> should use a tradeoff of memory to reduce contention. I don't really like
> any of the other options as much, if we can get that one to work. Here is
> my specific suggestion:
>
> On Thu, Apr 17, 2014 at 5:21 AM, Kostya Serebryany <kcc at google.com> wrote:
>
>> - per-cpu counters (not portable, requires very modern kernel with lots
>> of patches)
>> - sharded counters: each counter represented as N counters sitting in
>> different cache lines. Every thread accesses the counter with index TID%N.
>> Solves the problem partially, better with larger values of N, but then
>> again it costs RAM.
>>
>
> I think we should combine these somewhat.
>
> At an abstract level, I think we should design the profiling to support up
> to N shards of counters.
>
> I think we should have a dynamic number of shards if possible. The goal
> here is that if you never need multiple shards (single threaded) you pay
> essentially zero cost. I would have a global number of shards that changes
> rarely, and re-compute it on entry to each function with something along
> the lines of:
>
> if (thread-ID != main's thread-ID && shard_count == 1) {
>   shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
>   // if shard_count changed with this, we can also call a library routine
>   // here that does the work of allocating the actual extra shards.
> }
>
> MAX is a fixed cap so even on systems with 100s of cores we don't do
> something silly. NUMBER_OF_THREADS, if supported on the OS, can limit the
> shards when we only have a small number of threads in the program.
> NUMBER_OF_CORES, if supported on the OS, can limit the shards. If we don't
> have the number of threads, we just use the number of cores. If we don't
> have the number of cores, we can just guess 8 (or something).
>
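
For reference, the selection rule quoted above can be sketched as follows. It
is written exactly as the formula reads, min(MAX, max(threads, cores)), with
the fallbacks described in the paragraph. The cap of 16 and the names
kMaxShards, kFallbackCores and pick_shard_count are assumptions for
illustration. The thread count is passed in because there is no portable way
to query it.

```cpp
#include <algorithm>
#include <thread>

constexpr unsigned kMaxShards = 16;     // hypothetical value for MAX
constexpr unsigned kFallbackCores = 8;  // the "just guess 8" fallback

// A value of 0 means "the OS could not report this number".
unsigned pick_shard_count(unsigned num_threads, unsigned num_cores) {
  if (num_cores == 0) num_cores = kFallbackCores;  // no core count: guess
  if (num_threads == 0) num_threads = num_cores;   // no thread count: use cores
  return std::min(kMaxShards, std::max(num_threads, num_cores));
}

// std::thread::hardware_concurrency() returns 0 when the value is unknown,
// which matches the convention used above.
unsigned pick_shard_count_here(unsigned num_threads) {
  return pick_shard_count(num_threads, std::thread::hardware_concurrency());
}
```

With these assumed values, 100 threads on 4 cores gives 16 shards (the cap),
and 2 threads on 4 cores gives 4.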



>
> Then, we can gracefully fall back on the following strategies to pick an
> index into the shards:
>
> - Per-core non-atomic counter updates (if we support them) using
> restartable sequences
> - Use a core-ID as the index into the shards to statistically minimize the
> contention, and do the increments atomically so it remains correct even if
> the core-ID load happens before a core migration and the counter increment
> occurs afterward
> - Use (thread-ID % number of cores) if we don't have support for getting a
> core-ID from the target OS. This will still have a reasonable distribution
> I suspect, even if not perfect.
>
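
A rough sketch of the last two fallbacks, assuming a fixed shard count and a
64-byte cache line: each shard sits in its own cache line, increments are
atomic so a stale index is still correct, and the index comes from a hash of
the thread id because C++ has no portable core-ID query. All names here
(kNumShards, Shard, ShardedCounter, this_thread_shard) are hypothetical.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

constexpr std::size_t kNumShards = 8;   // assumed fixed shard count
constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

// One counter slot per cache line, so shards do not share a line.
struct alignas(kCacheLine) Shard {
  std::atomic<std::uint64_t> value{0};
};

struct ShardedCounter {
  std::array<Shard, kNumShards> shards;

  // Atomic increment: still correct if the caller's index is stale,
  // for example after the thread migrated to another core.
  void increment(std::size_t shard) {
    shards[shard % kNumShards].value.fetch_add(1, std::memory_order_relaxed);
  }

  // Summing the shards is the merge step; it can be done offline instead.
  std::uint64_t total() const {
    std::uint64_t sum = 0;
    for (const Shard& s : shards) sum += s.value.load(std::memory_order_relaxed);
    return sum;
  }
};

// Thread-id based index, computed once per thread and then cached.
inline std::size_t this_thread_shard() {
  thread_local const std::size_t idx =
      std::hash<std::thread::id>{}(std::this_thread::get_id()) % kNumShards;
  return idx;
}
```

A call site would then do counter.increment(this_thread_shard()); total() is
only needed when the shards are merged.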
>
> Finally, I wouldn't merge on shutdown if possible. I would just write
> larger raw profile data for multithreaded runs, and let the profdata tool
> merge them.
>
>
> If this is still too much memory, then I would suggest doing the above,
> but doing it independently for each function so that only those functions
> actually called via multithreaded code end up sharding their counters.
>
>
> I think this would be reasonably straightforward to implement, not
> significantly grow the cost of single-threaded instrumentation, and largely
> mitigate the contention on the counters. It can benefit from advanced hooks
> into the OS when those are available, but seems pretty easy to implement on
> roughly any OS with reasonable tradeoffs. The only really hard requirement
> is the ability to get a thread-id, but I think that is fairly reasonable
> (C++ even makes this essentially mandatory).
>
> Thoughts?