[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Dmitry Vyukov dvyukov at google.com
Fri Apr 18 00:44:37 PDT 2014


On Fri, Apr 18, 2014 at 11:32 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

> On Fri, Apr 18, 2014 at 11:13 AM, Dmitry Vyukov <dvyukov at google.com> wrote:
>
>> Hi,
>>
>> This is a long thread, so I will combine several comments into a single email.
>>
>>
>> >> - 8-bit per-thread counters, dumping into central counters on overflow.
>> >The overflow will happen very quickly with 8bit counter.
>>
>> Yes, but it reduces contention on the central counters by 256x (a thread
>> must execute at least 256 increments of a local counter between updates of
>> the corresponding central counter). In practice, once you reduce contention
>> below some threshold, it stops being a problem.
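>>
>> A minimal sketch of this scheme (types, names and the counter count are
>> illustrative; a real implementation would also flush the per-thread
>> residue at thread exit and at dump time):
>>
>> const int ncounter = 100000;       // assumption: ~100K counters
>> __thread uint8 local[ncounter];    // 8-bit per-thread counters, ~100KB/thread
>> atomic<uint64> central[ncounter];  // shared counters, touched 256x less often
>>
>> void inline inccounter(int idx)
>> {
>>   if (++local[idx] == 0)           // wrapped around: 256 increments accumulated
>>     atomic_fetch_add(&central[idx], 256, memory_order_relaxed);
>> }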
>>
>>
>>
>> >> - per-thread counters. Solves the problem at huge cost in RAM
>> per-thread
>> >It is not practical. Especially for TLS counters -- it creates huge
>> pressure on stack memory.
>>
>> Do we have any numbers on the number of counters? If there are 100K 1-byte
>> counters, that's only ~100KB per thread, which I would consider practical.
>>
>>
>>
>> > - self-cooling logarithmic counters: if ((fast_random() % (1 <<
>> counter)) == 0) counter++;
>>
>> This option did not receive any attention, even though it has O(1) memory
>> consumption and reduces contention.
>> Note that there are several tunables here: first -- how to implement
>> fast_random: it can be a per-thread LCG, or a per-thread counter or set of
>> counters, or something else; second -- the frequency of increments, e.g. it
>> can actually be additive and/or bounded (because if we reduce the frequency
>> of increments by 1000x, that must be enough).
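>>
>> For example, a sketch with a per-thread LCG as fast_random (the LCG
>> constants are the classic ones; everything else is illustrative):
>>
>> __thread uint32 rngstate;          // seeded per thread at thread start
>>
>> uint32 inline fast_random()
>> {
>>   rngstate = rngstate * 1103515245 + 12345;   // cheap LCG step, no atomics
>>   return rngstate >> 16;                      // high bits are more random
>> }
>>
>> void inline inccounter(uint8* counter)
>> {
>>   // increment with probability 2^-counter, so counter ~ log2(exec count)
>>   if ((fast_random() % (1 << *counter)) == 0)
>>     (*counter)++;
>> }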
>>
>>
>>
>> > In Google GCC, we implemented another technique which proves to be very
>> effective -- it is called FDO sampling.
>> > Basically counters will be updated every N samples.
>>
>> How does it work?
>>
>>
>>
>> >> It seems to me like we’re going to have a hard time getting good
>> multithreaded performance without significant impact on the single-threaded
>> behavior.
>> > I don't really agree.
>>
>> Disagreement seconded.
>> In the single-threaded case we are talking about constant overhead, while
>> in the multithreaded case -- about superlinear overhead.
>> Having said that, if there is a *simple* way to modify the final scheme
>> to significantly reduce single-threaded overheads, then of course it's
>> worth doing. But it's not the cornerstone.
>>
>>
>>
>> > MAX is a fixed cap so even on systems with 100s of cores we don't do
>> something silly.
>>
>> Makes no sense to me.
>> Why do you want only systems with up to 100 cores to be fast and
>> scalable? 100+ core systems *especially* need good scalability (contention
>> tends to be superlinear).
>> What you are proposing is basically: If I have 10 engineers in my
>> company, I probably want to give them 10 working desks as well. But let's
>> not go insane. If I have 1000 engineers, 100 desks must be enough for them.
>> This must reduce costs.
>> The baseline memory consumption for systems (and amount of RAM!) is
>> O(NCORES), not O(1). In some read-mostly cases it's possible to achieve
>> O(1) memory consumption, and that's great. But if it's not the case here,
>> let it be so.
>>
>>
>>
>> > shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS,
>> NUMBER_OF_CORES))
>>
>> Threads do not produce contention; it's cores that produce contention.
>> The formula must be: shard_count = k*NCORES
>> And if you want less memory in the single-threaded case, then: shard_count =
>> min(k*NCORES, c*NTHREADS)
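>>
>> For example (a sketch: k and c are tuning constants whose exact values are
>> a guess, nthreads is an assumed count of live threads maintained in
>> pthread_create, and maxshard is defined in the code below; this would be
>> the computeshardcount() used there):
>>
>> int computeshardcount()
>> {
>>   const int k = 4, c = 2;
>>   int ncores = (int)sysconf(_SC_NPROCESSORS_ONLN);   // POSIX core count
>>   int count = std::max(1, std::min(k * ncores, c * nthreads));
>>   int pow2 = 1;
>>   while (pow2 < count)   // round up to a power of 2, usable as a mask
>>     pow2 *= 2;
>>   return std::min(pow2, maxshard);
>> }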
>>
>>
>>
>> >We are talking about developers here. Nobody would know the exact thread
>> counts, but developers know the ballpark number
>>
>> I strongly believe that we must relieve developers of this choice at build
>> time, and do our best to auto-tune (if the final scheme requires tuning).
>> First, such questions put an unnecessary burden on developers. We don't ask
>> what register allocation algorithm to use for each function, right?
>> Second, there is a significant chance we will get a wrong answer, because
>> e.g. the developer's view of how threaded the app is can differ from
>> reality or from our classification.
>> Third, the app can be built by a build engineer; or the procedure can be
>> applied to a codebase with 10'000 apps; or threaded-ness can change; or the
>> app can work in different modes; or the app can have different phases.
>>
>>
>>
>> >Another danger involved with dynamically resizing the counter is that it
>> requires a global or per function lock to access the counters. The cost of
>> this can be really high.
>>
>> It's possible to do it w/o a mutex on the fast path; the only fast-path
>> overhead is an atomic consume load, which is virtually free (if implemented
>> properly in the compiler, it compiles to a plain load on most
>> architectures):
>>
>> const int maxshard = 4096;   // shard count is always a power of 2 (mask below)
>> uint64* shard[maxshard];     // shard[0] is pre-allocated before first use
>> atomic<int> shardmask;       // current shard count - 1
>>
>> void inline inccounter(int idx)
>> {
>>   int shardidx = gettid() & atomic_load(&shardmask, memory_order_consume);
>>   shard[shardidx][idx]++;
>> }
>>
>> int pthread_create(...)
>> {
>>   if (updateshardcount()) {      // racy check, rechecked under the lock
>>     shardlock();
>>     if (updateshardcount()) {
>>       int oldcount = atomic_load(&shardmask, memory_order_relaxed) + 1;
>>       int newcount = computeshardcount();   // a power of 2, see above
>>       for (int i = oldcount; i < newcount; i++)
>>         shard[i] = (uint64*)calloc(ncounter, sizeof(uint64));  // zeroed counters
>>       // publish: this release store pairs with the consume load in inccounter
>>       atomic_store(&shardmask, newcount-1, memory_order_release);
>>     }
>>     shardunlock();
>>   }
>>   ...
>> }
>>
>>
>>
> If we go with this scheme, for tid I would do:
>
> __thread int threadid;
>
> int pthread_create(...)
> {
>     threadid = goodhash(gettid());  // any good hash function
>     ...
> }
>
> void userfunc()
> {
>     int tid = threadid + 1;  // advance to a new shard on every call
>     threadid = tid;
>

Kostya noted that this increment is a bad idea, because each thread will
eventually touch all possible cache lines of every counter. I agree, this is
a bad idea.

What bothers me is that if we do tid%something, there is a non-zero chance
that a significant number of active threads will be mapped onto the same
shard. And then we are back to where we started -- significant contention.

Ideally it would be something like:

on_thread_rescheduling()
{
  tid = getcpuid();
}

but common OSes do not provide the necessary interfaces.
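
On Linux one could approximate this with sched_getcpu() (a real glibc call,
vDSO-backed, needs _GNU_SOURCE), refreshed only periodically. A sketch of
that approximation, reusing shardmask from the scheme above (the refresh
period is arbitrary):

#include <sched.h>               // sched_getcpu

__thread int cachedshard;        // shard index cached per thread
__thread int refresh;            // countdown until the next refresh

int inline currentshard()
{
  // Even a vDSO-backed sched_getcpu() is too expensive to call on every
  // increment, so refresh the cached CPU id only occasionally.
  if (--refresh <= 0) {
    refresh = 1000;              // arbitrary refresh period
    cachedshard = sched_getcpu() & atomic_load(&shardmask, memory_order_consume);
  }
  return cachedshard;
}

This is only an approximation: between refreshes a thread may migrate to
another CPU, but the mapping stays mostly contention-free.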




>     // use tid in profiling
>     ...
> }
>
> This avoids potentially expensive gettid() calls, and avoids possible
> persistent interference between tids.
>
>