[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Kostya Serebryany kcc at google.com
Fri Apr 18 00:50:46 PDT 2014


On Fri, Apr 18, 2014 at 11:44 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

> On Fri, Apr 18, 2014 at 11:32 AM, Dmitry Vyukov <dvyukov at google.com>wrote:
>
>> On Fri, Apr 18, 2014 at 11:13 AM, Dmitry Vyukov <dvyukov at google.com>wrote:
>>
>>> Hi,
>>>
>>> This is a long thread, so I will combine several comments into a single
>>> email.
>>>
>>>
>>> >> - 8-bit per-thread counters, dumping into central counters on
>>> overflow.
>>> >The overflow will happen very quickly with an 8-bit counter.
>>>
>>> Yes, but it reduces contention by 256x (a thread must perform at least
>>> 256 local increments between updates of a shared counter). In practice,
>>> once contention is reduced below some threshold, it stops being a problem.
>>>
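A minimal sketch of how the 8-bit scheme could look, assuming one
thread-local byte per counter that is flushed into a shared 64-bit counter
on wraparound (all names and sizes below are illustrative, not from this
thread):

#include <atomic>
#include <cstdint>

enum { kNumCounters = 100000 };                  // assumed counter count
std::atomic<uint64_t> central[kNumCounters];     // shared, touched rarely
__thread uint8_t local[kNumCounters];            // per-thread, contention-free

inline void inc(int idx) {
  if (++local[idx] == 0)                         // wrapped: 256 increments done
    central[idx].fetch_add(256, std::memory_order_relaxed);
}

(The residue left in local[] at thread exit or dump time still has to be
flushed; that part is omitted here.)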
>>>
>>>
>>> >> - per-thread counters. Solves the problem at huge cost in RAM
>>> per-thread
>>> >It is not practical. Especially for TLS counters -- it creates huge
>>> pressure on stack memory.
>>>
>>> Do we have any numbers on the number of counters? If there are 100K
>>> 1-byte counters, I would consider that practical.
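For scale (my arithmetic, not a number from this thread): 100K one-byte
counters are roughly 100 KB of TLS per thread, so even 1,000 threads would
cost on the order of 100 MB in total.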
>>>
>>>
>>>
>>> > - self-cooling logarithmic counters: if ((fast_random() % (1 <<
>>> counter)) == 0) counter++;
>>>
>>> This option did not receive any attention, even though it has O(1) memory
>>> consumption and reduces contention.
>>> Note that there are several tunables here: first -- how to implement
>>> fast_rand: it can be a per-thread LCG, a per-thread counter, a set of
>>> counters, or something else; second -- the increment frequency, e.g. it
>>> can actually be additive and/or bounded (because if we reduce the
>>> frequency of increments by 1000x, that should be enough).
>>>
>>>
>>>
>>> > In Google GCC, we implemented another technique which proves to be
>>> very effective -- it is called FDO sampling.
>>> > Basically counters will be updated every N samples.
>>>
>>> How does it work?
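(Purely a guess at the general shape of "updated every N samples" -- the
actual Google GCC FDO-sampling implementation may well differ, and the
period and names below are mine:)

#include <cstdint>

enum { kSamplePeriod = 100 };
__thread int32_t countdown = kSamplePeriod;   // per-thread, so no contention
uint64_t counter[100000];

static inline void sampled_inc(int idx) {
  if (--countdown > 0)
    return;                        // fast path: thread-local decrement only
  countdown = kSamplePeriod;
  counter[idx] += kSamplePeriod;   // scale up to approximate the true count
}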
>>>
>>>
>>>
>>> >> It seems to me like we’re going to have a hard time getting good
>>> multithreaded performance without significant impact on the single-threaded
>>> behavior.
>>> > I don't really agree.
>>>
>>> Disagreement seconded.
>>> For the single-threaded case we are talking about constant overhead, while
>>> for the multithreaded case -- about superlinear overhead.
>>> Having said that, if there is a *simple* way to modify the final scheme
>>> to significantly reduce single-threaded overheads, then of course it's
>>> worth doing. But it's not the cornerstone.
>>>
>>>
>>>
>>> > MAX is a fixed cap so even on systems with 100s of cores we don't do
>>> something silly.
>>>
>>> Makes no sense to me.
>>> Why do you want only systems with up to 100 cores to be fast and
>>> scalable? 100+ core systems *especially* need good scalability (contention
>>> tends to be superlinear).
>>> What you are proposing is basically: if I have 10 engineers in my
>>> company, I probably want to give them 10 desks as well. But let's
>>> not go insane: if I have 1000 engineers, 100 desks must be enough for them.
>>> That must reduce costs.
>>> The baseline memory consumption of software (and the amount of RAM!) is
>>> O(NCORES), not O(1). In some read-mostly cases it's possible to achieve
>>> O(1) memory consumption, and that's great. But if that's not the case here,
>>> so be it.
>>>
>>>
>>>
>>> > shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS,
>>> NUMBER_OF_CORES))
>>>
>>> Threads do not produce contention, it's cores that produce contention.
>>> The formula must be:  shard_count = k*NCORES
>>> And if you want less memory in the single-threaded case, then: shard_count =
>>> min(k*NCORES, c*NTHREADS)
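A sketch of how such a shard count could be computed at runtime (the
constants k and c, and getting the core count from sysconf, are my choices
for illustration; the thread count would come from the runtime's own
bookkeeping, e.g. a counter bumped in the pthread_create wrapper):

#include <unistd.h>
#include <algorithm>

enum { kPerCore = 4, kPerThread = 1 };   // the k and c above, illustrative

int compute_shard_count(int nthreads) {
  int ncores = (int)sysconf(_SC_NPROCESSORS_ONLN);
  return std::min(kPerCore * ncores, kPerThread * std::max(nthreads, 1));
}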
>>>
>>>
>>>
>>> >We are talking about developers here. Nobody would know the exact
>>> thread counts, but developers know the ballpark number
>>>
>>> I strongly believe that we must relieve developers of this choice at
>>> build time, and do our best to auto-tune (if the final scheme
>>> requires tuning).
>>> First, such questions put an unnecessary burden on developers. We don't
>>> ask what register allocation algorithm to use for each function, right?
>>> Second, there is a significant chance we will get a wrong answer,
>>> because e.g. the developer's view of how threaded the app is can differ from
>>> reality or from our classification.
>>> Third, the app can be built by a build engineer; or the procedure can be
>>> applied to a code base with 10,000 apps; or the threaded-ness can change; or
>>> the app can work in different modes; or the app can have different phases.
>>>
>>>
>>>
>>> >Another danger involved with dynamically resizing the counter is that
>>> it requires a global or per-function lock to access the counters. The cost
>>> of this can be really high.
>>>
>>> It's possible to do it w/o a mutex; the only fast-path overhead is an
>>> atomic consume load, which is virtually free if implemented properly in
>>> the compiler:
>>>
>>> const int maxshard = 4096;
>>> uint64* shard[maxshard];   // shard[i] points to this shard's counter array
>>> atomic<int> shardmask;     // current shard count - 1 (count is a power of two)
>>>
>>> void inline inccounter(int idx)
>>> {
>>>   int shardidx = gettid() & atomic_load(&shardmask, memory_order_consume);
>>>   shard[shardidx][idx]++;
>>> }
>>>
>>> int pthread_create(...)
>>> {
>>>   if (updateshardcount()) {   // double-checked: racy test, then under the lock
>>>     shardlock();
>>>     if (updateshardcount()) {
>>>       int newcount = computeshardcount();        // must remain a power of two
>>>       for (int i = oldcount; i < newcount; i++)  // oldcount = previous count
>>>         shard[i] = malloc(ncounter*sizeof(uint64));
>>>       atomic_store(&shardmask, newcount-1, memory_order_release);
>>>     }
>>>     shardunlock();
>>>   }
>>>   ...
>>> }
>>>
>>>
>>>
>> If we go with this scheme, for tid I would do:
>>
>> __thread int threadid;
>>
>> int pthread_create(...)
>> {
>>     threadid = goohhash(gettid());
>>     ...
>> }
>>
>> void userfunc()
>> {
>>     int tid = threadid+1;
>>     threadid = tid;
>>
>
> Kostya noted that this increment is a bad idea, because each thread will
> eventually access all possible cache lines for each counter. I agree, this
> is a bad idea.
>
> What bothers me is that if we do tid%something, there is a non-zero chance
> that a significant number of active threads will be mapped onto the same
> shard. And then we are back to where we started -- significant contention.
>
> Ideally it would be something like:
>
> on_thread_rescheduling()
> {
>   tid = getcpuid();
> }
>
> but common OSes do not provide the necessary interfaces.
>

Note also: if we store the tid in a thread-local variable, we'll either have to
load it on every counter increment or keep it in a register,
which by itself will bring an additional ~10% slowdown on x86_64. This is why
taking bits from %sp or %fs sounds attractive (it has its own problems too).
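For illustration only, a sketch of the %sp variant -- deriving the shard
index from a few high bits of the stack pointer instead of a TLS load (the
shift amount and the assumption that thread stacks are spread well apart in
the address space are mine):

#include <cstdint>

static inline unsigned shard_from_sp(unsigned shardmask) {
  // Thread stacks are typically megabytes apart, so bits above the in-stack
  // offset differ between threads; no TLS access or syscall is needed.
  uintptr_t sp = (uintptr_t)__builtin_frame_address(0);
  return (unsigned)((sp >> 20) & shardmask);
}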


>
>
>
>
>>     // use tid in profiling
>>     ...
>> }
>>
>> This avoids the cost of calling a potentially expensive gettid() on every
>> increment, and avoids possible persistent interference between tids.
>>
>>
>