[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Dmitry Vyukov dvyukov at google.com
Fri Apr 18 00:32:01 PDT 2014


On Fri, Apr 18, 2014 at 11:13 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

> Hi,
>
> This is a long thread, so I will combine several comments into a single email.
>
>
> >> - 8-bit per-thread counters, dumping into central counters on overflow.
> >The overflow will happen very quickly with 8bit counter.
>
> Yes, but it reduces contention by 256x (a thread must execute at least 256
> loop iterations between updates of a shared counter). In practice, once you
> reduce contention below some threshold, it stops being a problem.
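>
> A minimal sketch of this scheme, assuming GCC-style atomic builtins and a
> hypothetical NCOUNTERS (all names are illustrative, not a proposed
> interface):
>
> #include <stdint.h>
>
> enum { NCOUNTERS = 100000 };
>
> uint64_t central[NCOUNTERS];        // shared; touched once per 256 events
> __thread uint8_t local[NCOUNTERS];  // 1 byte per counter per thread
>
> void inline inccounter8(int idx)
> {
>     // uint8_t wraps to 0 after 256 increments; dump the batch then.
>     // (A thread-exit hook would also need to flush the residual values.)
>     if (++local[idx] == 0)
>         __atomic_fetch_add(&central[idx], 256, __ATOMIC_RELAXED);
> }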
>
>
>
> >> - per-thread counters. Solves the problem at huge cost in RAM per-thread
> >It is not practical. Especially for TLS counters -- it creates huge
> pressure on stack memory.
>
> Do we have any numbers on the number of counters? If there are 100K 1-byte
> counters (i.e. ~100KB of TLS per thread), I would consider it practical.
>
>
>
> > - self-cooling logarithmic counters: if ((fast_random() % (1 <<
> counter)) == 0) counter++;
>
> This option did not receive any attention, even though it has O(1) memory
> consumption and reduces contention.
> Note that there are several tunables here: first, how to implement
> fast_random: it can be a per-thread LCG, a per-thread counter or set of
> counters, or something else; second, the frequency of increments, which can
> actually be additive and/or bounded (because if we reduce the frequency of
> increments by 1000x, that must already be enough).
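>
> For concreteness, a sketch with a per-thread LCG as fast_random (the LCG
> constants are Knuth's MMIX ones; everything else is an illustrative choice,
> not a concrete proposal):
>
> #include <stdint.h>
>
> __thread uint64_t rngstate;  // seeded per thread, e.g. from the tid
>
> static inline uint32_t fast_random(void)
> {
>     rngstate = rngstate * 6364136223846793005ull + 1442695040888963407ull;
>     return (uint32_t)(rngstate >> 32);  // high bits are the most random
> }
>
> void inline inccounterlog(uint8_t *counter)
> {
>     // Increment with probability 2^-counter; reaching value c takes
>     // 2^c - 1 events in expectation, so the profile reader would decode
>     // the estimate as 2^counter - 1. Cap so the shift stays defined.
>     if (*counter < 32 && fast_random() % (1u << *counter) == 0)
>         (*counter)++;
> }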
>
>
>
> > In Google GCC, we implemented another technique which proves to be very
> effective -- it is called FDO sampling.
> > Basically counters will be updated every N samples.
>
> How does it work?
>
>
>
> >> It seems to me like we’re going to have a hard time getting good
> multithreaded performance without significant impact on the single-threaded
> behavior.
> > I don't really agree.
>
> Disagreement seconded.
> For the single-threaded case we are talking about constant overhead, while
> for the multithreaded case -- about superlinear overhead.
> Having said that, if there is a *simple* way to modify the final scheme to
> significantly reduce single-threaded overhead, then of course it's worth
> doing. But it's not the cornerstone.
>
>
>
> > MAX is a fixed cap so even on systems with 100s of cores we don't do
> something silly.
>
> Makes no sense to me.
> Why do you want only systems with up to 100 cores to be fast and scalable?
> 100+ core systems *especially* need good scalability (contention tends to
> be superlinear).
> What you are proposing is basically: if I have 10 engineers in my company,
> I probably want to give them 10 desks as well. But let's not go insane: if
> I have 1000 engineers, 100 desks must be enough for them. That ought to
> reduce costs.
> The baseline memory consumption of a system (and its amount of RAM!) is
> O(NCORES), not O(1). In some read-mostly cases it's possible to achieve
> O(1) memory consumption, and that's great. But if that's not the case here,
> so be it.
>
>
>
> > shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES))
>
> Threads do not produce contention; it's cores that produce contention.
> The formula must be:  shard_count = k*NCORES
> And if you want less memory in the single-threaded case, then: shard_count =
> min(k*NCORES, c*NTHREADS)
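>
> That is, something along these lines (k and c are tunables; the
> current_thread_count() helper is hypothetical, standing in for whatever
> thread accounting the runtime keeps):
>
> #include <unistd.h>
>
> int computeshardcount(void)
> {
>     const int k = 4, c = 1;  // tunables, to be picked empirically
>     int count = k * (int)sysconf(_SC_NPROCESSORS_ONLN);
>     int threads = c * current_thread_count();  // hypothetical helper
>     if (count > threads)
>         count = threads;
>     // Round up to a power of two so count-1 can serve as an index mask.
>     int pow2 = 1;
>     while (pow2 < count)
>         pow2 *= 2;
>     return pow2;
> }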
>
>
>
> >We are talking about developers here. Nobody would know the exact thread
> counts, but developers know the ballpark number
>
> I strongly believe that we must relieve developers of this choice at build
> time, and do our best to auto-tune (if the final scheme requires tuning).
> First, such questions put an unnecessary burden on developers. We don't ask
> which register allocation algorithm to use for each function, right?
> Second, there is a significant chance we will get the wrong answer, because
> e.g. the developer's view of how threaded the app is can differ from
> reality or from our classification.
> Third, the app can be built by a build engineer; or the procedure can be
> applied to a code base with 10,000 apps; or the threaded-ness can change;
> or the app can work in different modes; or the app can have different
> phases.
>
>
>
> >Another danger involved with dynamically resizing the counter is that it
> requires a global or per function lock to access the counters. The cost of
> this can be really high.
>
> It's possible to do it without a mutex; the fast-path overhead is a single
> atomic consume load, which is virtually zero (if implemented properly in
> the compiler):
>
> #include <stdatomic.h>
> #include <stdint.h>
> #include <stdlib.h>
>
> enum { maxshard = 4096 };
> uint64_t *shard[maxshard];  // shard[i]: array of ncounter counters;
>                             // shard[0] is allocated at startup
> atomic_int shardmask;       // shard count is a power of two, mask = count-1
>
> void inline inccounter(int idx)
> {
>     // Fast path: no lock, no fence; the consume load orders the dependent
>     // shard access after the release store below.
>     int shardidx = gettid() & atomic_load_explicit(&shardmask,
>                                                    memory_order_consume);
>     shard[shardidx][idx]++;
> }
>
> int pthread_create(...)
> {
>     // Double-checked locking; updateshardcount() returns true when the
>     // shard count needs to grow.
>     if (updateshardcount()) {
>         shardlock();
>         if (updateshardcount()) {
>             int oldcount = atomic_load(&shardmask) + 1;
>             int newcount = computeshardcount();
>             for (int i = oldcount; i < newcount; i++)
>                 shard[i] = calloc(ncounter, sizeof(uint64_t));  // zeroed
>             atomic_store_explicit(&shardmask, newcount - 1,
>                                   memory_order_release);
>         }
>         shardunlock();
>     }
>     ...
> }
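>
> The release store pairs with the consume load on the fast path: shardidx
> is computed from the loaded mask, so the access to shard[shardidx] carries
> a data dependency on it and is ordered after the initialization of the new
> shards without any fence in inccounter.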
>
>
>
If we go with this scheme, for tid I would do:

__thread int threadid;

int pthread_create(...)
{
    threadid = goodhash(gettid());  // hash once to spread initial tids
    ...
}

void userfunc()
{
    // Step the value on each use so two threads that collide on a shard
    // do not stay on the same shard forever.
    int tid = threadid + 1;
    threadid = tid;
    // use tid in profiling
    ...
}

This avoids a potentially expensive gettid() call on the profiling fast
path, and avoids possible persistent interference between tids.