[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Dmitry Vyukov dvyukov at google.com
Fri Apr 18 00:13:13 PDT 2014


Hi,

This is a long thread, so I will combine several comments into a single email.

>> - 8-bit per-thread counters, dumping into central counters on overflow.
>The overflow will happen very quickly with 8bit counter.

Yes, but it reduces contention by 256x (a thread must execute at least 256
loop iterations between increments of the central counter). In practice, once
you push contention below some threshold, it stops being a problem.
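
To make this concrete, here is a rough sketch of what I mean (the array names
and the runtime interface are placeholders, not an existing API):

#include <atomic>
#include <cstdint>

extern std::atomic<uint64_t> central_counters[];  // one 64-bit counter per site
extern thread_local uint8_t local_counters[];     // one 8-bit counter per site

inline void inc_counter(int idx)
{
    // Common case: a plain, non-atomic increment of a thread-local byte.
    if (++local_counters[idx] == 0) {
        // The byte wrapped 255 -> 0: flush 256 accumulated increments to the
        // shared counter. Shared memory is touched at most once per 256
        // increments, which is where the 256x contention reduction comes from.
        central_counters[idx].fetch_add(256, std::memory_order_relaxed);
    }
}

At thread exit the residual values in local_counters would have to be flushed
as well, otherwise up to 255 increments per site are lost.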


>> - per-thread counters. Solves the problem at huge cost in RAM per-thread
>It is not practical. Especially for TLS counters -- it creates huge
pressure on stack memory.

Do we have any numbers on the typical number of counters? If it is on the
order of 100K 1-byte counters, I would consider it practical.


> - self-cooling logarithmic counters: if ((fast_random() % (1 << counter))
== 0) counter++;

This option did not receive any attention, even though it has O(1) memory
consumption and reduces contention.
Note that there are several tunables here: first, how to implement
fast_random -- it can be a per-thread LCG, a per-thread counter, a set of
counters, or something else; second, the frequency of increments -- e.g. it
can actually be additive and/or bounded (because if we reduce the frequency of
increments by 1000x, that must already be enough).
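
To illustrate, a rough sketch of the bounded variant, assuming a per-thread
xorshift generator for fast_random and a cap of 2^10 (both choices are mine,
just for illustration):

#include <cstdint>

extern uint8_t counters[];  // one shared logarithmic counter per site

static thread_local uint32_t rng_state = 0x9e3779b9u;  // in practice seeded per thread

static inline uint32_t fast_random()
{
    // Per-thread xorshift: a few ALU ops, no shared-memory traffic.
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng_state = x;
    return x;
}

inline void inc_counter(int idx)
{
    uint8_t c = counters[idx];
    // Increment with probability 2^-min(c,10): the counter grows roughly
    // logarithmically, and shared writes happen at most once per ~1000
    // increments once the cap is reached. Racy, lossy updates are
    // acceptable for profile data.
    uint32_t shift = c < 10 ? c : 10;
    if (fast_random() % (1u << shift) == 0)
        counters[idx] = c + 1;
}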


> In Google GCC, we implemented another technique which proves to be very
effective -- it is called FDO sampling.
> Basically counters will be updated every N samples.

How does it work?


>> It seems to me like we’re going to have a hard time getting good
multithreaded performance without significant impact on the single-threaded
behavior.
> I don't really agree.

Disagreement seconded.
For the single-threaded case we are talking about constant overhead, while for
the multithreaded case -- about superlinear overhead.
Having said that, if there is a *simple* way to modify the final scheme to
significantly reduce single-threaded overheads, then of course it's worth
doing. But it's not the cornerstone.


> MAX is a fixed cap so even on systems with 100s of cores we don't do
something silly.

Makes no sense to me.
Why do you want only systems with up to 100 cores to be fast and scalable?
100+ core systems *especially* need good scalability (contention tends to be
superlinear).
What you are proposing is basically: if I have 10 engineers in my company,
I probably want to give them 10 desks as well. But let's not go insane -- if
I have 1000 engineers, 100 desks must be enough for them. That will surely
reduce costs.
The baseline memory consumption for systems (and amount of RAM!) is
O(NCORES), not O(1). In some read-mostly cases it's possible to achieve
O(1) memory consumption, and that's great. But if it's not the case here,
let it be so.


> shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES))

Threads do not produce contention; it's cores that produce contention.
The formula must be:  shard_count = k*NCORES
And if you want less memory in the single-threaded case, then: shard_count =
min(k*NCORES, c*NTHREADS)
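
In code this could look roughly as follows (the constants k and c and the
idea of bumping the thread count from a pthread_create interceptor are just
placeholders for illustration):

#include <algorithm>
#include <atomic>
#include <thread>

static const int k = 4;  // shards per core, tunable
static const int c = 1;  // shards per thread, tunable
static std::atomic<int> nthreads{1};

// Called whenever a new thread is created (e.g. from a pthread_create
// interceptor) to decide how many counter shards we want right now.
int compute_shard_count()
{
    int ncores = static_cast<int>(std::thread::hardware_concurrency());
    int nthr = nthreads.fetch_add(1, std::memory_order_relaxed) + 1;
    // Scale with cores (that is what produces contention), but cap by the
    // thread count so a single-threaded program pays for only one shard.
    return std::min(k * ncores, c * nthr);
}

If shards are indexed with a mask (as in the sketch at the end of this mail),
the result would additionally be rounded up to a power of two.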


>We are talking about developers here. Nobody would know the exact thread
counts, but developers know the ballpark number

I strongly believe that we must relieve developers of this choice at build
time, and do our best to auto-tune (if the final scheme requires tuning).
First, such questions put an unnecessary burden on developers. We don't ask
which register allocation algorithm to use for each function, right?
Second, there is a significant chance we will get the wrong answer, because
e.g. the developer's view of how threaded the app is can differ from reality
or from our classification.
Third, the app can be built by a build engineer; or the procedure can be
applied to a codebase with 10'000 apps; or the threaded-ness can change; or
the app can work in different modes; or the app can have different phases.


>Another danger involved with dynamically resizing the counter is that it
requires a global or per function lock to access the counters. The cost of
this can be really high.

It's possible to do it w/o a mutex on the fast path; the only fast-path
overhead is an atomic consume load, which is virtually free if implemented
properly by the compiler:

#include <atomic>
#include <cstdint>

const int maxshard = 4096;       // hard upper bound on the number of shards
uint64_t* shard[maxshard];       // shard[i] points to an array of ncounter counters
std::atomic<int> shardmask;      // current shard count minus 1; counts are powers of 2
// shard[0] is assumed to be allocated at startup, with shardmask == 0.

inline void inccounter(int idx)
{
    // Fast path: one consume load of the mask plus a plain increment.
    int shardidx = gettid() & shardmask.load(std::memory_order_consume);
    shard[shardidx][idx]++;
}

int pthread_create(...)
{
    if (updateshardcount()) {
        shardlock();
        if (updateshardcount()) {  // re-check under the lock
            int oldcount = shardmask.load(std::memory_order_relaxed) + 1;
            int newcount = computeshardcount();  // must be a power of 2 <= maxshard
            for (int i = oldcount; i < newcount; i++)
                shard[i] = (uint64_t*)calloc(ncounter, sizeof(uint64_t));
            // Publish the new mask only after the new shards are allocated and
            // zeroed, so a thread that consumes the new mask sees valid pointers.
            shardmask.store(newcount - 1, std::memory_order_release);
        }
        shardunlock();
    }
    ...
}