[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Fri Apr 18 02:04:59 PDT 2014

On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

> > MAX is a fixed cap so even on systems with 100s of cores we don't do
> > something silly.
>
> Makes no sense to me.
> Why do you want only systems with up to 100 cores to be fast and scalable?
> 100+ core systems *especially* need good scalability (contention tends to be
> superlinear).
>

Please don't argue motives, and instead argue the technical merits. I do
want systems with 100s of cores to be fast and scalable.

> What you are proposing is basically: If I have 10 engineers in my company,
> I probably want to give them 10 working desks as well. But let's not go
> insane. If I have 1000 engineers, 100 desks must be enough for them. This
> must reduce costs.
> The baseline memory consumption for systems (and amount of RAM!) is
> O(NCORES), not O(1). In some read-mostly cases it's possible to achieve
> O(1) memory consumption, and that's great. But if it's not the case here,
> let it be so.

I think you are drastically overstating what I am suggesting. The bad
analogy isn't helping.

The only way we get contention is if we have the same function accessing
the same counters within the function on multiple cores at the same time.
It is entirely conceivable that programs which manage to do this for
*every* core in a system with many hundreds of cores are rare. As a
consequence, it might be a practical compromise to reduce the number of
shards below the number of cores if the memory overhead is not providing
commensurate performance. Clearly, measurements and such are needed here,
but it is at least a tunable knob that we should know about and consider in
our measurements.
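
A minimal sketch of what a sharded counter with a tunable shard count could look like. Everything here is an assumption for illustration: the `ShardedCounter` name, the 64-byte cache line, and picking a shard by hashing the thread id are not taken from any actual patch in this thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <thread>

// One counter slot per shard, padded to an assumed 64-byte cache line so
// that neighbouring shards do not share a line (over-aligned array `new`
// honours this alignment in C++17 and later).
struct alignas(64) Shard {
    std::atomic<std::uint64_t> value{0};
};

// Hypothetical sharded counter; the shard count is the tunable knob.
class ShardedCounter {
public:
    explicit ShardedCounter(std::size_t shards)
        : n_(shards), shards_(new Shard[shards]) {
        assert(shards >= 1);
    }

    // Pick a shard from the calling thread's id (an illustrative choice;
    // a per-core index would be another option) and bump it.
    void increment() {
        std::size_t i =
            std::hash<std::thread::id>{}(std::this_thread::get_id()) % n_;
        shards_[i].value.fetch_add(1, std::memory_order_relaxed);
    }

    // Sum all shards; this is only needed when the profile is written out.
    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (std::size_t i = 0; i < n_; ++i)
            sum += shards_[i].value.load(std::memory_order_relaxed);
        return sum;
    }

private:
    std::size_t n_;
    std::unique_ptr<Shard[]> shards_;
};
```

With fewer shards than cores, two threads can still land on the same shard, so the memory saved is traded against residual contention. That trade-off is what the measurements would have to quantify.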

>
> > shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES))
>
> Threads do not produce contention, it's cores that produce contention.
>

It is the *combination* of threads on cores which produces contention. The
above 'max' should have been a 'min', sorry for that confusion. The point
was to reduce the shards if *either* the number of cores is small or the
number of threads is small.

> The formula must be:  shard_count = k*NCORES
> And if you want less memory in single-threaded case, then: shard_count =
> min(k*NCORES, c*NTHREADS)

Which is what I intended to write.
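
A rough sketch of that formula combined with the fixed MAX cap from earlier in the thread. The function name and the parameters `k`, `c`, and `max_shards` are placeholders for tuning constants that nobody has measured yet, and the floor of one shard is an added assumption so the result is always usable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: shard_count = min(MAX, k*NCORES, c*NTHREADS),
// with a floor of one shard. k, c and max_shards are illustrative knobs.
std::size_t shard_count(std::size_t ncores, std::size_t nthreads,
                        std::size_t k, std::size_t c,
                        std::size_t max_shards) {
    assert(max_shards >= 1);
    std::size_t by_cores = k * ncores;      // scales with available cores
    std::size_t by_threads = c * nthreads;  // shrinks for few-thread programs
    std::size_t n = std::min({max_shards, by_cores, by_threads});
    return std::max<std::size_t>(n, 1);     // never return zero shards
}
```

For example, with k = c = 1 and a cap of 64, a single-threaded program on an 8-core machine gets 1 shard, while 32 threads on the same machine get 8.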