[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Dmitry Vyukov dvyukov at google.com
Fri Apr 18 02:27:04 PDT 2014


On Fri, Apr 18, 2014 at 1:04 PM, Chandler Carruth <chandlerc at google.com> wrote:

>
> On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvyukov at google.com> wrote:
>
>> > MAX is a fixed cap so even on systems with 100s of cores we don't do
>> something silly.
>>
>> Makes no sense to me.
>> Why do you want only systems with up to 100 cores to be fast and
>> scalable? 100+ core system *especially* need good scalability (contention
>> tends to be superlinear).
>>
>
> Please don't argue motives, and instead argue the technical merits. I do
> want systems with 100s of cores to be fast and scalable.
>
>
That's what I've read in the code above.
If there was a subsequent correction, sorry, this thread is long.




>  What you are proposing is basically: If I have 10 engineers in my
>> company, I probably want to give them 10 working desks as well. But let's
>> not go insane. If I have 1000 engineers, 100 desks must be enough for them.
>> This must reduce costs.
>> The baseline memory consumption for systems (and amount of RAM!) is
>> O(NCORES), not O(1). In some read-mostly cases it's possible to achieve
>> O(1) memory consumption, and that's great. But if it's not the case here,
>> let it be so.
>
>
> I think you are drastically overstating what I am suggesting. The bad
> analogy isn't helping.
>
> The only way we get contention is if we have the same function accessing
> the same counters within the function on multiple cores at the same time.
> It is entirely conceivable that programs which manage to do this for
> *every* core in a system with many hundreds of cores are rare. As a
> consequence, it might be a practical compromise to reduce the number of
> shards below the number of cores if the memory overhead is not providing
> commensurate performance. Clearly, measurements and such are needed here,
> but it is at least a tunable knob that we should know about and consider in
> our measurements.
>
>

I do not agree.
First, lots of programs determine the number of cores and use all of them.
Second, HPC applications all execute the same small computational kernel in
all threads. Server applications can spend most of their cycles in
decryption, decompression, or deserialization on all cores. Data-parallel
client applications for video/image/audio processing execute the same code
on all cores.
If we combine these, we get exactly the bad case: all cores executing the
same small piece of code.
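To make the contention argument concrete, here is a minimal standalone sketch
(not the actual -fprofile-instr-generate runtime; all names are made up for
illustration). It contrasts one shared counter, where every thread's
fetch_add bounces the same cache line between cores, with per-thread shards
padded to a cache line and merged only at the end:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

constexpr int kThreads = 4;
constexpr int kIncrementsPerThread = 100000;

// Contended: every increment from every thread hits the same cache line.
uint64_t count_shared() {
    std::atomic<uint64_t> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < kIncrementsPerThread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto &th : threads) th.join();
    return counter.load();
}

// Sharded: each thread owns its own counter, padded to a cache line
// to avoid false sharing; shards are summed once at the end.
struct alignas(64) Shard { std::atomic<uint64_t> v{0}; };

uint64_t count_sharded() {
    std::vector<Shard> shards(kThreads);
    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; ++t)
        threads.emplace_back([&, t] {
            for (int i = 0; i < kIncrementsPerThread; ++i)
                shards[t].v.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto &th : threads) th.join();
    uint64_t total = 0;
    for (auto &s : shards) total += s.v.load();
    return total;
}
```

Both variants count the same total; the difference is purely in cache-line
traffic while the threads run, which is where the superlinear contention
cost shows up as core counts grow.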





>
>>
>>
>> > shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS,
>> NUMBER_OF_CORES))
>>
>> Threads do not produce contention, it's cores that produce contention.
>>
>
> It is the *combination* of threads on cores which produce contention. The
> above 'max' should have been a 'min', sorry for that confusion. The point
> was to reduce the shards if *either* the number of cores is small or the
> number of threads is small.
>
> The formula must be:  shard_count = k*NCORES
>> And if you want less memory in single-threaded case, then: shard_count =
>> min(k*NCORES, c*NTHREADS)
>
>
> Which is what I intended to write.
>
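The formula the thread converges on, shard_count = min(k*NCORES, c*NTHREADS),
can be sketched as follows. The constants k and c are tuning knobs whose
values here are placeholders, not anything agreed on in this discussion:

```cpp
#include <algorithm>

// Sketch of the shard-count formula from this thread:
//   shard_count = min(k * NCORES, c * NTHREADS)
// k and c are illustrative defaults, not values from the thread.
unsigned shard_count(unsigned ncores, unsigned nthreads,
                     unsigned k = 4, unsigned c = 1) {
    unsigned n = std::min(k * ncores, c * nthreads);
    return std::max(n, 1u);  // always keep at least one shard
}
```

In a real runtime, ncores might come from std::thread::hardware_concurrency()
and nthreads from a live-thread count maintained by the profiling runtime;
the min keeps memory low for single-threaded processes while still scaling
shards with cores for heavily threaded ones.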