[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Xinliang David Li xinliangli at gmail.com
Fri Apr 18 12:45:22 PDT 2014


On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

> Hi,
>
> This is a long thread, so I will combine several comments into a single email.
>
>
> >> - 8-bit per-thread counters, dumping into central counters on overflow.
> >The overflow will happen very quickly with 8bit counter.
>
> Yes, but it reduces contention by 256x (a thread must execute at least 256
> loop iterations between increments of the shared counter). In practice, once
> you reduce contention below some threshold, it stops being a problem.
>
>
>
> >> - per-thread counters. Solves the problem at huge cost in RAM per-thread
> >It is not practical, especially for TLS counters -- it creates huge
> >pressure on stack memory.
>
> Do we have any numbers on the number of counters? If there are 100K 1-byte
> counters, I would consider it practical.
>
>
A medium-sized app I looked at has about 10M counters (arcs only).  It is
also not uncommon to see such apps running with hundreds of threads.



>
>
>
>
> > In Google GCC, we implemented another technique which proves to be very
> effective -- it is called FDO sampling.
> > Basically counters will be updated every N samples.
>
> How does it work?
>

Similar to how occurrence-based PMU sampling works. Setting the sampling
period to 100 can reduce the instrumentation overhead by close to 100x
without introducing much precision loss.


>
>
>
> >> It seems to me like we’re going to have a hard time getting good
> multithreaded performance without significant impact on the single-threaded
> behavior.
> > I don't really agree.
>
>
> >We are talking about developers here. Nobody would know the exact thread
> counts, but developers know the ballpark number
>
> I strongly believe that we must relieve developers of this choice at build
> time, and do our best to auto-tune (if the final scheme requires tuning).
>


That really depends.  If the tuning space is small, it won't be a problem
for the developer/builder.



> First, such questions put an unnecessary burden on developers. We don't
> ask which register allocation algorithm to use for each function, right?
>

Crazy developers can actually do that via internal options, but this is a
totally different case.  People just need one flag to turn sharding on or
off. When sharding is on, the compiler can pick the best 'N' according to
some heuristics at compile time.



> Second, there is a significant chance we will get the wrong answer, because
> e.g. the developer's view of how threaded the app is can differ from reality
> or from our classification.
> Third, the app can be built by a build engineer; or the procedure can be
> applied to a code base with 10,000 apps; or the threaded-ness can change; or
> the app can work in different modes; or the app can have different phases.
>
>
We have also not yet mentioned the benefit of implementation simplicity.  If
the static/simple solution solves the problem for most use cases, designing
a fancy dynamic solution sounds like over-engineering to me.  It (the
overhead-reduction technique) may also get in the way of easier functional
enhancements in the future.

 David

