[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Dmitry Vyukov dvyukov at google.com
Fri Apr 25 10:22:27 PDT 2014


On Fri, Apr 25, 2014 at 8:44 PM, Bob Wilson <bob.wilson at apple.com> wrote:
>
> On Apr 24, 2014, at 1:33 AM, Dmitry Vyukov <dvyukov at google.com> wrote:
>
>
> I can see that the behavior of our current instrumentation is going to be a
> problem for the kinds of applications that you’re looking at. If you can
> find a way to get the overhead down without losing accuracy
>
>
> What are your requirements for accuracy?
> The current implementation does not provide 100% accuracy, so the
> requirement must be something less than 100%. What is it?
> What use cases for numeric counters (as opposed to a boolean flag) do
> we need to support? Is it only feedback-driven optimizations?
>
>
> That’s a fair point. The current implementation is potentially inaccurate
> because the counter increments are not thread-safe. In a low-contention
> situation, that won’t matter much, but the counts could become quite
> inaccurate if there are multiple threads running the same code at the same
> time.
>
> I don’t have a specific goal in mind right now for accuracy. We plan to use
> this instrumentation for both PGO and code coverage. Some coverage users
> only care about a boolean check but others want to see the actual execution
> counts. If we display the count values and they are significantly different
> from the real execution counts, that will lead to much confusion.

If I looked at a coverage report with exact numbers and saw two
statements that must have the same hit count, but the report showed
different hit counts for them, I would be highly confused.
The Go coverage profiler has 3 modes of operation: set, count and atomic:
http://golang.org/cmd/go/#hdr-Description_of_testing_flags
As you can guess, the last one (atomic) is intended for exactly this
case -- when the user looks at exact numbers.
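To make the modes concrete, here is a minimal sketch of what each mode
boils down to for a single counter (the names are illustrative; this is
not the actual Go or clang instrumentation):

#include <atomic>

unsigned char set_flag;                   // "set": has the block run at all?
unsigned long plain_count;                // "count": fast, but racy across threads
std::atomic<unsigned long> atomic_count;  // "atomic": exact, but contended

void on_block_entry() {
  set_flag = 1;                                          // mode "set"
  plain_count++;                                         // mode "count"
  atomic_count.fetch_add(1, std::memory_order_relaxed);  // mode "atomic"
}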



> and without
> hurting the performance for applications without significant contention,
>
>
> What is the acceptable threshold? 0.01% would be fine, right? What is
> the maximum value that you are ready to accept?
>
>
> Like I said, I don’t have a specific value in mind. My sense is that most of
> the applications we care about are quite different from the massively
> parallel code that Google cares about. We may have many threads but they’re
> all doing different things and contention is much less likely to be a
> problem.


Note that contention can arise even if threads are doing different
things but use common components. E.g. if you link a custom malloc
implementation into your program, the counters inside the allocator
are shared by all threads, and everything that allocates becomes a
point of contention.



> I really think we need to see specifics before we can decide anything here.
>
>
>
> then we can just adopt that. If you can’t do it without tradeoffs, then we
> should have a separate option to let those who care switch between different
> kinds of instrumentation.
>
> Using the PC to map to the source code is simply not going to work with
> -fprofile-instr-generate. The mapping from counter values to the
> user-visible execution counts is complex and relies on detailed knowledge of
> the clang ASTs.
>
>
>
> % clang++ -O2 -lpthread coverage_mt_vec.cc && time ./a.out
> TIME: real: 0.219; user: 0.430; system: 0.000
> % clang++ -O2 -lpthread -fprofile-instr-generate coverage_mt_vec.cc && time
> ./a.out
> TIME: real: 3.743; user: 7.280; system: 0.000
>
> % cat coverage_mt_vec.cc
> #include <pthread.h>
> #include <vector>
>
> std::vector<int> v(1000);
>
> // foo() and bar() write vector elements ~4000 bytes apart, so the
> // application itself shares no cache lines; the slowdown comes from
> // the profile counters that the instrumentation adds.
> __attribute__((noinline)) void foo() { v[0] = 42; }
>
> void *Thread1(void *) {
>  for (int i = 0; i < 100000000; i++)
>    foo();
>  return 0;
> }
>
> __attribute__((noinline)) void bar() { v[999] = 66; }
>
> void *Thread2(void *) {
>  for (int i = 0; i < 100000000; i++)
>    bar();
>  return 0;
> }
>
> int main() {
>  static const int kNumThreads = 2;
>  pthread_t t[kNumThreads];
>  pthread_create(&t[0], 0, Thread1, 0);
>  pthread_create(&t[1], 0, Thread2, 0);
>  pthread_join(t[0], 0);
>  pthread_join(t[1], 0);
>  return 0;
> }
>
>
>
>
> On Fri, Apr 18, 2014 at 11:45 PM, Xinliang David Li <xinliangli at gmail.com>
> wrote:
>
>
>
>
>
> On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvyukov at google.com>
> wrote:
>
>
> Hi,
>
> This is a long thread, so I will combine several comments into a
> single email.
>
>
> - 8-bit per-thread counters, dumping into central counters on
> overflow.
>
> The overflow will happen very quickly with an 8-bit counter.
>
>
> Yes, but it reduces contention by 256x (a thread must execute at least
> 256 local increments between two updates of the central counter). In
> practice, if you reduce contention below some threshold, it does not
> represent a problem anymore.
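
A minimal sketch of what I have in mind (the per-counter layout and the
names are illustrative, not a real implementation):

#include <atomic>
#include <cstdint>

// Shared 64-bit counter; touched only when the local byte wraps around.
std::atomic<uint64_t> central_counter;

// Per-thread 8-bit counter (a hypothetical per-counter TLS slot).
thread_local uint8_t local_counter;

void increment() {
  if (++local_counter == 0) {
    // The byte wrapped 255 -> 0: account for all 256 increments at once.
    central_counter.fetch_add(256, std::memory_order_relaxed);
  }
}
// At thread exit the residual value of local_counter must also be
// flushed into central_counter, or up to 255 increments are lost.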
>
>
>
> - per-thread counters. Solves the problem at huge cost in RAM
> per-thread
>
> It is not practical, especially for TLS counters -- they create huge
> pressure on stack memory.
>
>
> Do we have any numbers on the number of counters? If there are 100K
> 1-byte counters, I would consider that practical.
>
>
> A medium-sized app I looked at has about 10M counters (arcs only). It
> is also not uncommon to see such apps running with hundreds of threads.
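
For scale: 10M counters at 1 byte each is 10 MB per thread, so 100
threads would spend roughly 1 GB on per-thread counters alone; with
full 8-byte counters it is about 8 GB.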
>
>
>
>
>
>
>
> In Google GCC, we implemented another technique which has proven to be
> very effective -- it is called FDO sampling.
> Basically, counters will be updated only once every N events.
>
>
> How does it work?
>
>
>
> It is similar to how occurrence-based PMU sampling works. Setting the
> sampling period to 100 can reduce the instrumentation overhead by
> close to 100x without introducing much precision loss.
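
If I understand the scheme correctly, per counter it looks roughly like
this (a sketch under that assumption; the period and all names are
illustrative, not the actual GCC implementation):

#include <cstdint>

const int kSamplePeriod = 100;  // illustrative; the real period is tunable

// Events remaining until the next counter update, kept in cheap TLS.
thread_local int events_left = kSamplePeriod;

uint64_t counter;  // the real profile counter, touched once per period

void on_event() {
  if (--events_left > 0)
    return;                    // fast path: decrement a thread-local int
  events_left = kSamplePeriod;
  counter += kSamplePeriod;    // scale the update to keep the expected value
}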
>
>
>
>
>
> It seems to me like we’re going to have a hard time getting good
> multithreaded performance without significant impact on the single-threaded
> behavior.
>
> I don't really agree.
>
>
>
> We are talking about developers here. Nobody would know the exact
> thread counts, but developers know the ballpark number.
>
>
> I strongly believe that we must relieve developers of this choice at
> build time, and do our best to auto-tune (if the final scheme requires
> tuning).
>
>
>
>
> That really depends. If the tuning space is small, it won't be a
> problem for the developer/builder.
>
>
>
> First, such questions put an unnecessary burden on developers. We
> don't ask what register allocation algorithm to use for each function,
> right?
>
>
>
> Crazy developers can actually do that via internal options, but this
> is a totally different case. People just need one flag to turn
> sharding on/off. When sharding is on, the compiler can pick the best
> 'N' according to some heuristics at compile time.
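
For reference, a counter-sharding sketch (the shard count, the padding
and the thread-to-shard hash are assumptions for illustration, not an
existing implementation):

#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>

const int kShards = 8;  // the 'N' the compiler would pick heuristically

// N copies of one logical counter, padded so shards do not false-share.
struct alignas(64) Shard { std::atomic<uint64_t> value; };
Shard counter_shards[kShards];

void increment() {
  // Spread threads across shards; hashing the thread id is illustrative.
  auto idx = std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
  counter_shards[idx].value.fetch_add(1, std::memory_order_relaxed);
}

uint64_t read_total() {  // the profile dumper sums all shards
  uint64_t total = 0;
  for (const Shard &s : counter_shards)
    total += s.value.load(std::memory_order_relaxed);
  return total;
}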
>
>
>
> Second, there is a significant chance we will get a wrong answer,
> because, e.g., the developer's view of how threaded the app is can
> differ from reality or from our classification.
> Third, the app can be built by a build engineer; or the procedure can
> be applied to a code base with 10,000 apps; or the threaded-ness can
> change; or the app can work in different modes; or the app can have
> different phases.
>
>
> We have forgotten to mention the benefit of implementation simplicity.
> If the static/simple solution solves the problem for most of the use
> cases, designing a fancy dynamic solution sounds like over-engineering
> to me. It (the overhead reduction technique) may also get in the way of
> future functional enhancements.



