[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Bob Wilson bob.wilson at apple.com
Fri Apr 25 10:28:49 PDT 2014


On Apr 25, 2014, at 10:22 AM, Dmitry Vyukov <dvyukov at google.com> wrote:

> On Fri, Apr 25, 2014 at 8:44 PM, Bob Wilson <bob.wilson at apple.com> wrote:
>> 
>> On Apr 24, 2014, at 1:33 AM, Dmitry Vyukov <dvyukov at google.com> wrote:
>> 
>> 
>> I can see that the behavior of our current instrumentation is going to be a
>> problem for the kinds of applications that you’re looking at. If you can
>> find a way to get the overhead down without losing accuracy
>> 
>> 
>> What are your requirements for accuracy?
>> The current implementation does not provide 100% accuracy, so it's
>> something less than 100%. What is it?
>> What use cases for numeric counters (as opposed to a bool flag) do we
>> need to support? Is it only feedback-driven optimizations?
>> 
>> 
>> That’s a fair point. The current implementation is potentially inaccurate
>> because the counter increments are not thread-safe. In a low-contention
>> situation, that won’t matter much, but the counts could become quite
>> inaccurate if there are multiple threads running the same code at the same
>> time.
>> 
>> I don’t have a specific goal in mind right now for accuracy. We plan to use
>> this instrumentation for both PGO and code coverage. Some coverage users
>> only care about a boolean check but others want to see the actual execution
>> counts. If we display the count values and they are significantly different
>> from the real execution counts, that will lead to much confusion.
> 
> If I looked at a coverage report with exact numbers and observed two
> statements that must have the same hit count, but the report showed
> different hit counts, I would be highly confused.
> Go coverage profiler has 3 modes of operation: set, count and atomic:
> http://golang.org/cmd/go/#hdr-Description_of_testing_flags
> As you can guess, the last one (atomic) is intended for exactly such a
> case -- when the user looks at exact numbers.

I haven’t worked with Go, but that is exactly the kind of distinction that I had in mind when I suggested that we may want to support different kinds of instrumentation. It sounds like the “set” mode is pretty appealing for your target applications, and I won’t be at all surprised if we end up wanting something like “atomic” for some code coverage users. I’m guessing that the current -fprofile-instr-generate approach is similar to the “count” mode.
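
For illustration, here is roughly how the three update styles differ at the counter level. This is a minimal C++ sketch with made-up names; it is not the actual instrumentation emitted by clang or any other compiler:

  #include <atomic>
  #include <cstdint>

  uint64_t counter;                 // plain, non-atomic counter
  std::atomic<uint64_t> counter_a;  // atomic counter

  // "set": only record that the code ran; lost races don't matter.
  inline void hit_set()    { if (counter == 0) counter = 1; }

  // "count": plain increment; concurrent increments can be lost.
  inline void hit_count()  { counter++; }

  // "atomic": exact counts, at the cost of an atomic RMW per execution.
  inline void hit_atomic() { counter_a.fetch_add(1, std::memory_order_relaxed); }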

> 
> 
> 
>> and without
>> hurting the performance for applications without significant contention,
>> 
>> 
>> What is the acceptable threshold? 0.01% would be fine, right? What is
>> the maximum value that you are ready to agree with?
>> 
>> 
>> Like I said, I don’t have a specific value in mind. My sense is that most of
>> the applications we care about are quite different from the massively
>> parallel code that Google cares about. We may have many threads but they’re
>> all doing different things and contention is much less likely to be a
>> problem.
> 
> 
> Note that contention can arise even if threads are doing different
> things but use common components. E.g. if you include a custom malloc
> implementation in your program, then everything that allocates
> becomes a point of contention.

Yes, of course.

> 
> 
> 
>> I really think we need to see specifics before we can decide anything here.
>> 
>> 
>> 
>> then we can just adopt that. If you can’t do it without tradeoffs, then we
>> should have a separate option to let those who care switch between different
>> kinds of instrumentation.
>> 
>> Using the PC to map to the source code is simply not going to work with
>> -fprofile-instr-generate. The mapping from counter values to the
>> user-visible execution counts is complex and relies on detailed knowledge of
>> the clang ASTs.
>> 
>> 
>> 
>> % clang++ -O2 -lpthread coverage_mt_vec.cc && time ./a.out
>> TIME: real: 0.219; user: 0.430; system: 0.000
>> % clang++ -O2 -lpthread -fprofile-instr-generate coverage_mt_vec.cc && time
>> ./a.out
>> TIME: real: 3.743; user: 7.280; system: 0.000
>> 
>> % cat coverage_mt_vec.cc
>> #include <pthread.h>
>> #include <vector>
>> 
>> std::vector<int> v(1000);
>> 
>> __attribute__((noinline)) void foo() { v[0] = 42; }
>> 
>> void *Thread1(void *) {
>>   for (int i = 0; i < 100000000; i++)
>>     foo();
>>   return 0;
>> }
>> 
>> __attribute__((noinline)) void bar() { v[999] = 66; }
>> 
>> void *Thread2(void *) {
>>   for (int i = 0; i < 100000000; i++)
>>     bar();
>>   return 0;
>> }
>> 
>> int main() {
>>   static const int kNumThreads = 16;
>>   pthread_t t[kNumThreads];
>>   pthread_create(&t[0], 0, Thread1, 0);
>>   pthread_create(&t[1], 0, Thread2, 0);
>>   pthread_join(t[0], 0);
>>   pthread_join(t[1], 0);
>>   return 0;
>> }
>> 
>> 
>> 
>> 
>> On Fri, Apr 18, 2014 at 11:45 PM, Xinliang David Li <xinliangli at gmail.com>
>> wrote:
>> 
>> 
>> 
>> 
>> 
>> On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvyukov at google.com>
>> wrote:
>> 
>> 
>> Hi,
>> 
>> This is a long thread, so I will combine several comments into a single
>> email.
>> 
>> 
>> - 8-bit per-thread counters, dumping into central counters on
>> overflow.
>> 
>> The overflow will happen very quickly with an 8-bit counter.
>> 
>> 
>> Yes, but it reduces contention by 256x (a thread must execute at least
>> 256 loop iterations between increments). In practice, once you reduce
>> contention below some threshold, it stops being a problem.
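
For concreteness, a minimal sketch of the 8-bit-per-thread idea (the thread_local byte and the relaxed atomic spill are assumed details, not a proposed implementation):

  #include <atomic>
  #include <cstdint>

  std::atomic<uint64_t> g_counter;   // central counter shared by all threads
  thread_local uint8_t tl_counter;   // per-thread 8-bit counter for the same site

  inline void increment() {
    if (++tl_counter == 0)           // wrapped 255 -> 0: 256 increments happened
      g_counter.fetch_add(256, std::memory_order_relaxed);
  }

  // A real scheme would also flush the residual (< 256) per-thread value at
  // thread exit or at profile-dump time so the tail is not lost.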
>> 
>> 
>> 
>> - per-thread counters. Solves the problem at huge cost in RAM
>> per-thread
>> 
>> It is not practical, especially for TLS counters -- it creates huge
>> pressure on stack memory.
>> 
>> 
>> Do we have any numbers on the number of counters? If there are 100K 1-byte
>> counters, I would consider it practical.
>> 
>> 
>> A medium-sized app I looked at has about 10M counters (arcs only).  It is
>> also not uncommon to see such apps running with hundreds of threads.
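
Rough arithmetic to make that concrete (assuming 8-byte counters and, say, 200 threads): 10M counters x 8 bytes is about 80 MB for one copy, and roughly 16 GB if every thread gets its own replica -- which is why fully replicated per-thread counters do not scale for apps of that size.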
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> In Google GCC, we implemented another technique which proves to be very
>> effective -- it is called FDO sampling.
>> Basically counters will be updated every N samples.
>> 
>> 
>> How does it work?
>> 
>> 
>> 
>> Similar to how occurrence-based PMU sampling works. Setting the sampling
>> period to 100 can reduce the instrumentation overhead by close to 100x
>> without introducing much precision loss.
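
A rough sketch of what period-based counter sampling could look like (the per-thread countdown scheme below is illustrative, not the Google GCC implementation):

  #include <cstdint>

  enum { kSamplePeriod = 100 };
  thread_local uint32_t tl_countdown = kSamplePeriod;  // shared by all sites in a thread

  // Used in place of a plain "counter++" at each instrumented site.
  // Only 1 in kSamplePeriod executions writes to memory that other threads
  // may also touch; adding kSamplePeriod keeps the counter's expected value
  // equal to the true execution count.
  inline void sampled_increment(uint64_t &counter) {
    if (--tl_countdown == 0) {
      tl_countdown = kSamplePeriod;
      counter += kSamplePeriod;
    }
  }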
>> 
>> 
>> 
>> 
>> 
>> It seems to me like we’re going to have a hard time getting good
>> multithreaded performance without significant impact on the single-threaded
>> behavior.
>> 
>> I don't really agree.
>> 
>> 
>> 
>> We are talking about developers here. Nobody would know the exact thread
>> counts, but developers know the ballpark number.
>> 
>> 
>> I strongly believe that we must relieve developers of this choice at build
>> time, and do our best to auto-tune (if the final scheme requires
>> tuning).
>> 
>> 
>> 
>> 
>> That really depends.   If the tuning space is small, it won't be a problem
>> for the developer/builder.
>> 
>> 
>> 
>> First, such questions put an unnecessary burden on developers. We don't ask
>> which register allocation algorithm to use for each function, right?
>> 
>> 
>> 
>> Crazy developers can actually do that via internal options, but this is a
>> totally different case.  People just need one flag to turn sharding on/off.
>> When sharding is on, the compiler can pick the best 'N' according to some
>> heuristics at compile time.
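
For reference, a minimal sketch of what sharded counters might look like at runtime. The shard count and the hash-by-thread-id indexing are illustrative choices, not the heuristic a compiler would actually use:

  #include <atomic>
  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <thread>

  enum { kNumShards = 8 };  // the 'N' a compiler heuristic would pick

  struct ShardedCounter {
    // Pad each shard to its own cache line to avoid false sharing.
    struct alignas(64) Shard { std::atomic<uint64_t> value{0}; };
    Shard shards[kNumShards];

    void increment() {
      // Spread threads across shards; a real scheme might use a cheaper id.
      std::size_t idx =
          std::hash<std::thread::id>()(std::this_thread::get_id()) % kNumShards;
      shards[idx].value.fetch_add(1, std::memory_order_relaxed);
    }

    uint64_t read() const {            // summed once, at profile-dump time
      uint64_t sum = 0;
      for (const Shard &s : shards) sum += s.value.load(std::memory_order_relaxed);
      return sum;
    }
  };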
>> 
>> 
>> 
>> Second, there is a significant chance we will get a wrong answer, because
>> e.g. the developer's view of how threaded the app is can differ from reality
>> or from our classification.
>> Third, the app can be built by a build engineer; or the procedure can be
>> applied to a code base with 10,000 apps; or threaded-ness can change; or the
>> app can work in different modes; or the app can have different phases.
>> 
>> 
>> We have forgotten to mention the benefit of implementation simplicity.  If
>> a static/simple solution solves the problem for most of the use cases,
>> designing a fancy dynamic solution sounds like over-engineering to me.  It
>> (the overhead reduction technique) may also get in the way of easier
>> functional enhancements in the future.




