[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Dmitry Vyukov dvyukov at google.com
Fri Apr 25 10:39:16 PDT 2014


On Fri, Apr 25, 2014 at 9:28 PM, Bob Wilson <bob.wilson at apple.com> wrote:
>>>
>>> I can see that the behavior of our current instrumentation is going to be a
>>> problem for the kinds of applications that you’re looking at. If you can
>>> find a way to get the overhead down without losing accuracy
>>>
>>>
>>> What are your requirements for accuracy?
>>> The current implementation does not provide 100% accuracy, so it's
>>> something less than 100%. What is it?
>>> What use cases for numeric counters (as opposed to a bool flag) do we
>>> need to support? Is it only feedback-driven optimizations?
>>>
>>>
>>> That’s a fair point. The current implementation is potentially inaccurate
>>> because the counter increments are not thread-safe. In a low-contention
>>> situation, that won’t matter much, but the counts could become quite
>>> inaccurate if there are multiple threads running the same code at the same
>>> time.
>>>
>>> I don’t have a specific goal in mind right now for accuracy. We plan to use
>>> this instrumentation for both PGO and code coverage. Some coverage users
>>> only care about a boolean check but others want to see the actual execution
>>> counts. If we display the count values and they are significantly different
>>> from the real execution counts, that will lead to much confusion.
>>
>> If I looked at a coverage report with exact numbers and saw 2
>> statements that must have the same hit count showing different hit
>> counts in the report, I would be highly confused.
>> The Go coverage profiler has 3 modes of operation: set, count and atomic:
>> http://golang.org/cmd/go/#hdr-Description_of_testing_flags
>> As you can guess, the last one (atomic) is intended for exactly this
>> case -- when the user looks at exact numbers.
>
> I haven’t worked with Go, but that is exactly the kind of distinction that I had in mind when I suggested that we may want to support different kinds of instrumentation. It sounds like the “set” mode is pretty appealing for your target applications, and I won’t be at all surprised if we end up wanting something like “atomic” for some code coverage users. I’m guessing that the current -fprofile-instr-generate approach is similar to the “count” mode.

I think you are right.
Our compiler folks will correct me if I am wrong, but I would expect
that we also want something like a "scalable-count" mode for FDO. The
14x slowdown that we've observed on a server app would effectively
break most of our apps due to pervasive timeouts. This mode must
provide precision good enough for FDO while having minimal performance
impact on highly multi-threaded apps.
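
For reference, here is a minimal hand-written sketch of what the three
per-counter update styles boil down to (purely illustrative, not what
the compiler actually emits; all names are invented):

  #include <atomic>
  #include <cstdint>

  uint8_t  seen_flag;                    // the slot in "set" mode
  uint64_t plain_counter;                // the slot in "count" mode
  std::atomic<uint64_t> atomic_counter;  // the slot in "atomic" mode

  // "set": only records whether the region was ever executed.
  void update_set()    { seen_flag = 1; }

  // "count": plain non-atomic increment; cheap, but increments can be
  // lost when several threads hit the same counter concurrently.
  void update_count()  { plain_counter++; }

  // "atomic": exact counts even under contention, but every execution
  // pays for an atomic RMW on a potentially hot cache line.
  void update_atomic() {
    atomic_counter.fetch_add(1, std::memory_order_relaxed);
  }

A "scalable-count" mode would sit somewhere between "count" and
"atomic": precise enough for FDO, without the hot atomic RMW.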



>>> and without
>>> hurting the performance for applications without significant contention,
>>>
>>>
>>> What is the acceptable threshold? 0.01% would be fine, right? What is
>>> the maximum value you are prepared to accept?
>>>
>>>
>>> Like I said, I don’t have a specific value in mind. My sense is that most of
>>> the applications we care about are quite different from the massively
>>> parallel code that Google cares about. We may have many threads but they’re
>>> all doing different things and contention is much less likely to be a
>>> problem.
>>
>>
>> Note that contention can arise even if threads are doing different
>> things but use common components. E.g. if you link a custom malloc
>> implementation into your program, then everything that allocates
>> becomes a point of contention.
>
> Yes, of course.
>
>>
>>
>>
>>> I really think we need to see specifics before we can decide anything here.
>>>
>>>
>>>
>>> then we can just adopt that. If you can’t do it without tradeoffs, then we
>>> should have a separate option to let those who care switch between different
>>> kinds of instrumentation.
>>>
>>> Using the PC to map to the source code is simply not going to work with
>>> -fprofile-instr-generate. The mapping from counter values to the
>>> user-visible execution counts is complex and relies on detailed knowledge of
>>> the clang ASTs.
>>>
>>>
>>>
>>> % clang++ -O2 -lpthread coverage_mt_vec.cc && time ./a.out
>>> TIME: real: 0.219; user: 0.430; system: 0.000
>>> % clang++ -O2 -lpthread -fprofile-instr-generate coverage_mt_vec.cc && time
>>> ./a.out
>>> TIME: real: 3.743; user: 7.280; system: 0.000
>>>
>>> % cat coverage_mt_vec.cc
>>> #include <pthread.h>
>>> #include <vector>
>>>
>>> std::vector<int> v(1000);
>>>
>>> __attribute__((noinline)) void foo() { v[0] = 42; }
>>>
>>> void *Thread1(void *) {
>>>   for (int i = 0; i < 100000000; i++)
>>>     foo();
>>>   return 0;
>>> }
>>>
>>> __attribute__((noinline)) void bar() { v[999] = 66; }
>>>
>>> void *Thread2(void *) {
>>>   for (int i = 0; i < 100000000; i++)
>>>     bar();
>>>   return 0;
>>> }
>>>
>>> int main() {
>>>   static const int kNumThreads = 16;
>>>   pthread_t t[kNumThreads];
>>>   pthread_create(&t[0], 0, Thread1, 0);
>>>   pthread_create(&t[1], 0, Thread2, 0);
>>>   pthread_join(t[0], 0);
>>>   pthread_join(t[1], 0);
>>>   return 0;
>>> }
>>>
>>>
>>>
>>>
>>> On Fri, Apr 18, 2014 at 11:45 PM, Xinliang David Li <xinliangli at gmail.com>
>>> wrote:
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvyukov at google.com>
>>> wrote:
>>>
>>>
>>> Hi,
>>>
>>> This is a long thread, so I will combine several comments into a
>>> single email.
>>>
>>>
>>> - 8-bit per-thread counters, dumping into central counters on
>>> overflow.
>>>
>>> The overflow will happen very quickly with an 8-bit counter.
>>>
>>>
>>> Yes, but it reduces contention by 256x (a thread must execute at least
>>> 256 local increments between touches of the central counter). In
>>> practice, once you reduce contention below some threshold, it stops
>>> being a problem.
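>>>
>>> A minimal sketch of the idea (illustrative only; in reality the byte
>>> would live in a per-function counter array, and these names are
>>> invented):
>>>
>>> #include <atomic>
>>> #include <cstdint>
>>>
>>> std::atomic<uint64_t> central_counter;  // shared, touched rarely
>>> static __thread uint8_t local_counter;  // 8-bit per-thread counter
>>>
>>> inline void count_edge() {
>>>   if (++local_counter == 0)             // wrapped: 256 increments elapsed
>>>     central_counter.fetch_add(256, std::memory_order_relaxed);
>>> }
>>>
>>> (A thread-exit hook would also need to flush whatever is left in
>>> local_counter, otherwise up to 255 counts per thread are lost.)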
>>>
>>>
>>>
>>> - per-thread counters. Solves the problem at huge cost in RAM
>>> per-thread
>>>
>>> It is not practical, especially for TLS counters -- it creates huge
>>> pressure on per-thread memory.
>>>
>>>
>>> Do we have any numbers on the number of counters? If there are 100K 1-byte
>>> counters, I would consider it practical.
>>>
>>>
>>> A medium-sized app I looked at has about 10M counters (arcs only). It is
>>> also not uncommon to see such apps running with hundreds of threads.
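>>>
>>> (Back-of-the-envelope, assuming 200 threads as a stand-in for
>>> "hundreds": 10M 1-byte counters x 200 threads is already ~2 GB just
>>> for counters, and ~16 GB if each per-thread counter is 8 bytes wide.)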
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> In Google GCC, we implemented another technique which has proved to be
>>> very effective -- it is called FDO sampling.
>>> Basically, counters are updated only on every Nth execution.
>>>
>>>
>>> How does it work?
>>>
>>>
>>>
>>> Similar to how occurrence-based PMU sampling works. Setting the sampling
>>> period to 100 can reduce the instrumentation overhead by close to 100x
>>> without introducing much precision loss.
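>>>
>>> Roughly (a simplified sketch, not the actual implementation; the names
>>> and the hard-coded period are for illustration only):
>>>
>>> #include <cstdint>
>>>
>>> static const uint32_t kSamplePeriod = 100;
>>> static __thread uint32_t countdown = kSamplePeriod;
>>> uint64_t edge_counter;                  // the instrumented counter
>>>
>>> inline void count_edge_sampled() {
>>>   if (--countdown == 0) {               // taken once per 100 executions
>>>     edge_counter += kSamplePeriod;      // scale back up by the period
>>>     countdown = kSamplePeriod;
>>>   }
>>> }
>>>
>>> The common path is just a thread-local decrement and a predictable
>>> branch, so the shared counter is only touched once per period.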
>>>
>>>
>>>
>>>
>>>
>>> It seems to me like we’re going to have a hard time getting good
>>> multithreaded performance without significant impact on the single-threaded
>>> behavior.
>>>
>>> I don't really agree.
>>>
>>>
>>>
>>> We are talking about developers here. Nobody would know the exact thread
>>> counts, but developers know the ballpark number.
>>>
>>>
>>> I strongly believe that we must relieve developers of this choice at
>>> build time, and do our best to auto-tune (if the final scheme requires
>>> tuning).
>>>
>>>
>>>
>>>
>>> That really depends. If the tuning space is small, it won't be a problem
>>> for the developer/builder.
>>>
>>>
>>>
>>> First, such questions put an unnecessary burden on developers. We don't ask
>>> what register allocation algorithm to use for each function, right?
>>>
>>>
>>>
>>> Crazy developers can actually do that via internal options, but this is a
>>> totally different case. People just need one flag to turn sharding on/off.
>>> When sharding is on, the compiler can pick the best 'N' according to some
>>> heuristics at compile time.
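>>>
>>> To make this concrete, a rough sketch of what sharded counters could
>>> expand to (kShards, the shard-selection scheme and all names here are
>>> invented for illustration; the real heuristics would pick 'N'):
>>>
>>> #include <cstdint>
>>>
>>> static const unsigned kShards = 8;      // 'N', chosen by heuristics
>>>
>>> // Each logical counter becomes kShards slots, one cache line apart.
>>> struct alignas(64) Slot { uint64_t v; };
>>> Slot edge_counter[kShards];
>>>
>>> inline void count_edge_sharded() {
>>>   // Spread threads across shards, e.g. by hashing a TLS address.
>>>   static __thread char tls_tag;
>>>   unsigned idx = ((uintptr_t)&tls_tag >> 6) % kShards;
>>>   edge_counter[idx].v++;                // still a plain, non-atomic add
>>> }
>>>
>>> // At profile dump time the shards are summed into one value.
>>> uint64_t read_edge_counter() {
>>>   uint64_t sum = 0;
>>>   for (unsigned i = 0; i < kShards; i++) sum += edge_counter[i].v;
>>>   return sum;
>>> }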
>>>
>>>
>>>
>>> Second, there is a significant chance we will get a wrong answer, because
>>> e.g. the developer's view of how threaded the app is can differ from reality
>>> or from our classification.
>>> Third, the app can be built by a build engineer; or the procedure can be
>>> applied to a code base with 10,000 apps; or the threaded-ness can change; or
>>> the app can work in different modes; or the app can have different phases.
>>>
>>>
>>> We have forgotten to mention the benefit of implementation simplicity. If
>>> the static/simple solution solves the problem for most of the use cases,
>>> designing a fancy dynamic solution sounds like over-engineering to me. It
>>> (the overhead reduction technique) may also get in the way of future
>>> functional enhancements.
>



