[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)

Dmitry Vyukov dvyukov at google.com
Thu Apr 24 01:33:54 PDT 2014


On Wed, Apr 23, 2014 at 10:48 PM, Bob Wilson <bob.wilson at apple.com> wrote:
> On Apr 23, 2014, at 7:31 AM, Kostya Serebryany <kcc at google.com> wrote:
>
> I've run one proprietary benchmark that reflects a large portion of
> Google's server-side code.
> -fprofile-instr-generate leads to a 14x slowdown due to counter contention.
> That's serious.
> Admittedly, there is a single hot function that accounts for half of that
> slowdown, but even if I rebuild that function w/o -fprofile-instr-generate,
> the slowdown remains above 5x.
> This is not toy code that I've written to prove my point -- this is real
> code one may want to profile with -fprofile-instr-generate.
> We need another approach for threaded code.
>
> There is another unfortunate feature of the current instrumentation.
> Consider this function:
> std::vector<int> v(1000);
> void foo() { v[0] = 42; }
>
> Here we have a single basic block and a call, but since the coverage
> counters are emitted by the FE before inlining (and are also emitted for
> std::vector methods), we get this assembly at -O2:
> 0000000000400b90 <_Z3foov>:
>   400b90: 48 ff 05 11 25 20 00   incq   0x202511(%rip)   # 6030a8 <__llvm_profile_counters__Z3foov>
>   400b97: 48 ff 05 42 25 20 00   incq   0x202542(%rip)   # 6030e0 <__llvm_profile_counters__ZNSt6vectorIiSaIiEEixEm>
>   400b9e: 48 8b 05 4b 26 20 00   mov    0x20264b(%rip),%rax   # 6031f0 <v>
>   400ba5: c7 00 2a 00 00 00      movl   $0x2a,(%rax)
>   400bab: c3                     retq
>
> Suddenly, an innocent function that uses std::vector becomes a terrible
> point of contention.
> Full test case below; -fprofile-instr-generate leads to a 10x slowdown.
>
> =========================
>
> Now, here is a more detailed proposal for the logarithmic self-cooling
> counter mentioned before. Please comment.
> The counter is a number of the form (2^k-1).
> It starts at 0.
> After the first update it is 1.
> After *approximately* 1 more update it becomes 3.
> After *approximately* 2 more updates it becomes 7.
> After *approximately* 4 more updates it becomes 15.
> ...
> After *approximately* 2^k more updates it becomes 2^(k+2)-1.
> (Summing these, the total number of updates needed to reach a given value
> is within a small constant factor of the value itself, which is why the
> counter approximates the execution count.)
>
> The code would look like this:
>   if ((fast_thread_local_rand() & counter) == 0)
>     counter = 2 * counter + 1;
>
> Possible implementation for fast_thread_local_rand:
> long fast_thread_local_rand() {
>   static __thread long r;
>   return r++;
> }
> Although I would try to find something cheaper than this. (Ideas?)
>
>
> The counter is not precise (well, the current racy counters are not precise
> either).
> But statistically it should be no more than 2x away from the real count.
> Will this accuracy be enough for the major use cases?
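>
> A quick way to sanity-check that claim is to simulate the scheme with a
> real RNG. This is a standalone test harness, not the proposed
> instrumentation (with the cheap sequential "rand" above the statistics
> would differ somewhat):
>
> #include <cstdio>
> #include <random>
>
> int main() {
>   std::mt19937_64 rng(1);
>   for (unsigned long true_count : {1000UL, 1000000UL, 100000000UL}) {
>     unsigned long counter = 0;
>     for (unsigned long i = 0; i < true_count; i++)
>       if ((rng() & counter) == 0)   // passes with probability 1/(counter+1)
>         counter = 2 * counter + 1;
>     std::printf("true=%lu counter=%lu ratio=%.2f\n", true_count, counter,
>                 (double)counter / true_count);
>   }
> }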
>
> Moreover, this approach allows us to implement the counter increment using
> a callback:
>   if ((fast_thread_local_rand() & counter) == 0)
>     __cov_increment(&counter);
> which in turn will let us use the same hack as in AsanCoverage: use the PC
> to map the counter to the source code.
> (= no need to create a separate section in the objects).
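>
> For illustration, here is a minimal sketch of what such a callback could
> look like (record_site is a hypothetical runtime hook; none of this is an
> existing interface):
>
> void record_site(void *pc, long *counter);  // hypothetical runtime hook
>
> extern "C" void __cov_increment(long *counter) {
>   // The return address identifies the instrumented site, so the runtime
>   // can map this counter back to the source (as AsanCoverage does)
>   // without emitting a separate section in the object file.
>   void *pc = __builtin_return_address(0);
>   record_site(pc, counter);  // remember the (PC, counter) pair
>   *counter = 2 * *counter + 1;
> }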
>
> Thoughts?
>
> —kcc
>
>
> I can see that the behavior of our current instrumentation is going to be a
> problem for the kinds of applications that you’re looking at. If you can
> find a way to get the overhead down without losing accuracy

What are your requirements for accuracy?
The current implementation does not provide 100% accuracy either, so the
requirement must be something less than 100%. What is it?
What use cases for numeric counters (as opposed to a boolean flag) do we
need to support? Is it only feedback-driven optimizations?


> and without
> hurting the performance for applications without significant contention,

What is the acceptable threshold? 0.01% would be fine, right? What is
the maximum overhead that you are willing to accept?


> then we can just adopt that. If you can’t do it without tradeoffs, then we
> should have a separate option to let those who care switch between different
> kinds of instrumentation.
>
> Using the PC to map to the source code is simply not going to work with
> -fprofile-instr-generate. The mapping from counter values to the
> user-visible execution counts is complex and relies on detailed knowledge of
> the clang ASTs.
>
>
>
> % clang++ -O2 -lpthread coverage_mt_vec.cc && time ./a.out
> TIME: real: 0.219; user: 0.430; system: 0.000
> % clang++ -O2 -lpthread -fprofile-instr-generate coverage_mt_vec.cc && time
> ./a.out
> TIME: real: 3.743; user: 7.280; system: 0.000
>
> % cat coverage_mt_vec.cc
> #include <pthread.h>
> #include <vector>
>
> std::vector<int> v(1000);
>
> __attribute__((noinline)) void foo() { v[0] = 42; }
>
> void *Thread1(void *) {
>   for (int i = 0; i < 100000000; i++)
>     foo();
>   return 0;
> }
>
> __attribute__((noinline)) void bar() { v[999] = 66; }
>
> void *Thread2(void *) {
>   for (int i = 0; i < 100000000; i++)
>     bar();
>   return 0;
> }
>
> int main() {
>   static const int kNumThreads = 16;
>   pthread_t t[kNumThreads];
>   pthread_create(&t[0], 0, Thread1, 0);
>   pthread_create(&t[1], 0, Thread2, 0);
>   pthread_join(t[0], 0);
>   pthread_join(t[1], 0);
>   return 0;
> }
>
>
>
>
> On Fri, Apr 18, 2014 at 11:45 PM, Xinliang David Li <xinliangli at gmail.com>
> wrote:
>>
>>
>>
>>
>> On Fri, Apr 18, 2014 at 12:13 AM, Dmitry Vyukov <dvyukov at google.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> This is a long thread, so I will combine several comments into a single
>>> email.
>>>
>>>
>>> >> - 8-bit per-thread counters, dumping into central counters on
>>> >> overflow.
>>> > The overflow will happen very quickly with an 8-bit counter.
>>>
>>> Yes, but it reduces contention by 256x (a thread must execute at least
>>> 256 increments before it touches the shared counter). In practice, once
>>> contention drops below some threshold, it stops being a problem.
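>>>
>>> A minimal sketch of that scheme, assuming the GCC/Clang atomic builtins
>>> (the per-site thread-local layout is a placeholder, not a concrete
>>> design):
>>>
>>> static __thread unsigned char local;  // 1 byte per counter per thread
>>> static long long central;             // shared 64-bit counter
>>>
>>> inline void increment() {
>>>   if (++local == 0)  // wrapped around: 256 increments accumulated
>>>     __atomic_fetch_add(&central, 256, __ATOMIC_RELAXED);
>>> }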
>>>
>>>
>>>
>>> >> - per-thread counters. Solves the problem at huge cost in RAM
>>> >> per-thread
>>> >It is not practical. Especially for TLS counters -- it creates huge
>>> > pressure on stack memory.
>>>
>>> Do we have any numbers on the number of counters? If there are 100K
>>> 1-byte counters, I would consider it practical.
>>>
>>
>> A medium-sized app I looked at has about 10M counters (arcs only). It is
>> also not uncommon to see such apps running with hundreds of threads.
>> (Even at one byte per counter that is 10MB per thread; with 8-byte
>> counters and hundreds of threads it adds up to many gigabytes.)
>>
>>
>>>
>>> > In Google GCC, we implemented another technique which has proven to be
>>> > very effective -- it is called FDO sampling.
>>> > Basically, counters are updated only once every N samples.
>>>
>>> How does it work?
>>
>>
>> Similar to how occurrence-based PMU sampling works. Setting the sampling
>> period to 100 can reduce the instrumentation overhead by close to 100x
>> without introducing much precision loss.
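>>
>> A rough sketch of the mechanism, assuming a period of 100 (the names and
>> the thread-local countdown are illustrative; the Google GCC implementation
>> may differ in detail):
>>
>> static __thread int countdown = 100;  // sampling period N = 100
>>
>> inline void profile_edge(long long *counter) {
>>   if (--countdown <= 0) {
>>     countdown = 100;
>>     *counter += 100;  // attribute one full period to this edge
>>   }
>> }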
>>
>>>
>>>
>>>
>>>
>>> >> It seems to me like we’re going to have a hard time getting good
>>> >> multithreaded performance without significant impact on the single-threaded
>>> >> behavior.
>>> > I don't really agree.
>>>
>>>
>>> > We are talking about developers here. Nobody would know the exact
>>> > thread counts, but developers know the ballpark number.
>>>
>>> I strongly believe that we must relieve developers of this choice at
>>> build time, and do our best to auto-tune (if the final scheme requires
>>> tuning).
>>
>>
>>
>> That really depends. If the tuning space is small, it won't be a problem
>> for the developer/builder.
>>
>>
>>>
>>> First, such questions put an unnecessary burden on developers. We don't
>>> ask what register allocation algorithm to use for each function, right?
>>
>>
>> Crazy developers can actually do that via internal options, but this is a
>> totally different case. People just need one flag to turn sharding on or
>> off. When sharding is on, the compiler can pick the best 'N' according to
>> some heuristic at compile time.
>>
>>
>>>
>>> Second, there is a significant chance we will get a wrong answer,
>>> because e.g. the developer's view of how threaded the app is can differ
>>> from reality or from our classification.
>>> Third, the app can be built by a build engineer; or the procedure can be
>>> applied to a code base with 10,000 apps; or the threadedness can change;
>>> or the app can work in different modes; or the app can have different
>>> phases.
>>>
>>
>> We have forgotten to mention the benefit of implementation simplicity. If
>> the static/simple solution solves the problem for most use cases,
>> designing a fancy dynamic solution sounds like over-engineering to me. It
>> (the overhead-reduction technique) may also get in the way of easier
>> functional enhancements in the future.
>>
>>  David
>>
>>



