[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
Duncan P. N. Exon Smith
dexonsmith at apple.com
Fri Apr 25 10:11:40 PDT 2014
On 2014-Apr-23, at 7:31, Kostya Serebryany <kcc at google.com> wrote:
> I've run one proprietary benchmark that reflects a large portion of Google's server-side code.
> -fprofile-instr-generate leads to 14x slowdown due to counter contention. That's serious.
> Admittedly, there is a single hot function that accounts for half of that slowdown,
> but even if I rebuild that function w/o -fprofile-instr-generate, the slowdown remains above 5x.
> This is not a toy code that I've written to prove my point -- this is real code one may want to profile with -fprofile-instr-generate.
> We need another approach for threaded code.
>
> There is another ungood feature of the current instrumentation. Consider this function:
> std::vector<int> v(1000);
> void foo() { v[0] = 42; }
>
> Here we have a single basic block and a call, but since the coverage is emitted by the
> FE before inlining (and is also emitted for std::vector methods) we get this assembler at -O2:
> 0000000000400b90 <_Z3foov>:
> 400b90: 48 ff 05 11 25 20 00 incq 0x202511(%rip) # 6030a8 <__llvm_profile_counters__Z3foov>
> 400b97: 48 ff 05 42 25 20 00 incq 0x202542(%rip) # 6030e0 <__llvm_profile_counters__ZNSt6vectorIiSaIiEEixEm>
> 400b9e: 48 8b 05 4b 26 20 00 mov 0x20264b(%rip),%rax # 6031f0 <v>
> 400ba5: c7 00 2a 00 00 00 movl $0x2a,(%rax)
> 400bab: c3 retq
>
> Suddenly, an innocent function that uses std::vector becomes a terrible point of contention.
I know you're just using std::vector<> as an example, but I think there
should be an option to avoid instrumenting the STL. STL
implementations have typically been hand-tuned already. I think the
benefit of PGO is extremely small there (although I have no numbers to
back this up).
> Full test case below, -fprofile-instr-generate leads to 10x slowdown.
>
> =========================
>
> Now, here is a more detailed proposal of logarithmic self-cooling counter mentioned before. Please comment.
> The counter is a number of the form (2^k-1).
> It starts with 0.
> After the first update it is 1.
> After *approximately* 1 more update it becomes 3
> After *approximately* 2 more updates it becomes 7
> After *approximately* 4 more updates it becomes 15
> ...
> After *approximately* 2^k more updates it becomes 2^(k+2)-1
>
> The code would look like this:
> if ((fast_thread_local_rand() & counter) == 0)
>   counter = 2 * counter + 1;
>
> Possible implementation for fast_thread_local_rand:
> long fast_thread_local_rand() {
>   static __thread long r;
>   return r++;
> }
> Although I would try to find something cheaper than this. (Ideas?)
Very cool.
> The counter is not precise (well, the current racy counters are not precise either).
> But statistically it should be no more than 2x away from the real counter.
> Will this accuracy be enough for the major use cases?
I really like this idea. I think the loss of accuracy should be
opt-in (-fprofile-instr-generate=fuzzy), but I think this should be
available.
I think this could be cleanly integrated into -fprofile-instr-generate
without changing -fprofile-instr-use, compiler-rt, or llvm-profdata.
Having this option looks like a clear win to me.
> Moreover, this approach lets us implement the counter increment using a callback:
> if ((fast_thread_local_rand() & counter) == 0) __cov_increment(&counter);
> which in turn will let us use the same hack as in AsanCoverage: use the PC to map the counter to the source code.
> (= no need to create separate section in the objects).
As Bob said already, the PC hack doesn't work for the frontend-based
instrumentation. I'm skeptical even about relying on debug info for
retrieving function names. But optimizing the binary size is a
separate topic anyway.