[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
Duncan P. N. Exon Smith
dexonsmith at apple.com
Fri Apr 25 10:11:40 PDT 2014
On 2014-Apr-23, at 7:31, Kostya Serebryany <kcc at google.com> wrote:
> I've run one proprietary benchmark that reflects a large portion of Google's server-side code.
> -fprofile-instr-generate leads to 14x slowdown due to counter contention. That's serious.
> Admittedly, there is a single hot function that accounts for half of that slowdown,
> but even if I rebuild that function w/o -fprofile-instr-generate, the slowdown remains above 5x.
> This is not a toy code that I've written to prove my point -- this is real code one may want to profile with -fprofile-instr-generate.
> We need another approach for threaded code.
>
> There is another ungood feature of the current instrumentation. Consider this function:
> std::vector<int> v(1000);
> void foo() { v[0] = 42; }
>
> Here we have a single basic block and a call, but since the coverage is emitted by the
> FE before inlining (and is also emitted for std::vector methods) we get this assembler at -O2:
> 0000000000400b90 <_Z3foov>:
> 400b90: 48 ff 05 11 25 20 00 incq 0x202511(%rip) # 6030a8 <__llvm_profile_counters__Z3foov>
> 400b97: 48 ff 05 42 25 20 00 incq 0x202542(%rip) # 6030e0 <__llvm_profile_counters__ZNSt6vectorIiSaIiEEixEm>
> 400b9e: 48 8b 05 4b 26 20 00 mov 0x20264b(%rip),%rax # 6031f0 <v>
> 400ba5: c7 00 2a 00 00 00 movl $0x2a,(%rax)
> 400bab: c3 retq
>
> Suddenly, an innocent function that uses std::vector becomes a terrible point of contention.
I know you're just using std::vector<> as an example, but I think there
should be an option to avoid instrumenting the STL. STL
implementations have typically been hand-tuned already. I think the
benefit of PGO is extremely small there (although I have no numbers to
back this up).
> Full test case below, -fprofile-instr-generate leads to 10x slowdown.
>
> =========================
>
> Now, here is a more detailed proposal of logarithmic self-cooling counter mentioned before. Please comment.
> The counter is a number of the form (2^k-1).
> It starts with 0.
> After the first update it is 1.
> After *approximately* 1 more update it becomes 3
> After *approximately* 2 more updates it becomes 7
> After *approximately* 4 more updates it becomes 15
> ...
> After *approximately* 2^k more updates it becomes 2^(k+2)-1
>
> The code would look like this:
> if ((fast_thread_local_rand() & counter) == 0)
>   counter = 2 * counter + 1;
>
> Possible implementation for fast_thread_local_rand:
> long fast_thread_local_rand() {
>   static __thread long r;
>   return r++;
> }
> Although I would try to find something cheaper than this. (Ideas?)
Very cool.
> The counter is not precise (well, the current racy counters are not precise either).
> But statistically it should be no more than 2x away from the real counter.
> Will this accuracy be enough for the major use cases?
I really like this idea. I think the loss of accuracy should be
opt-in (-fprofile-instr-generate=fuzzy), but I think this should be
available.
I think this could be cleanly integrated into -fprofile-instr-generate
without changing -fprofile-instr-use, compiler-rt, or llvm-profdata.
Having this option looks like a clear win to me.
> Moreover, this approach lets us implement the counter increment using a callback:
> if ((fast_thread_local_rand() & counter) == 0) __cov_increment(&counter);
> which in turn will let us use the same hack as in AsanCoverage: use the PC to map the counter to the source code.
> (= no need to create separate section in the objects).
As Bob said already, the PC hack doesn't work for the frontend-based
instrumentation. I'm skeptical even about relying on debug info for
retrieving function names. But optimizing the binary size is a
separate topic anyway.