[PATCH] Make LLVM profiling thread-safe

Wed Jun 26 18:59:33 PDT 2013

On Jun 26, 2013, at 6:56 PM, Eric Christopher <echristo at gmail.com> wrote:

> On Wed, Jun 26, 2013 at 6:48 PM, Filip Pizlo <fpizlo at apple.com> wrote:
>> This is an interesting change.  The value of this change depends on the intent of the profiling.  If the profiling is merely intended to inform the compiler about what is hot, then it doesn't need to be so precise.  In particular, even without your patch, the profiling is in the right ballpark - it's likely enough for any optimizations that rely on execution counts.  So, for such optimizations, the only effect of your change is that profiling becomes significantly more costly. Atomic RMW operations ain't free!
>> 
>> OTOH, if you wanted to do an analysis that required exact counts for *some* execution, then of course your change is valuable.
>> 
>> My experience with profiling is of the former variety - I only care about ballpark counts.  So, from that standpoint, I would say that this patch isn't the right direction.
>> 
> 
> Hmm.. I think if we're instrumenting then we want full fidelity yes?
> Otherwise we may as well use an external sample based profiler yes?

Not necessarily.  There is a sliding scale of fidelity:

Totally accurate: Matthew's patch with atomic RMW counts.
Less accurate: racy counts (the status quo)
Even less accurate: sampling.

There is a huge gap in fidelity between racy counts and sampling.  Sampling may report zero information for code that did run but not often enough, while racy counts will still report counts for rarely-ran code, and those counts are likely to be within the ballpark of a totally accurate profiler.

That's a pretty big difference, I think.

-Filip

> 
> -eric
> 
>> -Filip
>> 
>> 
>> On Jun 26, 2013, at 6:28 PM, Matthew Dempsky <matthew at dempsky.org> wrote:
>> 
>>> On Mon, Jun 24, 2013 at 03:25:14PM -0700, Matthew Dempsky wrote:
>>>> On Mon, Jun 24, 2013 at 12:17:17PM -0700, Matthew Dempsky wrote:
>>>>> Instead of emitting separate read-modify-write instructions to
>>>>> increment the profiling counters, an monotonically ordered atomic add
>>>>> instruction should be emitted.
>>>> 
>>>> Oops, need to fix the tests too!
>>> 
>>> Ping?  Any interest in this fix?  (I'm not an LLVM committer, btw.)
>>> 
>>> To demonstrate this is an actual issue, I wrote this sample program:
>>> 
>>>   #include <pthread.h>
>>> 
>>>   void
>>>   noop(int i)
>>>   {
>>>   }
>>> 
>>>   void *
>>>   worker(void *arg)
>>>   {
>>>      int i, j;
>>> 
>>>      for (i = 0; i < 1000000; i++)
>>>              noop(i);
>>> 
>>>      return (NULL);
>>>   }
>>> 
>>>   int
>>>   main()
>>>   {
>>>      int i;
>>>      pthread_t thr[10];
>>> 
>>>      for (i = 0; i < 10; i++)
>>>              if (pthread_create(&thr[i], NULL, worker, NULL))
>>>                      return (1);
>>>      for (i = 0; i < 10; i++)
>>>              if (pthread_join(thr[i], NULL))
>>>                      return (1);
>>> 
>>>      return (0);
>>>   }
>>> 
>>> Obviously, noop() should be entered exactly 10 million times, but
>>> that's not what llvm-prof tells me currently:
>>> 
>>> $ clang -emit-llvm -c stress.c
>>> $ opt -insert-edge-profiling < stress.o > stress-edge.bc
>>> $ clang -o stress stress-edge.bc -L/usr/local/lib -lprofile_rt -lpthread
>>> /usr/local/lib/libprofile_rt.so: warning: strcpy() is almost always misused, please use strlcpy()
>>> /usr/local/lib/libprofile_rt.so: warning: strcat() is almost always misused, please use strlcat()
>>> $ rm -f llvmprof.out
>>> $ ./stress
>>> $ llvm-prof stress-edge.bc llvmprof.out
>>> ===-------------------------------------------------------------------------===
>>> LLVM profiling output for execution:
>>> 
>>> 
>>> ===-------------------------------------------------------------------------===
>>> Function execution frequencies:
>>> 
>>> ##   Frequency
>>> 1. 5.3e+06/5.31039e+06 noop
>>> 2.    10/5.31039e+06 worker
>>> 3.     1/5.31039e+06 main
>>> 
>>> ===-------------------------------------------------------------------------===
>>> Top 20 most frequently executed basic blocks:
>>> 
>>> ##      %%    Frequency
>>> 1. 25.9849% 5421453/2.08638e+07      worker() - for.inc
>>> 2. 25.4526% 5310376/2.08638e+07      noop() - entry
>>> 3. 24.4395% 5099022/2.08638e+07      worker() - for.body
>>> 4. 24.1225% 5032868/2.08638e+07      worker() - for.cond
>>> 5. 5.27228e-05%    11/2.08638e+07    main() - for.cond
>>> 6. 5.27228e-05%    11/2.08638e+07    main() - for.cond1
>>> 7. 4.79298e-05%    10/2.08638e+07    main() - for.inc10
>>> 8. 4.79298e-05%    10/2.08638e+07    main() - if.end9
>>> 9. 4.79298e-05%    10/2.08638e+07    main() - for.body3
>>> 10. 4.79298e-05%    10/2.08638e+07    main() - for.inc
>>> 11. 4.79298e-05%    10/2.08638e+07    main() - if.end
>>> 12. 4.79298e-05%    10/2.08638e+07    main() - for.body
>>> 13. 4.79298e-05%    10/2.08638e+07    worker() - for.end
>>> 14. 4.79298e-05%    10/2.08638e+07    worker() - entry
>>> 15. 4.79298e-06%     1/2.08638e+07    main() - for.end
>>> 16. 4.79298e-06%     1/2.08638e+07    main() - entry
>>> 17. 4.79298e-06%     1/2.08638e+07    main() - for.end12
>>> 18. 4.79298e-06%     1/2.08638e+07    main() - return
>>> 
>>> 
>>> With my patch, this is the output I get instead:
>>> 
>>> $ clang -emit-llvm -c stress.c
>>> $ opt -insert-edge-profiling < stress.o > stress-edge.bc
>>> $ clang -o stress stress-edge.bc -L/usr/local/lib -lprofile_rt -lpthread
>>> /usr/local/lib/libprofile_rt.so: warning: strcpy() is almost always misused, please use strlcpy()
>>> /usr/local/lib/libprofile_rt.so: warning: strcat() is almost always misused, please use strlcat()
>>> $ rm -f llvmprof.out
>>> $ ./stress
>>> $ llvm-prof stress-edge.bc llvmprof.out
>>> ===-------------------------------------------------------------------------===
>>> LLVM profiling output for execution:
>>> 
>>> 
>>> ===-------------------------------------------------------------------------===
>>> Function execution frequencies:
>>> 
>>> ##   Frequency
>>> 1. 1e+07/1e+07 noop
>>> 2.    10/1e+07 worker
>>> 3.     1/1e+07 main
>>> 
>>> ===-------------------------------------------------------------------------===
>>> Top 20 most frequently executed basic blocks:
>>> 
>>> ##      %%    Frequency
>>> 1.    25% 10000010/4.00001e+07       worker() - for.cond
>>> 2. 24.9999% 10000000/4.00001e+07     noop() - entry
>>> 3. 24.9999% 10000000/4.00001e+07     worker() - for.body
>>> 4. 24.9999% 10000000/4.00001e+07     worker() - for.inc
>>> 5. 2.74999e-05%    11/4.00001e+07    main() - for.cond
>>> 6. 2.74999e-05%    11/4.00001e+07    main() - for.cond1
>>> 7. 2.49999e-05%    10/4.00001e+07    main() - for.inc10
>>> 8. 2.49999e-05%    10/4.00001e+07    main() - if.end9
>>> 9. 2.49999e-05%    10/4.00001e+07    main() - for.body3
>>> 10. 2.49999e-05%    10/4.00001e+07    main() - for.inc
>>> 11. 2.49999e-05%    10/4.00001e+07    main() - if.end
>>> 12. 2.49999e-05%    10/4.00001e+07    main() - for.body
>>> 13. 2.49999e-05%    10/4.00001e+07    worker() - for.end
>>> 14. 2.49999e-05%    10/4.00001e+07    worker() - entry
>>> 15. 2.49999e-06%     1/4.00001e+07    main() - for.end
>>> 16. 2.49999e-06%     1/4.00001e+07    main() - entry
>>> 17. 2.49999e-06%     1/4.00001e+07    main() - for.end12
>>> 18. 2.49999e-06%     1/4.00001e+07    main() - return
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>> 
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits