[PATCH] Make LLVM profiling thread-safe

Thu Jun 27 14:04:11 PDT 2013

On Jun 27, 2013, at 12:55 PM, Matthew Dempsky <matthew at dempsky.org> wrote:

> On Wed, Jun 26, 2013 at 7:01 PM, Eric Christopher <echristo at gmail.com> wrote:
>> Oh totally. I don't have enough of a feel here to know what we should expect.
> 
> Seeing as there's not much documentation on how to use it or what it
> does, I imagine *most* people don't know what to expect. :P
> 
>> Anyone know what other compilers do in this space? What consumers of
>> the data would expect? Maybe Matthew has a use case that'll be
>> instructive here of what he was trying to accomplish? :)
> 
> My immediate use case is fuzzing programs guided by edge coverage, for
> which I really only care about distinguishing zero/non-zero values, so
> atomic increments aren't even necessary for me since racing increments
> will still at least leave a non-zero value.
> 
> But generally, I just expect tools to provide correct/precise behavior
> by default, and then provide options to trade off between
> correctness/precision for performance when necessary.  

If a profiler significantly perturbs execution time, it can also perturb behavior, particularly in concurrent/parallel code.  So any instrumentation introduces some imprecision, and reducing overhead improves precision too: the less the instrumentation slows the program down, the less likely the instrumented program is to behave differently from the uninstrumented one.

> E.g., on x86-64
> there's no extra cost for atomic increments, so they should always be
> used.

Atomic increments are significantly more expensive (in terms of execution time) than non-atomic increments on x86-64.  It's at least a 4x difference on recent Intel hardware (an i7 of some kind).  It's true that they don't require significantly more instructions (you just need a LOCK prefix), but instruction count is not the whole story.  Execution time matters in profiling.
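
A minimal sketch of the two kinds of increment, for anyone who wants to measure this on their own machine.  It is not the code the profiling passes emit; the names (Counter, racy_increment, atomic_increment, run_threads, ns_per_increment) are made up for illustration.  The non-atomic variant is written as a separate relaxed load and store so that losing updates is well-defined C++ rather than a data race.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for one edge counter written by instrumentation.
using Counter = std::atomic<std::uint64_t>;

// Load, add, store as separate steps. Concurrent updates can be lost,
// but every store writes a value >= 1, so the counter never ends up zero.
inline void racy_increment(Counter &c) {
  c.store(c.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
}

// One atomic read-modify-write (a LOCK-prefixed instruction on x86-64).
inline void atomic_increment(Counter &c) {
  c.fetch_add(1, std::memory_order_relaxed);
}

// Run `inc` per_thread times on each of `threads` threads sharing one
// counter, and return the final counter value.
template <typename Inc>
std::uint64_t run_threads(Inc inc, unsigned threads, std::uint64_t per_thread) {
  assert(threads > 0);
  Counter c{0};
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < threads; ++t)
    pool.emplace_back([&c, inc, per_thread] {
      for (std::uint64_t i = 0; i < per_thread; ++i) inc(c);
    });
  for (auto &th : pool) th.join();
  return c.load();
}

// Single-threaded cost estimate in nanoseconds per increment.
template <typename Inc>
double ns_per_increment(Inc inc, std::uint64_t n) {
  assert(n > 0);
  Counter c{0};
  auto start = std::chrono::steady_clock::now();
  for (std::uint64_t i = 0; i < n; ++i) inc(c);
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(stop - start).count() / n;
}
```

Comparing ns_per_increment(racy_increment, n) against ns_per_increment(atomic_increment, n) gives a rough uncontended ratio; the exact number depends on the hardware and compiler flags.  run_threads(atomic_increment, ...) always returns the exact total, while run_threads(racy_increment, ...) may return less but never zero, which is the property the fuzzing use case relies on.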

> 
> On ARM, it costs an extra compare and conditional branch instruction
> to change the load/increment/store into a LL/SC loop (but no memory
> barrier).  ARM's description of LDREX/STREX and exclusive monitors
> seems too imprecise to indicate whether looping is necessary for
> regular contention or just for context switches/etc, but it seems to
> imply the latter to me (otherwise two LL/SC loops could live-lock).
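
For reference, the shape of that loop can be sketched portably with a weak compare-exchange, which is allowed to fail spuriously in the same way a store-exclusive can.  This is only an illustration: cas_loop_increment and count_with_cas_loop are hypothetical names, and I am assuming (not asserting) that a compiler targeting an ARM core without single-instruction atomics lowers this to an LDREX/STREX retry loop.  Relaxed ordering is used to match the "no memory barrier" case.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: increment through a retry loop, mirroring the
// load-linked / store-conditional shape.
inline void cas_loop_increment(std::atomic<std::uint32_t> &counter) {
  // Plays the role of the load-linked step.
  std::uint32_t seen = counter.load(std::memory_order_relaxed);
  // Plays the role of the store-conditional plus compare-and-branch.
  // On failure (real contention or a spurious failure),
  // compare_exchange_weak reloads `seen` and the loop retries.
  while (!counter.compare_exchange_weak(seen, seen + 1,
                                        std::memory_order_relaxed,
                                        std::memory_order_relaxed)) {
  }
}

// Increment one shared counter from several threads and return the total.
inline std::uint32_t count_with_cas_loop(unsigned threads,
                                         std::uint32_t per_thread) {
  assert(threads > 0);
  std::atomic<std::uint32_t> counter{0};
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < threads; ++t)
    pool.emplace_back([&counter, per_thread] {
      for (std::uint32_t i = 0; i < per_thread; ++i)
        cas_loop_increment(counter);
    });
  for (auto &th : pool) th.join();
  return counter.load();
}
```

Because the loop retries on any failure, it produces an exact count whether failures come from contention or from context switches; which of the two actually triggers retries on a given core is the part the ARM documentation leaves open.
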
> 
> Either way, if someone demonstrates precise profiling slows down their
> code too much because of the atomic increments, it shouldn't be hard
> to add an ApproximateEdgeProfiling pass variant or something.
> 
> My 2c. :)