[PATCH] tsan: do not instrument not captured values

Wed Jan 21 23:38:59 PST 2015

On Thu, Jan 22, 2015 at 4:40 AM, Chandler Carruth <chandlerc at google.com> wrote:
>
> On Tue, Jan 20, 2015 at 8:54 AM, Dmitry Vyukov <dvyukov at google.com> wrote:
>>
>> Hi kcc, samsonov, eugenis,
>>
>> Consider the following code:
>>
>> P p;
>> p.x = 1;
>> p.y = 2;
>> foo(&p);
>>
>> P escapes and so tsan instruments accesses to it. However, before address
>> of the variable is actually taken, it is not shared and so cannot
>> participate in data races. So there is no point in instrumenting these
>> initial memory accesses.
>
>
> There can be races here.
>
> Imagine:
>
> foo(P *p) {
>   racy_queue.relaxed_atomic_enqueue(p);
> }
>
> and then thread B:
>
> bar() {
>   P *p = racy_queue.relaxed_atomic_dequeue();
>   read(p.x);
>   ...
> }
>
> Here we don't have a race in my racy_queue because internally it uses
> relaxed atomics. However it never establishes a happens-before edge, and so
> the code in bar that reads this should be flagged as a race by TSan.
>
> This isn't just a hypothetical, I have seen TSan catch this in more than one
> project that had a bad lock-free queue implementation. This is especially
> dangerous on x86 because the machine instructions used for relaxed store and
> load are actually enough to establish the necessary memory ordering
> guarantee.
>
> While we should definitely coalesce the TSan instrumentation to be a single
> instrumented write preceding the capture, I don't think we should remove
> this entirely.

You are right.
But this optimization is too fruitful to just discard it. So fruitful
(-40% of instrumentation in a large webrtc test) that I am inclining
towards ignoring the possibility of passing the object using relaxed
atomics... On the other hand people do mess memory ordering, so losing
these races is pity as well...
Maybe we can figure out a way to get both at least in most cases. Few
observations:
1. Leaking of stack objects to other threads is very infrequent (I
would say 1%).
2. Whole stack is generally touched by the current thread, so there
are chances that we will still detect the race.
3. I suspect that just deduplicating memory accesses before capturing
will remove very few accesses, because the accesses before capturing
are the initial initialization of the object (write each member var
only once).
4. Perhaps we can tsan-write whole stack frame in function prologue
using __tsan_write8, and then ignore individual accesses before
capturing. This is somewhat similar to 3, but may be more efficient
(uses 8-byte writes and write once instead of on each iteration of a
loop). Is it possible to do in llvm? Do you think it is sufficient?