<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jan 21, 2015 at 11:38 PM, Dmitry Vyukov <span dir="ltr">&lt;<a href="mailto:dvyukov@google.com" target="_blank">dvyukov@google.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":347" class="a3s" style="overflow:hidden">You are right.<br>

But this optimization is too fruitful to just discard it. So fruitful<br>

(-40% of instrumentation in a large webrtc test) that I am inclining<br>

towards ignoring the possibility of passing the object using relaxed<br>

atomics... On the other hand people do mess memory ordering, so losing<br>

these races is pity as well...<br></div></blockquote><div><br></div><div>It would make me very sad to lose this feature of TSan. Of all the subtle racy-queue techniques I have seen or heard of over the years, the one I cited is actually one of the few that I have seen debugged specifically through the use of TSan.</div><div><br></div><div>I also fear losing it in small part because it is a specific portability risk between x86 and weak memory architectures, one of the biggest features of TSan for me.</div><div><br></div><div>But it's wild that this is 40% of the instrumentation in a large webrtc test. That seems to clearly indicate that there is *something* to be done here, but I don't know yet what that is... so:</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":347" class="a3s" style="overflow:hidden">

Maybe we can figure out a way to get both at least in most cases.</div></blockquote><div><br></div><div>That would be my hope as well. =]</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":347" class="a3s" style="overflow:hidden"> Few<br>

observations:<br>

1. Leaking of stack objects to other threads is very infrequent (I<br>

would say 1%).<br></div></blockquote><div><br></div><div>Infrequent relative to *captured* stack objects? Yes, but I'm not sure how infrequent really. Mostly this is because I expect most stack objects to never be captured.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":347" class="a3s" style="overflow:hidden">

2. Whole stack is generally touched by the current thread, so there<br>

are chances that we will still detect the race.<br></div></blockquote><div><br></div><div>I'm not really sure why? The cases where I have seen this (and I went looking and found a few others) all look like fresh stack allocations that are written to and then passed off to another thread. I wouldn't expect anything else (in common cases) to write to the stack without there being some "synchronization" that hides any race.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":347" class="a3s" style="overflow:hidden">

3. I suspect that just deduplicating memory accesses before capturing<br>

will remove very few accesses, because the accesses before capturing<br>

are the initial initialization of the object (write each member var<br>

only once).<br></div></blockquote><div><br></div><div>I wasn't sure what the cost function of the tsan instrumentation was, and whether it is significantly cheaper to do a single instrumentation call for N*M bytes or N calls for M bytes each.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":347" class="a3s" style="overflow:hidden">

4. Perhaps we can tsan-write whole stack frame in function prologue<br>

using __tsan_write8, and then ignore individual accesses before<br>

capturing. This is somewhat similar to 3, but may be more efficient<br>

(uses 8-byte writes and write once instead of on each iteration of a<br>

loop). Is it possible to do in llvm? Do you think it is sufficient?</div></blockquote><div><br></div><div>Interesting. This definitely seems possible. I think this is very similar to my idea for #3 -- essentially to coalesce all instrumentation of all static (non-dynamic in the LLVM IR lingo) stack allocations.</div><div><br></div><div>But I think this is a more general idea. There is a *lot* that we could do to coalesce instrumentation and reduce it, moving it outside of loops, etc.</div><div><br></div><div>I think it would be useful to actually look at the specific instrumentation patterns that are coming up most frequently in real-world code...</div></div></div></div>
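<div><br></div><div>For concreteness, here is a minimal sketch of the kind of hand-off I have in mind: a fresh stack object is written with plain stores and then published to another thread through an atomic pointer. The names (Message, hand_off) and the single-slot "queue" are hypothetical, made up for illustration; this is not the code from the original report, just my assumption about the general shape of the pattern.</div>

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical payload type, for illustration only.
struct Message {
  int payload;
};

// Hands a stack-allocated Message to a consumer thread through an atomic
// pointer and returns the value the consumer read.
//
// store_order must be a valid store ordering (relaxed, release, seq_cst) and
// load_order a valid load ordering (relaxed, acquire, seq_cst).
int hand_off(std::memory_order store_order, std::memory_order load_order) {
  std::atomic<Message*> slot{nullptr};  // single-slot "queue"
  int seen = 0;

  std::thread consumer([&] {
    Message* m = nullptr;
    // Spin until the producer publishes a pointer.
    while ((m = slot.load(load_order)) == nullptr) {
      std::this_thread::yield();
    }
    seen = m->payload;  // plain read of the producer's stack object
  });

  Message msg;       // fresh stack allocation
  msg.payload = 42;  // plain write before the object escapes
  slot.store(&msg, store_order);

  // join() keeps msg alive until the consumer is finished and makes the
  // consumer's write to `seen` visible to this thread.
  consumer.join();
  return seen;
}
```

<div>With release/acquire, the write to msg.payload happens-before the consumer's read. Passing std::memory_order_relaxed for both orderings removes that edge: the plain write and the plain read then race, which is the case a build with -fsanitize=thread can only report if the write before the capture is still instrumented.</div>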