<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Nov 21, 2013 at 11:14 AM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div class="im">----- Original Message -----<br>

> From: "Kostya Serebryany" <<a href="mailto:kcc@google.com">kcc@google.com</a>><br>

</div><div class="im">> To: "David Chisnall" <<a href="mailto:David.Chisnall@cl.cam.ac.uk">David.Chisnall@cl.cam.ac.uk</a>><br>

> Cc: "LLVM Developers Mailing List" <<a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a>><br>

</div><div class="im">> Sent: Tuesday, November 19, 2013 11:01:12 PM<br>

> Subject: Re: [LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with<br>

> MemorySanitizer)<br>

><br>

><br>

><br>

><br>

><br>

><br>

><br>

><br>

</div><div><div class="h5">> On Tue, Nov 19, 2013 at 10:20 PM, David Chisnall <<br>

> <a href="mailto:David.Chisnall@cl.cam.ac.uk">David.Chisnall@cl.cam.ac.uk</a> > wrote:<br>

><br>

><br>

><br>

> On 19 Nov 2013, at 17:58, Kostya Serebryany < <a href="mailto:kcc@google.com">kcc@google.com</a> > wrote:<br>

><br>

> > On Tue, Nov 19, 2013 at 9:56 PM, Kuperstein, Michael M <<br>

> > <a href="mailto:michael.m.kuperstein@intel.com">michael.m.kuperstein@intel.com</a> > wrote:<br>

> >> What I’m trying to say is that according to my understanding of<br>

> >> the C++11 memory model, even in that small reproducer, the store<br>

> >> to g and the load from g are in fact a data race.<br>

> >><br>

> >> (This is regardless of the fact the load is protected by a branch<br>

> >> that is not taken.)<br>

> ><br>

> > My understanding of the standard is quite the opposite.<br>

><br>

> I agree. It is a potential data race, if the branch is taken then it<br>

> definitely is a race because the standard explicitly does not define<br>

> an ordering on loads and stores of non-atomic variables unless some<br>

> happens-before relationship is established by some other operation<br>

> (which does not occur in this example). In the case where the branch<br>

> is not taken, then there is no data race because in the C++ abstract<br>

> machine there is no load, whether there is one in the underlying<br>

> concrete machine or not.<br>

><br>

> I believe that this transform is valid, because there are only two<br>

> possible cases here:<br>

><br>

> - Some other code has established a happens-before relationship<br>

> between the load and the stores, giving a well-defined ordering,<br>

> however this code is not visible in the function foo() and so the<br>

> load may happen anywhere.<br>

><br>

> - No other code establishes a happens-before relationship, and<br>

> therefore the order of the load and the store is completely<br>

> undefined and as long as the load doesn't trap it is completely safe<br>

> to hoist it.<br>

><br>

> HOWEVER, I would dispute describing this transform as an optimisation<br>

> in this case.<br>

><br>

><br>

> You have a good point: evaluating such transformation on<br>

> single-threaded benchmarks (e.g. SPEC) makes no sense<br>

> if we care about multi-threaded code. And evaluating performance of<br>

> anything like this on a multi-threaded code is hard too.<br>

> And it's very easy to prepare a synthetic test case where this<br>

> optimization will cause 4x slowdown (see below).<br>

<br>

</div></div>Out of curiosity, can you explain why? Is this cache-line bouncing? Or just an increase in overall memory-bandwidth use?<br></blockquote><div><br></div><div>Exactly, this is cache-line bouncing</div><div> </div>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<span class=""><font color="#888888"><br>

 -Hal<br>

</font></span><div class=""><div class="h5"><br>

> But if we disable this transformation someone surely will come and<br>

> complain :)<br>

> Another data point: gcc 4.6.3 does this too.<br>

><br>

><br>

><br>

><br>

><br>

> ==> if-bench.cc <==<br>

> #include <pthread.h><br>

> #include <unistd.h><br>

><br>

><br>

> extern int foo(int);<br>

> extern void bar(int);<br>

><br>

><br>

> #ifndef N<br>

> #define N (1UL << 30)<br>

> #endif<br>

><br>

><br>

> void *Thread(void *p) {<br>

> for (long i = 0; i < N; i++)<br>

> foo(0);<br>

> return 0;<br>

> }<br>

><br>

><br>

> int main() {<br>

> pthread_t t[3];<br>

> pthread_create(&t[0], 0, Thread, 0);<br>

> pthread_create(&t[1], 0, Thread, 0);<br>

> pthread_create(&t[2], 0, Thread, 0);<br>

> for (long i = 0; i < N; i++)<br>

> bar(i);<br>

> pthread_join(t[0], 0);<br>

> pthread_join(t[1], 0);<br>

> pthread_join(t[2], 0);<br>

> }<br>

><br>

><br>

> ==> g.cc <==<br>

> int g;<br>

><br>

><br>

> int foo(int cond) {<br>

> if (cond)<br>

> return g;<br>

> // If we replace 'g' with g*g*g*g, the benchmark becomes 4x faster.<br>

> return 0;<br>

> }<br>

><br>

><br>

> void bar(int i) { g = i; }<br>

><br>

><br>

><br>

><br>

><br>

> % clang if-bench.cc g.cc -lpthread -O1 && time ./a.out<br>

><br>

><br>

><br>

><br>

> --kcc<br>

><br>

><br>

><br>

><br>

> If the two threads are running on different cores, then we have<br>

> likely introduced some false sharing and have forced the two cores<br>

> to exchange some bus traffic for cache coherency, even though is is<br>

> completely valid in the C++ abstract machine model for the load of g<br>

> to return a stale cache value. This is only a real problem on<br>

> strongly ordered architectures (e.g. x86), but given the relative<br>

> cost of a cache shootdown and everything else in this test case<br>

> (with the exception of the thread creation), I wouldn't be surprised<br>

> if it ended up slowing things down. Especially given that a modern<br>

> superscalar CPU will speculatively execute the load ANYWAY if it can<br>

> do so from cache, and if it can't then the performance improvement<br>

> from doing it before the branch will likely be negligible.<br>

><br>

> For single-core, in-order, single-issue architectures, or multicore,<br>

> weakly ordered, in-order, single-issue architectures, it's probably<br>

> a win...<br>

><br>

> David<br>

><br>

><br>

><br>

</div></div><div class=""><div class="h5">> _______________________________________________<br>

> LLVM Developers mailing list<br>

> <a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

><br>

<br>

--<br>

Hal Finkel<br>

Assistant Computational Scientist<br>

Leadership Computing Facility<br>

Argonne National Laboratory<br>

</div></div></blockquote></div><br></div></div>