[LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)

Hal Finkel hfinkel at anl.gov
Wed Nov 20 23:14:40 PST 2013


----- Original Message -----
> From: "Kostya Serebryany" <kcc at google.com>
> To: "David Chisnall" <David.Chisnall at cl.cam.ac.uk>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Tuesday, November 19, 2013 11:01:12 PM
> Subject: Re: [LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with
> MemorySanitizer)
> 
> On Tue, Nov 19, 2013 at 10:20 PM, David Chisnall <
> David.Chisnall at cl.cam.ac.uk > wrote:
> 
> On 19 Nov 2013, at 17:58, Kostya Serebryany < kcc at google.com > wrote:
> 
> > On Tue, Nov 19, 2013 at 9:56 PM, Kuperstein, Michael M <
> > michael.m.kuperstein at intel.com > wrote:
> >> What I’m trying to say is that according to my understanding of
> >> the C++11 memory model, even in that small reproducer, the store
> >> to g and the load from g are in fact a data race.
> >> 
> >> (This is regardless of the fact the load is protected by a branch
> >> that is not taken.)
> > 
> > My understanding of the standard is quite the opposite.
> 
> I agree. It is a potential data race: if the branch is taken, then it
> definitely is a race, because the standard explicitly does not define
> an ordering on loads and stores of non-atomic variables unless some
> happens-before relationship is established by some other operation
> (which does not occur in this example). In the case where the branch
> is not taken, there is no data race, because in the C++ abstract
> machine there is no load, whether or not there is one in the underlying
> concrete machine.
> 
> I believe that this transform is valid, because there are only two
> possible cases here:
> 
> - Some other code has established a happens-before relationship
> between the load and the stores, giving a well-defined ordering;
> however, this code is not visible in the function foo(), so the
> load may happen anywhere.
> 
> - No other code establishes a happens-before relationship, so the
> order of the load and the store is completely undefined, and as long
> as the load doesn't trap it is completely safe to hoist it.
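For concreteness, the transform under discussion can be sketched as follows. This is a hand-written illustration (not actual compiler output) of a foo() like the one in the g.cc example later in the thread: the load of g is hoisted above the branch and executed speculatively.

```cpp
#include <cassert>

int g;  // plain non-atomic global, as in g.cc later in this thread

// Original source: the load of g happens only when cond != 0.
int foo(int cond) {
    if (cond)
        return g;
    return 0;
}

// What the transform effectively produces: the load is hoisted above
// the branch and executed even when cond == 0, and the branch is
// folded into a select. (Illustrative rewrite, not compiler output.)
int foo_hoisted(int cond) {
    int tmp = g;            // speculative load
    return cond ? tmp : 0;  // branch folded into a select
}
```

Single-threaded, the two versions are observably identical; the difference only matters under concurrent stores to g.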
> 
> HOWEVER, I would dispute describing this transform as an optimisation
> in this case.
> 
> You have a good point: evaluating such a transformation on
> single-threaded benchmarks (e.g. SPEC) makes no sense
> if we care about multi-threaded code. And evaluating the performance of
> anything like this on multi-threaded code is hard too.
> And it's very easy to prepare a synthetic test case where this
> optimization causes a 4x slowdown (see below).

Out of curiosity, can you explain why? Is this cache-line bouncing? Or just an increase in overall memory-bandwidth use?

 -Hal

> But if we disable this transformation someone surely will come and
> complain :)
> Another data point: gcc 4.6.3 does this too.
> 
> ==> if-bench.cc <==
> #include <pthread.h>
> #include <unistd.h>
> 
> extern int foo(int);
> extern void bar(int);
> 
> #ifndef N
> #define N (1UL << 30)
> #endif
> 
> void *Thread(void *p) {
>   for (long i = 0; i < N; i++)
>     foo(0);
>   return 0;
> }
> 
> int main() {
>   pthread_t t[3];
>   pthread_create(&t[0], 0, Thread, 0);
>   pthread_create(&t[1], 0, Thread, 0);
>   pthread_create(&t[2], 0, Thread, 0);
>   for (long i = 0; i < N; i++)
>     bar(i);
>   pthread_join(t[0], 0);
>   pthread_join(t[1], 0);
>   pthread_join(t[2], 0);
> }
> 
> ==> g.cc <==
> int g;
> 
> int foo(int cond) {
>   if (cond)
>     return g;
>   // If we replace 'g' with g*g*g*g, the benchmark becomes 4x faster.
>   return 0;
> }
> 
> void bar(int i) { g = i; }
> 
> % clang if-bench.cc g.cc -lpthread -O1 && time ./a.out
> 
> --kcc
> 
> If the two threads are running on different cores, then we have
> likely introduced some false sharing and forced the two cores
> to exchange bus traffic for cache coherency, even though it is
> completely valid in the C++ abstract machine model for the load of g
> to return a stale cached value. This is only a real problem on
> strongly ordered architectures (e.g. x86), but given the relative
> cost of a cache shootdown compared to everything else in this test case
> (with the exception of the thread creation), I wouldn't be surprised
> if it ended up slowing things down. Especially given that a modern
> superscalar CPU will speculatively execute the load ANYWAY if it can
> do so from cache; and if it can't, then the performance improvement
> from doing it before the branch will likely be negligible.
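As an aside, one way to make such a test case well-defined, sketched here as an assumption rather than anything proposed in the thread, is to declare g as std::atomic<int> with relaxed ordering: the load is then no longer a data race, and in practice compilers will not speculatively hoist an atomic load above the branch.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical variant of g.cc using a relaxed atomic (an assumption,
// not code from the thread): concurrent loads and stores of g are now
// well-defined, and compilers do not hoist the atomic load above the
// branch, so the transform under discussion does not fire.
std::atomic<int> g{0};

int foo(int cond) {
    if (cond)
        return g.load(std::memory_order_relaxed);
    return 0;
}

void bar(int i) { g.store(i, std::memory_order_relaxed); }
```

Relaxed ordering keeps the per-access cost close to a plain load/store on most architectures while still removing the undefined behavior.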
> 
> For single-core, in-order, single-issue architectures, or multicore,
> weakly ordered, in-order, single-issue architectures, it's probably
> a win...
> 
> David
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory



