[LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)
    Kostya Serebryany 
    kcc at google.com
       
    Wed Nov 20 23:38:15 PST 2013
    
    
  
On Thu, Nov 21, 2013 at 11:14 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
> > From: "Kostya Serebryany" <kcc at google.com>
> > To: "David Chisnall" <David.Chisnall at cl.cam.ac.uk>
> > Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> > Sent: Tuesday, November 19, 2013 11:01:12 PM
> > Subject: Re: [LLVMdev] Curiosity about transform changes under
> Sanitizers (Was: [PATCH] Disable branch folding with
> > MemorySanitizer)
> >
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Nov 19, 2013 at 10:20 PM, David Chisnall <
> > David.Chisnall at cl.cam.ac.uk > wrote:
> >
> >
> >
> > On 19 Nov 2013, at 17:58, Kostya Serebryany < kcc at google.com > wrote:
> >
> > > On Tue, Nov 19, 2013 at 9:56 PM, Kuperstein, Michael M <
> > > michael.m.kuperstein at intel.com > wrote:
> > >> What I’m trying to say is that according to my understanding of
> > >> the C++11 memory model, even in that small reproducer, the store
> > >> to g and the load from g are in fact a data race.
> > >>
> > >> (This is regardless of the fact the load is protected by a branch
> > >> that is not taken.)
> > >
> > > My understanding of the standard is quite the opposite.
> >
> > I agree. It is a potential data race, if the branch is taken then it
> > definitely is a race because the standard explicitly does not define
> > an ordering on loads and stores of non-atomic variables unless some
> > happens-before relationship is established by some other operation
> > (which does not occur in this example). In the case where the branch
> > is not taken, then there is no data race because in the C++ abstract
> > machine there is no load, whether there is one in the underlying
> > concrete machine or not.
> >
> > I believe that this transform is valid, because there are only two
> > possible cases here:
> >
> > - Some other code has established a happens-before relationship
> > between the load and the stores, giving a well-defined ordering,
> > however this code is not visible in the function foo() and so the
> > load may happen anywhere.
> >
> > - No other code establishes a happens-before relationship, and
> > therefore the order of the load and the store is completely
> > undefined and as long as the load doesn't trap it is completely safe
> > to hoist it.
> >
> > HOWEVER, I would dispute describing this transform as an optimisation
> > in this case.
> >
> >
> > You have a good point: evaluating such transformation on
> > single-threaded benchmarks (e.g. SPEC) makes no sense
> > if we care about multi-threaded code. And evaluating performance of
> > anything like this on a multi-threaded code is hard too.
> > And it's very easy to prepare a synthetic test case where this
> > optimization will cause 4x slowdown (see below).
>
> Out of curiosity, can you explain why? Is this cache-line bouncing? Or
> just an increase in overall memory-bandwidth use?
>
Exactly, this is cache-line bouncing
>
>  -Hal
>
> > But if we disable this transformation someone surely will come and
> > complain :)
> > Another data point: gcc 4.6.3 does this too.
> >
> >
> >
> >
> >
> > ==> if-bench.cc <==
> > #include <pthread.h>
> > #include <unistd.h>
> >
> >
> > extern int foo(int);
> > extern void bar(int);
> >
> >
> > #ifndef N
> > #define N (1UL << 30)
> > #endif
> >
> >
> > void *Thread(void *p) {
> > for (long i = 0; i < N; i++)
> > foo(0);
> > return 0;
> > }
> >
> >
> > int main() {
> > pthread_t t[3];
> > pthread_create(&t[0], 0, Thread, 0);
> > pthread_create(&t[1], 0, Thread, 0);
> > pthread_create(&t[2], 0, Thread, 0);
> > for (long i = 0; i < N; i++)
> > bar(i);
> > pthread_join(t[0], 0);
> > pthread_join(t[1], 0);
> > pthread_join(t[2], 0);
> > }
> >
> >
> > ==> g.cc <==
> > int g;
> >
> >
> > int foo(int cond) {
> > if (cond)
> > return g;
> > // If we replace 'g' with g*g*g*g, the benchmark becomes 4x faster.
> > return 0;
> > }
> >
> >
> > void bar(int i) { g = i; }
> >
> >
> >
> >
> >
> > % clang if-bench.cc g.cc -lpthread -O1 && time ./a.out
> >
> >
> >
> >
> > --kcc
> >
> >
> >
> >
> > If the two threads are running on different cores, then we have
> > likely introduced some false sharing and have forced the two cores
> > to exchange some bus traffic for cache coherency, even though is is
> > completely valid in the C++ abstract machine model for the load of g
> > to return a stale cache value. This is only a real problem on
> > strongly ordered architectures (e.g. x86), but given the relative
> > cost of a cache shootdown and everything else in this test case
> > (with the exception of the thread creation), I wouldn't be surprised
> > if it ended up slowing things down. Especially given that a modern
> > superscalar CPU will speculatively execute the load ANYWAY if it can
> > do so from cache, and if it can't then the performance improvement
> > from doing it before the branch will likely be negligible.
> >
> > For single-core, in-order, single-issue architectures, or multicore,
> > weakly ordered, in-order, single-issue architectures, it's probably
> > a win...
> >
> > David
> >
> >
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131121/4f70b5b3/attachment.html>
    
    
More information about the llvm-dev
mailing list