[LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)

Kostya Serebryany kcc at google.com
Tue Nov 19 21:01:12 PST 2013


On Tue, Nov 19, 2013 at 10:20 PM, David Chisnall <
David.Chisnall at cl.cam.ac.uk> wrote:

> On 19 Nov 2013, at 17:58, Kostya Serebryany <kcc at google.com> wrote:
>
> > On Tue, Nov 19, 2013 at 9:56 PM, Kuperstein, Michael M <
> michael.m.kuperstein at intel.com> wrote:
> >> What I’m trying to say is that according to my understanding of the
> C++11 memory model, even in that small reproducer, the store to g and the
> load from g are in fact a data race.
> >>
> >> (This is regardless of the fact the load is protected by a branch that
> is not taken.)
> >
> > My understanding of the standard is quite the opposite.
>
> I agree.  It is a potential data race, if the branch is taken then it
> definitely is a race because the standard explicitly does not define an
> ordering on loads and stores of non-atomic variables unless some
> happens-before relationship is established by some other operation (which
> does not occur in this example).  In the case where the branch is not
> taken, then there is no data race because in the C++ abstract machine there
> is no load, whether there is one in the underlying concrete machine or not.
>
> I believe that this transform is valid, because there are only two
> possible cases here:
>
> - Some other code has established a happens-before relationship between
> the load and the stores, giving a well-defined ordering, however this code
> is not visible in the function foo() and so the load may happen anywhere.
>
> - No other code establishes a happens-before relationship, and therefore
> the order of the load and the store is completely undefined and as long as
> the load doesn't trap it is completely safe to hoist it.
>
> HOWEVER, I would dispute describing this transform as an optimisation in
> this case.


You have a good point: evaluating such transformation on single-threaded
benchmarks (e.g. SPEC) makes no sense
if we care about multi-threaded code. And evaluating performance of
anything like this on a multi-threaded code is hard too.
And it's very easy to prepare a synthetic test case where this optimization
will cause 4x slowdown (see below).
But if we disable this transformation someone surely will come and complain
:)
Another data point: gcc 4.6.3 does this too.


==> if-bench.cc <==
#include <pthread.h>
#include <unistd.h>

extern int foo(int);
extern void bar(int);

#ifndef N
#define N (1UL << 30)
#endif

void *Thread(void *p) {
  for (long i = 0; i < N; i++)
     foo(0);
  return 0;
}

int main() {
  pthread_t t[3];
  pthread_create(&t[0], 0, Thread, 0);
  pthread_create(&t[1], 0, Thread, 0);
  pthread_create(&t[2], 0, Thread, 0);
  for (long i = 0; i < N; i++)
    bar(i);
  pthread_join(t[0], 0);
  pthread_join(t[1], 0);
  pthread_join(t[2], 0);
}

==> g.cc <==
int g;

int foo(int cond) {
  if (cond)
    return g;
  // If we replace 'g' with g*g*g*g, the benchmark becomes 4x faster.
  return 0;
}

void bar(int i) {  g = i; }


% clang if-bench.cc  g.cc  -lpthread -O1  && time ./a.out


--kcc



>  If the two threads are running on different cores, then we have likely
> introduced some false sharing and have forced the two cores to exchange
> some bus traffic for cache coherency, even though is is completely valid in
> the C++ abstract machine model for the load of g to return a stale cache
> value.  This is only a real problem on strongly ordered  architectures
> (e.g. x86), but given the relative cost of a cache shootdown and everything
> else in this test case (with the exception of the thread creation), I
> wouldn't be surprised if it ended up slowing things down.  Especially given
> that a modern superscalar CPU will speculatively execute the load ANYWAY if
> it can do so from cache, and if it can't then the performance improvement
> from doing it before the branch will likely be negligible.
>
> For single-core, in-order, single-issue architectures, or multicore,
> weakly ordered, in-order, single-issue architectures, it's probably a win...
>
> David
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131120/c953f891/attachment.html>


More information about the llvm-dev mailing list