[LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)

Kostya Serebryany kcc at google.com
Wed Nov 20 03:04:39 PST 2013


FTR: http://llvm-reviews.chandlerc.com/D2227 sent for review to fix the
tsan-hostile load speculation.

--kcc
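
For context, a minimal sketch of the kind of load speculation under discussion (illustrative code, not taken from the patch itself): branch folding rewrites a conditional load into an unconditional one whose result is selected by the branch condition.

```cpp
#include <cassert>

static int g;  // plain, non-atomic global, as in the g.cc example below

// Before the transform: g is loaded only when the branch is taken.
int foo_before(int cond) {
  if (cond)
    return g;
  return 0;
}

// After the transform (sketch): the load is hoisted above the branch and
// the result is chosen with a conditional-move-style select.
int foo_after(int cond) {
  int speculated = g;            // load executes even when cond == 0
  return cond ? speculated : 0;
}
```

Single-threaded, the two versions always agree, which is why the transform is legal in the C++ abstract machine; the objection is to the extra load that now executes on the not-taken path, which is exactly what TSan observes as a racy access.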


On Wed, Nov 20, 2013 at 9:01 AM, Kostya Serebryany <kcc at google.com> wrote:

>
>
>
> On Tue, Nov 19, 2013 at 10:20 PM, David Chisnall <
> David.Chisnall at cl.cam.ac.uk> wrote:
>
>> On 19 Nov 2013, at 17:58, Kostya Serebryany <kcc at google.com> wrote:
>>
>> > On Tue, Nov 19, 2013 at 9:56 PM, Kuperstein, Michael M <
>> michael.m.kuperstein at intel.com> wrote:
>> >> What I’m trying to say is that according to my understanding of the
>> C++11 memory model, even in that small reproducer, the store to g and the
>> load from g are in fact a data race.
>> >>
>> >> (This is regardless of the fact the load is protected by a branch that
>> is not taken.)
>> >
>> > My understanding of the standard is quite the opposite.
>>
>> I agree.  It is a potential data race: if the branch is taken, then it
>> definitely is a race, because the standard explicitly does not define an
>> ordering on loads and stores of non-atomic variables unless a
>> happens-before relationship is established by some other operation (which
>> does not occur in this example).  In the case where the branch is not
>> taken, there is no data race, because in the C++ abstract machine there
>> is no load, whether or not there is one in the underlying concrete machine.
>>
>> I believe that this transform is valid, because there are only two
>> possible cases here:
>>
>> - Some other code has established a happens-before relationship between
>> the load and the stores, giving a well-defined ordering; however, this
>> code is not visible in the function foo(), so the load may happen anywhere.
>>
>> - No other code establishes a happens-before relationship, so the order
>> of the load and the store is completely undefined, and as long as the
>> load doesn't trap it is completely safe to hoist it.
>>
>> HOWEVER, I would dispute describing this transform as an optimisation in
>> this case.
>
>
> You have a good point: evaluating such a transformation on single-threaded
> benchmarks (e.g. SPEC) makes no sense
> if we care about multi-threaded code, and evaluating the performance of
> anything like this on multi-threaded code is hard too.
> It is also very easy to prepare a synthetic test case where this
> optimization causes a 4x slowdown (see below).
> But if we disable this transformation, someone will surely come and
> complain :)
> Another data point: gcc 4.6.3 does this too.
>
>
> ==> if-bench.cc <==
> #include <pthread.h>
> #include <unistd.h>
>
> extern int foo(int);
> extern void bar(int);
>
> #ifndef N
> #define N (1UL << 30)
> #endif
>
> void *Thread(void *p) {
>   for (long i = 0; i < N; i++)
>      foo(0);
>   return 0;
> }
>
> int main() {
>   pthread_t t[3];
>   pthread_create(&t[0], 0, Thread, 0);
>   pthread_create(&t[1], 0, Thread, 0);
>   pthread_create(&t[2], 0, Thread, 0);
>   for (long i = 0; i < N; i++)
>     bar(i);
>   pthread_join(t[0], 0);
>   pthread_join(t[1], 0);
>   pthread_join(t[2], 0);
> }
>
> ==> g.cc <==
> int g;
>
> int foo(int cond) {
>   if (cond)
>     return g;
>   // If we replace 'g' with g*g*g*g, the benchmark becomes 4x faster.
>   return 0;
> }
>
> void bar(int i) {  g = i; }
>
>
> % clang if-bench.cc  g.cc  -lpthread -O1  && time ./a.out
>
>
> --kcc
>
>
>
>>  If the two threads are running on different cores, then we have likely
>> introduced some false sharing and have forced the two cores to exchange
>> some bus traffic for cache coherency, even though it is completely valid in
>> the C++ abstract machine model for the load of g to return a stale cache
>> value.  This is only a real problem on strongly ordered architectures
>> (e.g. x86), but given the relative cost of a cache shootdown and everything
>> else in this test case (with the exception of the thread creation), I
>> wouldn't be surprised if it ended up slowing things down.  Especially given
>> that a modern superscalar CPU will speculatively execute the load ANYWAY if
>> it can do so from cache, and if it can't then the performance improvement
>> from doing it before the branch will likely be negligible.
>>
>> For single-core, in-order, single-issue architectures, or multicore,
>> weakly ordered, in-order, single-issue architectures, it's probably a win...
>>
>> David
>>
>>
>
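
A sketch of how the racy example could be given well-defined behavior under the C++11 memory model (an adaptation of the g.cc example above, not part of the original thread): declaring g atomic removes the data race, and compilers do not in practice speculate atomic loads above the branch.

```cpp
#include <atomic>

std::atomic<int> g{0};

// A relaxed load is the cheapest ordering that still removes the data
// race on g; the branch-protected load is no longer a TSan report.
int foo(int cond) {
  if (cond)
    return g.load(std::memory_order_relaxed);
  return 0;
}

void bar(int i) { g.store(i, std::memory_order_relaxed); }
```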