[LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)
David Chisnall
David.Chisnall at cl.cam.ac.uk
Tue Nov 19 10:20:42 PST 2013
On 19 Nov 2013, at 17:58, Kostya Serebryany <kcc at google.com> wrote:
> On Tue, Nov 19, 2013 at 9:56 PM, Kuperstein, Michael M <michael.m.kuperstein at intel.com> wrote:
>> What I’m trying to say is that according to my understanding of the C++11 memory model, even in that small reproducer, the store to g and the load from g are in fact a data race.
>>
>> (This is regardless of the fact the load is protected by a branch that is not taken.)
>
> My understanding of the standard is quite the opposite.
I agree. It is a potential data race: if the branch is taken, then it definitely is a race, because the standard explicitly does not define an ordering on loads and stores of non-atomic variables unless a happens-before relationship is established by some other operation (which does not occur in this example). In the case where the branch is not taken, there is no data race, because in the C++ abstract machine there is no load, whether or not there is one in the underlying concrete machine.
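For reference, here is a minimal sketch of the kind of reproducer I understand is under discussion (the names g, flag and foo are my own; the original test case may well differ in detail):

#include <thread>

int g;        // plain (non-atomic) global written by one thread
bool flag;    // set before the threads start, never changed afterwards

int foo() {
  int result = 0;
  if (flag)       // branch that is never taken in the racy run
    result = g;   // in the C++ abstract machine this load only exists if flag is true
  return result;
}

int main() {
  flag = false;
  std::thread writer([] { g = 42; });  // plain store to g in another thread
  std::thread reader([] { foo(); });
  writer.join();
  reader.join();
}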
I believe that this transform is valid, because there are only two possible cases here:
- Some other code has established a happens-before relationship between the load and the stores, giving a well-defined ordering; however, this code is not visible in the function foo(), and so the load may happen anywhere.
- No other code establishes a happens-before relationship, and therefore the order of the load and the store is completely undefined; as long as the load doesn't trap, it is completely safe to hoist it (a sketch of the hoisting is below).
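To make the hoisting concrete, here is roughly what the transform looks like when expressed back at the source level, reusing my hypothetical foo() and g from the sketch above (the actual transform happens on LLVM IR during branch folding, not on C++ source):

extern int g;      // same globals as in the sketch above
extern bool flag;

// Before the transform: the load of g is guarded by the branch.
int foo_original() {
  int result = 0;
  if (flag)
    result = g;
  return result;
}

// After the transform: the load is speculated above the branch and a
// select (cmov) picks the value. This unconditional load is what
// MemorySanitizer observes.
int foo_hoisted() {
  int tmp = g;              // load executes even when flag is false
  return flag ? tmp : 0;    // branch replaced by a select
}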
HOWEVER, I would dispute describing this transform as an optimisation in this case. If the two threads are running on different cores, then we have likely introduced some false sharing and have forced the two cores to exchange some bus traffic for cache coherency, even though it is completely valid in the C++ abstract machine model for the load of g to return a stale cached value. This is only a real problem on strongly ordered architectures (e.g. x86), but given the relative cost of a cache shootdown compared with everything else in this test case (with the exception of the thread creation), I wouldn't be surprised if it ended up slowing things down. This is especially likely given that a modern superscalar CPU will speculatively execute the load ANYWAY if it can do so from cache, and if it can't then the performance improvement from doing the load before the branch will likely be negligible.
For single-core, in-order, single-issue architectures, or multicore, weakly ordered, in-order, single-issue architectures, it's probably a win...
David