[PATCH] Further relax the constraint on atomics in MemoryDependenceAnalysis.cpp

Robin Morisset morisset at google.com
Thu Aug 21 13:30:42 PDT 2014


Thank you very much for the careful reviews!

Answers to your comments are inline below.

================
Comment at: lib/Analysis/MemoryDependenceAnalysis.cpp:388
@@ -373,1 +387,3 @@
+  bool HasSeenAcquire = false;
+
   if (isLoad && QueryInst) {
----------------
Philip Reames wrote:
> I'm still not finding this explanation particularly clear.  I'm going to ask you not to commit this until you can justify why this approach is correct.  
> 
> A potentially more intuitive way to explain this is to show why this reordering is valid:
> store x = 0
> acquire operation
> release operation
> load x  <-- okay to use zero
> 
> This is valid because the first pair of operations and the second pair can both be reordered, i.e. the intermediate state is:
> acquire operation
> store x = 0
> load x  <-- now obviously zero
> release operation
> 
> What is "discriminating context"?  What does it mean to "clobber"?
> 
The reordering approach is not enough to explain this optimisation: in the example you gave, there is no way to hoist the store to x back above the acquire operation.

I will try to give a more detailed explanation below; if you find it clearer, I will put it in the comments.
In the following code:

```
store x = 0
release operation (1)
acquire operation (4)
%val = load x
```
it is not okay to replace %val with 0, because another thread may be doing:

```
acquire operation (2)
store x = 42
release operation (3)
```
And if the program ensures that (1) synchronizes with (2), and (3) with (4), then this code is correct and %val ends up being 42, not 0.

A key property of the above program is that if either (1) or (4) is absent, there is a race between the store x = 42 and either the original store or the subsequent load (so the whole program is undefined). It can be shown (mostly through an excruciatingly boring case analysis) that every such program in which the optimisation is visible either contains such a release-acquire pair for synchronisation between the threads or is racy.

("Discriminating context" means "any set of threads that would make this optimisation visible if they were running concurrently with this one". I should indeed remove the term, as it seems unclear.)
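For concreteness, here is a minimal C++11 sketch of the two-thread program above (the flag variables, spin loops and function names are illustrative assumptions, not part of the patch):

```
#include <atomic>
#include <cassert>
#include <thread>

int x = 0;                   // plain, non-atomic location
std::atomic<int> f1{0};      // carries the (1) -> (2) synchronisation
std::atomic<int> f2{0};      // carries the (3) -> (4) synchronisation

void thread1() {
  x = 0;                                              // store x = 0
  f1.store(1, std::memory_order_release);             // release operation (1)
  while (f2.load(std::memory_order_acquire) == 0) {}  // acquire operation (4)
  int val = x;  // must see 42, so replacing this load with 0 would be wrong
  assert(val == 42);
}

void thread2() {
  while (f1.load(std::memory_order_acquire) == 0) {}  // acquire operation (2)
  x = 42;                                             // store x = 42
  f2.store(1, std::memory_order_release);             // release operation (3)
}

int main() {
  std::thread t1(thread1), t2(thread2);
  t1.join();
  t2.join();
}
```

Removing either (1) or (4) breaks the corresponding synchronizes-with edge and leaves the accesses to x unsynchronised, which is exactly the race described above.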


================
Comment at: lib/Analysis/MemoryDependenceAnalysis.cpp:440
@@ -423,3 +439,3 @@
             return MemDepResult::getClobber(LI);
         if (auto *QuerySI = dyn_cast<StoreInst>(QueryInst))
           if (!QuerySI->isSimple())
----------------
Philip Reames wrote:
> This is a bit off topic for this review, but I may have spotted a bug here.  What if the query instruction is an RMW operation?  These can have ordering semantics, but the existing code (from your previous change) would say there's no dependence.
Indeed, thank you very much for finding this; it is incredible how hard these things are to get right. I will send another patch shortly fixing it.
I would suggest making isAtomic()/isSimple() methods of all instructions (just returning false/true respectively for the instructions that do not override them) and checking that instead. That way there would be much less risk of forgetting such a case in the future (there is also CmpXchg, for example). Does this sound reasonable?
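To make the suggestion concrete, here is a sketch of what a shared isAtomic() could look like (an illustration only, not the actual patch; it assumes a matching declaration is added to Instruction, and uses a switch on the opcode since LLVM generally avoids virtual dispatch on Instruction):

```
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

bool Instruction::isAtomic() const {
  switch (getOpcode()) {
  default:
    return false;
  case Instruction::AtomicCmpXchg:  // cmpxchg is always atomic
  case Instruction::AtomicRMW:      // atomicrmw is always atomic
    return true;
  case Instruction::Load:
    return cast<LoadInst>(this)->getOrdering() != NotAtomic;
  case Instruction::Store:
    return cast<StoreInst>(this)->getOrdering() != NotAtomic;
  }
}
```

With a matching isSimple() default, the query-instruction checks above could then be written once against QueryInst instead of dyn_cast'ing to each memory instruction kind separately.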

================
Comment at: lib/Analysis/MemoryDependenceAnalysis.cpp:443
@@ -426,1 +442,3 @@
             return MemDepResult::getClobber(LI);
+        if (isAtLeastAcquire(LI->getOrdering()))
+          HasSeenAcquire = true;
----------------
Philip Reames wrote:
> I think this check is wrong.  (Independent of whether the approach is valid or not.)
> 
> My reasoning is that a cst operation is "at least" an acquire.  However, it's _also_ a release and reordering these two is not legal:
> load cst y
> store x = 0
> 
> This would be a violation of a "release operation" ordering as used by C++11.  It's unclear whether that would violate the LLVM spec as written, but I think it probably should.  
> 
> Note that if my reasoning is sound, this is actually a problem with the previous patch as well, not just this one.  
> 
I am pretty sure a cst operation only behaves as a release if it stores something (and as an acquire only if it loads something).

For example, section 29.3 of the C++11 standard starts by defining the memory orders in this way:
[..] and memory_order_seq_cst: a store operation performs a release operation on the affected memory location.
This also holds in the formalisation of Batty et al.: a synchronizes-with edge can only go from a store or fence to a load or fence [http://www.cl.cam.ac.uk/~pes20/cpp/].

So for this purpose, seq_cst loads do behave mostly like acquire loads (the difference is the extra total order over all seq_cst operations, which is irrelevant to the reasoning in this patch).

================
Comment at: test/Transforms/DeadStoreElimination/atomic.ll:149
@@ +148,3 @@
+; CHECK: store i32 1
+  store i32 0, i32* @x
+  %x = load atomic i32* @y seq_cst, align 4
----------------
Philip Reames wrote:
> As currently specified in the LangRef, I don't think this is legal.  If I'm reading the spec right, the load must be considered both an Acquire and a Release.  (This is possibly not intended, but seems to follow from the wording.)  As a result, this is not an acquire/release pair, but both an acquire/release and a release/acquire pair.
> 
> Finding a case where this is observable would be hard, but it seems possible.  If need be, I'll spend some time thinking about it.  
In the LangRef, just under the first occurrence of memory_order_seq_cst, I see:
"In addition to the guarantees of acq_rel (acquire for an operation which only reads, release for an operation which only writes)"
It seems pretty clear to me that it follows the C++11 standard, and that a load cannot be considered a release, even if it is seq_cst.
If you see a case where this optimisation would be observable, I am **extremely** interested in it.
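In C++11 terms, the test reduces to roughly this sketch (the variable and function names are invented for illustration):

```
#include <atomic>

int x;                // plays the role of @x
std::atomic<int> y;   // plays the role of @y

int dse_candidate() {
  x = 0;                                      // the dead store
  int v = y.load(std::memory_order_seq_cst);  // acquire-only, per C++11 29.3
  x = 1;                                      // the overwriting store
  return v;
}
```

Since the seq_cst load performs no release operation, nothing between the two stores can publish x = 0 to another thread, so eliminating the first store is invisible to race-free programs.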

================
Comment at: test/Transforms/DeadStoreElimination/atomic.ll:163
@@ +162,3 @@
+  store atomic i32 0, i32* @y release, align 4
+  %x = load atomic i32* @y acquire, align 4
+  store i32 1, i32* @x
----------------
JF Bastien wrote:
> Why not also test load-acq followed by store-rel here and in other places (or the reverse store/load)? It seems like a good sanity check.
I agree. That was intended to be covered by the previous test, but it cannot hurt to add another one; I will do it.

================
Comment at: test/Transforms/GVN/atomic.ll:46
@@ -45,3 +45,3 @@
 
-; GVN across acquire load (load after atomic load must not be removed)
+; GVN across acquire load (allowed as the original load was not atomic)
 define i32 @test4() nounwind uwtable ssp {
----------------
Philip Reames wrote:
> I don't believe this is valid.
For this optimisation to be invalid, @y would have to change between the two loads.
Such a store to @y would necessarily race with the first load (the one producing %x) and make the whole program undefined.
So my argument is that the transformation is correct because it can only be observed by racy programs.
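In C++11 terms, the test is roughly the following sketch (names invented for illustration):

```
#include <atomic>

int y;                // plays the role of @y
std::atomic<int> x;   // plays the role of @x

int test4_like() {
  int a = y;                                 // first plain load of @y
  (void)x.load(std::memory_order_acquire);   // acquire load of @x
  int b = y;                                 // GVN may replace this with 'a'
  return a + b;
}
```

A store to y that takes effect between the two loads necessarily races with the load producing 'a', so a race-free program cannot observe b != a.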

================
Comment at: test/Transforms/GVN/atomic.ll:110
@@ -85,1 +109,3 @@
+define i32 @test9() nounwind uwtable ssp {
+; CHECK: test9
 ; CHECK: add i32 %x, %x
----------------
Philip Reames wrote:
> CHECK-LABEL please
Sure, sorry about forgetting that.

================
Comment at: test/Transforms/GVN/atomic.ll:133
@@ -106,1 +132,2 @@
 
+
----------------
JF Bastien wrote:
> Drop the extra space.
Yes, I didn't notice it.

http://reviews.llvm.org/D4845