[PATCH] Further relax the constraint on atomics in MemoryDependenceAnalysis.cpp

Philip Reames listmail at philipreames.com
Thu Aug 21 17:18:47 PDT 2014


Comments inline.

================
Comment at: lib/Analysis/MemoryDependenceAnalysis.cpp:388
@@ -373,1 +387,3 @@
+  bool HasSeenAcquire = false;
+
   if (isLoad && QueryInst) {
----------------
Robin Morisset wrote:
> Philip Reames wrote:
> > I'm still not finding this explanation particularly clear.  I'm going to ask you not to commit this until you can justify why this approach is correct.  
> > 
> > A potentially more intuitive way to explain this is to explain why this reordering is valid:
> > store x = 0
> > acquire operation
> > release operation
> > load x  <-- okay to use zero
> > 
> > This is valid because the first pair of operations and the second pair can both be reordered.  i.e. the intermediate state is:
> > acquire operation
> > store x = 0
> > load x  <-- now obviously zero
> > release operation
> > 
> > What is "discriminating context"?  What does it mean to "clobber"?
> > 
> The reordering approach is not enough to explain this optimisation, as there is no way of hoisting the store of x back above the acquire operation in the example you gave.
> 
> I will try to give a more detailed explanation below, if you find it clearer I will put it in the comments.
> In the following code:
> 
> ```
> store x = 0
> release operation (1)
> acquire operation (4)
> %val = load x
> ```
> it is not okay to replace %val by 0, because another thread may be doing:
> 
> ```
> acquire operation (2)
> store x = 42
> release operation (3)
> ```
> And if the program ensures that 1 synchronizes with 2, and 3 with 4, then this code is correct and %val ends up being 42, not 0.
> 
> A key property of the above program is that if either (1) or (4) were absent, there would be a race between the store x = 42 and either the original store or the subsequent load (so the whole program would be undefined). It can be shown (mostly through an excruciatingly boring case analysis) that every such program where the optimisation is visible either needs such a release-acquire pair for synchronisation between the threads or is racy.
> 
> (Discriminating context means "any number of threads which would make this optimisation visible if they were running concurrently with that one". I should indeed remove it as it seems unclear).
> 
I do find your explanation clearer.  Putting that in a comment or documentation would be a good idea.  It gives the intuition for the approach.  

I'd avoid the term "discriminating context".  It's extra jargon with no real gain.

FYI, when I think about the legality of an optimization in terms of reordering, I tend to think "if I did this, could I do that?"  It doesn't necessarily mean that I actually need to perform the reordering.  It simply means that I *could*, and thus the optimization is legal.  Now, I might potentially get myself in trouble by assuming two conflicting reorderings, but I've never actually run into that (yet).  My experience is that if I can't find some series of valid reorderings that could enable an optimization, it usually isn't actually correct.  :)

================
Comment at: lib/Analysis/MemoryDependenceAnalysis.cpp:440
@@ -423,3 +439,3 @@
             return MemDepResult::getClobber(LI);
         if (auto *QuerySI = dyn_cast<StoreInst>(QueryInst))
           if (!QuerySI->isSimple())
----------------
Robin Morisset wrote:
> Philip Reames wrote:
> > This is a bit off topic for this review, but I may have spotted a bug here.  What if the query instruction is a RMW operation?  These can have ordering semantics, but the existing code (from your previous change) would say there's no dependence.  
> Indeed, thank you very much for finding this. It is incredible how hard these things are to get right. I will send another patch shortly fixing it.
> I would suggest making isAtomic()/isSimple() methods of all instructions (just returning false/true respectively for the instructions that do not override them) and checking that. That way there would be much less risk of forgetting such a case in the future (there is also CmpXchg, for example). Does this sound reasonable?
This approach seems reasonable to me.  

Alternately, you could use mayReadFromMemory() as your generic fallback check.  

================
Comment at: lib/Analysis/MemoryDependenceAnalysis.cpp:443
@@ -426,1 +442,3 @@
             return MemDepResult::getClobber(LI);
+        if (isAtLeastAcquire(LI->getOrdering()))
+          HasSeenAcquire = true;
----------------
Robin Morisset wrote:
> Philip Reames wrote:
> > I think this check is wrong.  (Independent of whether the approach is valid or not.)
> > 
> > My reasoning is that a seq_cst operation is "at least" an acquire.  However, it's _also_ a release, and reordering these two is not legal:
> > load seq_cst y
> > store x = 0
> > 
> > This would be a violation of a "release operation" ordering as used by C++11.  It's unclear whether that would violate the LLVM spec as written, but I think it probably should.  
> > 
> > Note that if my reasoning is sound, this is actually a problem with the previous patch as well, not just this one.  
> > 
> I am pretty sure a seq_cst operation only behaves as a release if it stores something (and as an acquire only if it loads something).
> 
> For example, section 29.3 starts by defining the memory orders in this way:
> "[..] and memory_order_seq_cst: a store operation performs a release operation on the affected memory location."
> This is also true in the formalisation of Batty et al.: a synchronizes-with edge can only go from a store or fence to a load or fence [http://www.cl.cam.ac.uk/~pes20/cpp/]
> 
> So for this purpose, seq_cst loads do behave mostly like acquire loads (the difference is the extra total order on all seq_cst operations, which is irrelevant for the reasoning in this patch).
You're right.  I went and checked the spec and my interpretation was incorrect.

================
Comment at: test/Transforms/DeadStoreElimination/atomic.ll:149
@@ +148,3 @@
+; CHECK: store i32 1
+  store i32 0, i32* @x
+  %x = load atomic i32* @y seq_cst, align 4
----------------
Robin Morisset wrote:
> Philip Reames wrote:
> > As currently specified in the LangRef, I don't think this is legal.  If I'm reading the spec right, the load must be considered both an Acquire and a Release.  (This is possibly not intended, but seems to follow from the wording.)  As a result, this is not an acquire/release pair, but both an acquire/release and a release/acquire pair.  
> > 
> > Finding a case where this is observable would be hard, but it seems possible.  If need be, I'll spend some time thinking about it.  
> In the LangRef, just under the first occurrence of memory_order_seq_cst I see:
> "In addition to the guarantees of acq_rel (acquire for an operation which only reads, release for an operation which only writes)"
> It seems pretty clear to me that it follows the C++11 standard, and that a load cannot be considered a release, even if it is seq_cst.
> If you see a case where this optimisation would be observable, I am **extremely** interested in it.
Thanks for pointing out that wording.  I was looking in acq_rel.  Any chance you could move that to be under acq_rel?  It seems odd to have a later ordering define an earlier one.

For this particular example, it's only valid because we know @x and @y are distinct locations.  If we didn't know that, we couldn't remove the first store without changing %x, and thus the observable value.  

================
Comment at: test/Transforms/GVN/atomic.ll:46
@@ -45,3 +45,3 @@
 
-; GVN across acquire load (load after atomic load must not be removed)
+; GVN across acquire load (allowed as the original load was not atomic)
 define i32 @test4() nounwind uwtable ssp {
----------------
Robin Morisset wrote:
> Philip Reames wrote:
> > I don't believe this is valid.
> For this optimisation to be invalid, @y would have to change between the two loads.
> Such a store to @y would necessarily race with the first load (to %x) and make the whole program undefined.
> So my argument is that this is correct because it can only be observed by racy programs.
I had to think about this one a bit.  :)  I can't find an argument to refute your racy-program one, but I find that answer disturbing.  From a practical software engineering perspective, the answer "oh, we corrupted all of the runtime state in hard-to-explain ways because you had one race somewhere" is a bit hard to swallow.  I will admit it's what the C++11 standard says, though.  

I'm curious, do you believe the answer changes if both loads are monotonic?  If not, this would seem to imply that LLVM can't implement the Java memory model correctly.  The JMM does specify limited semantics for racy programs.

http://reviews.llvm.org/D4845
