[PATCH] D11434: Allow value forwarding past release fences in EarlyCSE

Thu Jul 23 15:47:40 PDT 2015

reames added inline comments.

================
Comment at: lib/Transforms/Scalar/EarlyCSE.cpp:635
@@ -623,3 +634,3 @@
         // We do a trivial form of DSE if there are two stores to the same
         // location with no intervening loads.  Delete the earlier store.
         if (LastStore) {
----------------
jfb wrote:
> I'm not sure I understand why this isn't firing when there's an intervening release fence (`test4` below). I think it could: there was a race anyways, so screw the code and keep one of the store (storing the last value, of course)? Another thread is guaranteed to observe a store to that memory location, but they're racing with each other so the first one isn't guaranteed to ever be observed.
> 
> The second store is the more hilarious one to keep, but you could just store its value before the fence?
> 
> e.g.:
> ```
> *loc = 42;
> std::atomic_thread_fence(std::memory_order_release);
> *loc = 0;
> ```
> Becomes:
> ```
> std::atomic_thread_fence(std::memory_order_release);
> *loc = 0;
> ```
> Or:
> ```
> *loc = 0;
> std::atomic_thread_fence(std::memory_order_release);
> ```
> ?
See my inline response for test4 below.

================
Comment at: test/Transforms/EarlyCSE/fence.ll:43
@@ +42,3 @@
+  fence acquire
+  %a2 = load i32, i32* %addr.i, align 4
+  %res = sub i32 %a, %a2
----------------
dvyukov wrote:
> jfb wrote:
> > You can't forward `%a` and eliminate `%a2`, but you could eliminate `%a`, and replace it by `%a2` if there are no intervening uses, no?
> In this particular case we can do whatever we want with accesses to %addr.i. No other thread can access it without a race.
@jfb - Yes, but that's not what the current optimization tries to do.  Nor does EarlyCSE have a mechanism to do that in general even for unordered loads.  

@dvyukov - Yes, but we need to actually prove that and my change to EarlyCSE is no where near that aggressive.  And frankly, I'm really not interested in exploiting the undefined behaviour part of this.  

Do you have suggestions for how to restructure the test in a way which still tests EarlyCSE but avoids disallowing the optimization you pointed out?  Would simply removing the first load from the check sequence be sufficient?

================
Comment at: test/Transforms/EarlyCSE/fence.ll:57
@@ +56,3 @@
+  fence release
+  store i32 5, i32* %addr.i, align 4
+  ret void
----------------
dvyukov wrote:
> Is it a full test? Or is it something that we don't handle yet?
> I am asking because in this particular case we can do whatever we want with accesses to %addr. Since we store to %addr both before and after the fence, we can conclude that this fence is not used to synchronize accesses to %addr, so we can eliminate/combine/etc accesses to %addr.
> 
In general, I don't believe it is legal to eliminate the first store.  With all memory initialized to zero, consider these two threads:

T1:
b = 2;
a = 1;
release fence
b = 4;

T2:
while (a == 0) {}
if (b == 0) abort();

In this *particular case* I think you're reasoning is sound because we can tell the store to a doesn't exist. 

(Just to be clear, your concern here is an unnecessarily restricted optimization, not a correctness concern right?)

Would splitting this test into two - one with a store to 'a' which is a negative test and one as is which is marked as a missed optimization - satisfy you?



http://reviews.llvm.org/D11434