[PATCH] D13363: [DeadStoreElimination] Add support for DSE across blocks

Thu Oct 22 12:43:46 PDT 2015

mbodart added inline comments.

================
Comment at: lib/Transforms/Scalar/DeadStoreElimination.cpp:572
@@ -582,23 +571,3 @@
 
-      // If we find a write that is a) removable (i.e., non-volatile), b) is
-      // completely obliterated by the store to 'Loc', and c) which we know that
-      // 'Inst' doesn't load from, then we can remove it.
-      if (isRemovable(DepWrite) &&
-          !isPossibleSelfRead(Inst, Loc, DepWrite, *TLI, *AA)) {
-        int64_t InstWriteOffset, DepWriteOffset;
-        OverwriteResult OR =
-            isOverwrite(Loc, DepLoc, DL, *TLI, DepWriteOffset, InstWriteOffset);
-        if (OR == OverwriteComplete) {
-          DEBUG(dbgs() << "DSE: Remove Dead Store:\n  DEAD: "
-                << *DepWrite << "\n  KILLER: " << *Inst << '\n');
-
-          // Delete the store and now-dead instructions that feed it.
-          DeleteDeadInstruction(DepWrite, *MD, *TLI);
-          ++NumFastStores;
-          MadeChange = true;
-
-          // DeleteDeadInstruction can delete the current instruction in loop
-          // cases, reset BBI.
-          BBI = Inst;
-          if (BBI != BB.begin())
-            --BBI;
+      while (InstDep.isDef() || InstDep.isClobber()) {
+        // Get the memory clobbered by the instruction we depend on.  MemDep will
----------------
I didn't see a response to my suggestion to keep track of SafeToProcessNonLocalDeps in this loop.  It seems incongruous and unnecessarily limiting to let any isDef or isClobber dependency here inhibit global DSE, while HandleNonLocalDependency tolerates them as long as they don't interfere.  That said, it may be better left for a future change set, as it will slightly complicate this loop.

================
Comment at: lib/Transforms/Scalar/DeadStoreElimination.cpp:653
@@ +652,3 @@
+      }
+    } else if (InstDep.isNonLocal()) { // DSE across BB
+      if (++NumNonLocalAttempts < MaxNonLocalAttempts)
----------------
For safety I think we also need to check that Inst is not a self read.

================
Comment at: lib/Transforms/Scalar/DeadStoreElimination.cpp:654
@@ +653,3 @@
+    } else if (InstDep.isNonLocal()) { // DSE across BB
+      if (++NumNonLocalAttempts < MaxNonLocalAttempts)
+        MadeChange |= handleNonLocalDependency(Inst);
----------------
I would expect any pathalogical compile time issues to occur in HandleNonLocalDependency, based on how many blocks and dependences it has to search through.  So even if MaxNonLocalAttempts is 1, we could theoretically encounter a compile time problem.  I'm not averse to keeping MaxNonLocalAttempts here, but would like to see a throttle in HandleNonLocalDependency.

================
Comment at: lib/Transforms/Scalar/DeadStoreElimination.cpp:809-817
@@ +808,11 @@
+    if (!DT->isReachableFromEntry(Pred) || DT->dominates(BB, Pred))
+      continue;
+    TerminatorInst *PredTI = Pred->getTerminator(); 
+    if ((PredTI->getNumSuccessors() == 1) ||
+        ((PredTI->getNumSuccessors() == 2) &&
+         ((PredTI->getSuccessor(0) == Pred) ||
+          (PredTI->getSuccessor(1) == Pred) ||
+          (SafeBlocks.count(PredTI->getSuccessor(0)) &&
+           SafeBlocks.count(PredTI->getSuccessor(1))))))
+      PredBlocks.push_back(Pred);
+  }
----------------
ivanbaev wrote:
> majnemer wrote:
> > The nesting here is still very large.  I'm also not entirely convinced that this code is correct.  What if `PredTI->getNumSuccessors() == 1` is true but `PredTI->getSuccessor(0) == Pred` is false?  In that case, won't you access `PredTI->getSuccessor(1) == Pred` ?  It would be safer and more clear to simply loop over `successors`.
> if PredTI->getNumSuccessors() == 1 is true, then we exit the if statement checks at this point, so we will not access PredTI->getSuccessor(0), etc.
> 
> Do you think that the following is better? 
> if (PredTI->getNumSuccessors() == 1)  {
>   PredBlocks.push_back(Pred);
>   continue;
> }
> if (PredTI->getNumSuccessors() == 2 &&
>   (PredTI->getSuccessor(0) == Pred || PredTI->getSuccessor(1) == Pred ||
>   (SafeBlocks.count(PredTI->getSuccessor(0)) &&
>       SafeBlocks.count(PredTI->getSuccessor(1)))))
>   PredBlocks.push_back(Pred);
> 
I second the suggestion to use a Successor iterator here.
Additionally, I think the checks for Pred == BB and DT->dominates(BB, Pred) are unnecessary.
Note that BB (and StartBB) should already be in SafeBlocks, so the SafeBlocks.count checks should provide sufficient safety.  I would think the routine simplifies to something like this:

for each pred P of BB
    if P is reachable and !SafeBlocks.count(P))
        PredOK = true;
        for each succ S of P
            if !SafeBlocks.count(S) && S != P
                PredOK = false;
        if PredOK, add P to PredBlocks

================
Comment at: lib/Transforms/Scalar/DeadStoreElimination.cpp:821
@@ +820,3 @@
+
+static bool underlyingObjectsDoNotAlias(StoreInst *SI, LoadInst *LI,
+                                        const DataLayout &DL,
----------------
I'm still ramping up on LLVM and thus not deeply familiar with all of AA's capabilities.
But it struck me as odd that we would need a custom routine to look for a memory conflict.
Is there no existing AA or MemoryDep interface that meets our need, and if not, should one be added?

================
Comment at: lib/Transforms/Scalar/DeadStoreElimination.cpp:854
@@ +853,3 @@
+  BasicBlock *BB = Inst->getParent();
+  const DataLayout &DL = BB->getModule()->getDataLayout();
+
----------------
ivanbaev wrote:
> majnemer wrote:
> > This hasn't been addressed.
> That is a good point. There are many places in DSE where DataLayout is used or passed as argument. However, it seems that most transformations in Transform/*, e.g. Vectorize/LoopVectorize.cpp, Scalar/SROA.cpp, Scalar/GVN.cpp, favor accessing DataLayout this way. 
> [Ivan] Each BB ends with a terminator instruction and these should not have contain dependencies for the store.

An Invoke is one instance of a terminator that can have memory dependences.
I don't know if our CFG search could ever get through both edges coming out of an Invoke,
but I think it is best to avoid any such assumptions, and let AA make the dependency determinations.

================
Comment at: lib/Transforms/Scalar/DeadStoreElimination.cpp:892
@@ +891,3 @@
+      // above the store within the loop, etc.
+      if ((InstPt->getNumSuccessors() == 2) &&
+          ((InstPt->getSuccessor(0) == PB) ||
----------------
This test is loop invariant and should probably be cached before the loop.
This is especially  true after generalizing FindSafePreds to allow an arbitrary number of successors.

================
Comment at: lib/Transforms/Scalar/DeadStoreElimination.cpp:911
@@ +910,3 @@
+        // No need to search for redundant stores up in the block or
+        // up into predecessors; those should have been handled earlier.
+      }
----------------
This is true for the common case of exact overlap.
But it might be worth adding a TBD comment to the effect that we could reasonably continue the search if the dead instruction was only a partial overlap.

I'm not suggesting we do the full "shortening" transformation that local DSE does (though that might be fun).  But catching simple scalar stores would just work.

================
Comment at: lib/Transforms/Scalar/DeadStoreElimination.cpp:919
@@ +918,3 @@
+    // to the set of blocks to search for redundant stores.
+    if (!DependencyFound) {
+      SafeBlocks.insert(PB);
----------------

> [Ivan] We know that the first dependence - which triggers invocation of HandleNonLocalDependency(Inst) - is non-local

Perhaps I don't quite understand the possible return values from MD, but it seems like Dep could be Unknown or NonLocalFunc here, in which case we should terminate the search.

================
Comment at: test/Transforms/DeadStoreElimination/cycle.ll:2-3
@@ +1,4 @@
+; RUN: opt < %s -basicaa -dse -S | FileCheck %s
+target datalayout = "e-m:e-p:32:32-i64:64-v128:64:128-n32-S64"
+target triple = "armv7-none-linux-gnueabi"
+
----------------
These tests seem fairly generic.
I would prefer that we remove the explicit datalayout and triple here.


http://reviews.llvm.org/D13363