[llvm] 6253b77 - [SLP] Respect control dependence within a block during scheduling

Philip Reames via llvm-commits llvm-commits at lists.llvm.org
Sat Mar 19 13:36:31 PDT 2022


Author: Philip Reames
Date: 2022-03-19T13:36:24-07:00
New Revision: 6253b77da9f3fd7ee3286a2b4d39e87b0f0f55f5

URL: https://github.com/llvm/llvm-project/commit/6253b77da9f3fd7ee3286a2b4d39e87b0f0f55f5
DIFF: https://github.com/llvm/llvm-project/commit/6253b77da9f3fd7ee3286a2b4d39e87b0f0f55f5.diff

LOG: [SLP] Respect control dependence within a block during scheduling

This fixes an active miscompile visible in the test changes.  The basic problem is that the scheduling dependency graph didn't have any edges for control dependence within a single basic block.  The result is that we could (and in some rare cases *did*) perform reorderings within a block which could introduce new undefined behavior along paths which didn't previously contain any.
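
For illustration, here is a minimal IR sketch of the hazard (the callee name matches the one used in the tests below, but the attributes shown are assumptions, not taken from this commit):

    declare i64 @may_inf_loop_ro() readonly nounwind

    define i64 @hazard(i64* %p) {
      ; The call neither writes memory nor feeds the load, so the old
      ; dependency graph had no edge between the two instructions ...
      %c = call i64 @may_inf_loop_ro()
      ; ... but if the call never returns, this load never executes.  Hoisting
      ; it above the call introduces a possible fault on that path.
      %v = load i64, i64* %p
      %s = add i64 %c, %v
      ret i64 %s
    }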

Impact wise, we have two major cases where control is not guaranteed to reach a later instruction in the block: may-throw calls, and calls containing infinite loops.
* The former case was mostly covered by the memory dependencies, and triggering it requires a function which can throw but does not write to memory.  In theory, such a case is possible, but it is unlikely in practice.
* The latter case is likely more of an issue in practice.  After this code was first written, we changed the IR semantics to allow well-defined infinite loops without satisfying mustprogress.  Even for C/C++ - which do imply mustprogress - recent changes to how we treat atomics (e.g. an atomic read does not always imply a write) could expose this issue; see the sketch after this list.  I'm a bit shocked we don't seem to have a bug report which hit this in real code, actually.
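
As a sketch of that latter point (a hypothetical body for a callee like @may_inf_loop_ro; the global and the ordering are made up), an infinite loop around an atomic read is well-defined IR even though it neither writes memory nor guarantees progress:

    @flag = external global i32

    define i64 @may_inf_loop_ro() readonly nounwind {
    entry:
      br label %loop
    loop:
      ; A monotonic atomic load is only a read, so the function can remain
      ; readonly.  Without mustprogress, a loop that makes no progress is
      ; still well defined and may simply never terminate.
      %f = load atomic i32, i32* @flag monotonic, align 4
      %done = icmp ne i32 %f, 0
      br i1 %done, label %exit, label %loop
    exit:
      ret i64 0
    }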

Compile time wise, this results in a single extra scan of the scheduling window in the common case.  Since we stop scanning at the next instruction which isn't guaranteed to transfer execution, no matter what order we traverse instructions in, we scan the block once.  The exception is that when we extend the scheduling window downwards, we invalidate all dependencies and thus rescan.  So the potentially expensive case is when we have a call in a big scheduling window which is frequently extended.  We could optimize this case (by caching the last instruction not guaranteed to transfer execution and scanning only the extended window, starting there), but I decided to leave that complexity out until it matters.  That same case is already degenerate with memory dependences, which are more expensive than the control dependence scan.
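
To make that scan concrete, here is a hand-annotated sketch (assuming all of these instructions land in a single scheduling window; the callee declaration is repeated from the earlier sketch):

    declare i64 @may_inf_loop_ro() readonly nounwind

    define void @scan(i64* %p, i64* %q) {
      ; %c1 may not transfer execution to its successor, so a scan starts here
      ; and records the following non-speculatable instructions ...
      %c1 = call i64 @may_inf_loop_ro()
      %v1 = load i64, i64* %p            ; control dependent on %c1
      ; ... and stops at the next instruction which may not transfer execution;
      ; everything below it is picked up by the scan starting at %c2 instead.
      %c2 = call i64 @may_inf_loop_ro()  ; control dependent on %c1
      store i64 %v1, i64* %q             ; control dependent on %c2
      ret void
    }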

We could also consider combining the memory dependence and control dependence sets to reduce memory usage, but since it complicates the code slightly and makes debugging a bit harder, I went with the simplest scheme for now.

This was noticed while trying to understand the failures reported against D118538, but is not otherwise related to that change.

Added: 
    

Modified: 
    llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
    llvm/test/Transforms/SLPVectorizer/X86/control-dependence.ll

Removed: 
    


################################################################################
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index c2b6333af5680..36db38f6fbf5b 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -2625,6 +2625,7 @@ class BoUpSLP {
       Dependencies = InvalidDeps;
       resetUnscheduledDeps();
       MemoryDependencies.clear();
+      ControlDependencies.clear();
     }
 
     int unscheduledDepsInBundle() const {
@@ -2679,6 +2680,12 @@ class BoUpSLP {
     /// This list is derived on demand in calculateDependencies().
     SmallVector<ScheduleData *, 4> MemoryDependencies;
 
+    /// List of instructions which this instruction could be control dependent
+    /// on.  Allowing such nodes to be scheduled below this one could introduce
+    /// a runtime fault which didn't exist in the original program.
+    /// ex: this is a load or udiv following a readonly call which inf loops
+    SmallVector<ScheduleData *, 4> ControlDependencies;
+
     /// This ScheduleData is in the current scheduling region if this matches
     /// the current SchedulingRegionID of BlockScheduling.
     int SchedulingRegionID = 0;
@@ -2863,6 +2870,20 @@ class BoUpSLP {
                        << "SLP:    gets ready (mem): " << *DepBundle << "\n");
           }
         }
+        // Handle the control dependencies.
+        for (ScheduleData *DepSD : BundleMember->ControlDependencies) {
+          if (DepSD->incrementUnscheduledDeps(-1) == 0) {
+            // There are no more unscheduled dependencies after decrementing,
+            // so we can put the dependent instruction into the ready list.
+            ScheduleData *DepBundle = DepSD->FirstInBundle;
+            assert(!DepBundle->IsScheduled &&
+                   "already scheduled bundle gets ready");
+            ReadyList.insert(DepBundle);
+            LLVM_DEBUG(dbgs()
+                       << "SLP:    gets ready (ctl): " << *DepBundle << "\n");
+          }
+        }
+
       }
     }
 
@@ -2949,11 +2970,7 @@ class BoUpSLP {
                           ScheduleData *NextLoadStore);
 
     /// Updates the dependency information of a bundle and of all instructions/
-    /// bundles which depend on the original bundle.  Note that only
-    /// def-use and memory dependencies are explicitly modeled.  We do not
-    /// track control dependencies (e.g. a potentially faulting load following
-    /// a potentially infinte looping readnone call), and as such the resulting
-    /// graph is a subgraph of the full dependency graph.
+    /// bundles which depend on the original bundle.
     void calculateDependencies(ScheduleData *SD, bool InsertInReadyList,
                                BoUpSLP *SLP);
 
@@ -8043,6 +8060,32 @@ void BoUpSLP::BlockScheduling::calculateDependencies(ScheduleData *SD,
         }
       }
 
+      // Any instruction which isn't safe to speculate at the beginning of the
+      // block is control dependent on any early exit or non-willreturn call
+      // which precedes it.
+      if (!isGuaranteedToTransferExecutionToSuccessor(BundleMember->Inst)) {
+        for (Instruction *I = BundleMember->Inst->getNextNode();
+             I != ScheduleEnd; I = I->getNextNode()) {
+          if (isSafeToSpeculativelyExecute(I, &*BB->begin()))
+            continue;
+
+          // Add the dependency
+          auto *DepDest = getScheduleData(I);
+          assert(DepDest && "must be in schedule window");
+          DepDest->ControlDependencies.push_back(BundleMember);
+          BundleMember->Dependencies++;
+          ScheduleData *DestBundle = DepDest->FirstInBundle;
+          if (!DestBundle->IsScheduled)
+            BundleMember->incrementUnscheduledDeps(1);
+          if (!DestBundle->hasValidDependencies())
+            WorkList.push_back(DestBundle);
+
+          if (!isGuaranteedToTransferExecutionToSuccessor(I))
+            // Everything past here must be control dependent on I.
+            break;
+        }
+      }
+
       // Handle the memory dependencies (if any).
       ScheduleData *DepDest = BundleMember->NextLoadStore;
       if (!DepDest)
@@ -8137,12 +8180,10 @@ void BoUpSLP::scheduleBlock(BlockScheduling *BS) {
   // For the real scheduling we use a more sophisticated ready-list: it is
   // sorted by the original instruction location. This lets the final schedule
   // be as  close as possible to the original instruction order.
-  // WARNING: This required for correctness in several cases:
-  // * We must prevent reordering of potentially infinte loops inside
-  //   readnone calls with following potentially faulting instructions.
+  // WARNING: This is required for correctness in the following case:
   // * We must prevent reordering of allocas with stacksave intrinsic calls.
-  // In both cases, we rely on two instructions which are both ready (per the
-  // def-use and memory dependency subgraph) not to be reordered.
+  // We rely on two instructions which are both ready (per the subgraph) not
+  // to be reordered.
   struct ScheduleDataCompare {
     bool operator()(ScheduleData *SD1, ScheduleData *SD2) const {
       return SD2->SchedulingPriority < SD1->SchedulingPriority;

diff --git a/llvm/test/Transforms/SLPVectorizer/X86/control-dependence.ll b/llvm/test/Transforms/SLPVectorizer/X86/control-dependence.ll
index 1bf9b4c4c6d75..69f8577169df4 100644
--- a/llvm/test/Transforms/SLPVectorizer/X86/control-dependence.ll
+++ b/llvm/test/Transforms/SLPVectorizer/X86/control-dependence.ll
@@ -219,19 +219,25 @@ define void @test6(i64* %a, i64* %b, i64* %c) {
   ret void
 }
 
-; FIXME: We are hoisting a possible faulting load above a call which may
-; not return, this is wrong.
+; In this case, we can't vectorize the load pair because there's no valid
+; scheduling point which respects both memory and control dependence.  If
+; we scheduled the second load before the store (holding the first one in
+; place), we'd have hoisted a potentially faulting load above a potentially
+; infinite call and thus introduced a possible fault which didn't previously
+; exist in the program.
 define void @test7(i64* %a, i64* %b, i64* %c) {
 ; CHECK-LABEL: @test7(
-; CHECK-NEXT:    [[A2:%.*]] = getelementptr i64, i64* [[A:%.*]], i32 1
-; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i64* [[A]] to <2 x i64>*
-; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x i64>, <2 x i64>* [[TMP1]], align 4
+; CHECK-NEXT:    [[V1:%.*]] = load i64, i64* [[A:%.*]], align 4
 ; CHECK-NEXT:    store i64 0, i64* [[A]], align 4
-; CHECK-NEXT:    [[TMP3:%.*]] = call i64 @may_inf_loop_ro()
+; CHECK-NEXT:    [[TMP1:%.*]] = call i64 @may_inf_loop_ro()
+; CHECK-NEXT:    [[A2:%.*]] = getelementptr i64, i64* [[A]], i32 1
+; CHECK-NEXT:    [[V2:%.*]] = load i64, i64* [[A2]], align 4
 ; CHECK-NEXT:    [[CA2:%.*]] = getelementptr i64, i64* [[C:%.*]], i32 1
-; CHECK-NEXT:    [[TMP4:%.*]] = bitcast i64* [[C]] to <2 x i64>*
-; CHECK-NEXT:    [[TMP5:%.*]] = load <2 x i64>, <2 x i64>* [[TMP4]], align 4
-; CHECK-NEXT:    [[TMP6:%.*]] = add <2 x i64> [[TMP2]], [[TMP5]]
+; CHECK-NEXT:    [[TMP2:%.*]] = bitcast i64* [[C]] to <2 x i64>*
+; CHECK-NEXT:    [[TMP3:%.*]] = load <2 x i64>, <2 x i64>* [[TMP2]], align 4
+; CHECK-NEXT:    [[TMP4:%.*]] = insertelement <2 x i64> poison, i64 [[V1]], i32 0
+; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <2 x i64> [[TMP4]], i64 [[V2]], i32 1
+; CHECK-NEXT:    [[TMP6:%.*]] = add <2 x i64> [[TMP5]], [[TMP3]]
 ; CHECK-NEXT:    [[B2:%.*]] = getelementptr i64, i64* [[B:%.*]], i32 1
 ; CHECK-NEXT:    [[TMP7:%.*]] = bitcast i64* [[B]] to <2 x i64>*
 ; CHECK-NEXT:    store <2 x i64> [[TMP6]], <2 x i64>* [[TMP7]], align 4
@@ -255,18 +261,20 @@ define void @test7(i64* %a, i64* %b, i64* %c) {
   ret void
 }
 
-; FIXME: Same as test7, but with a throwing call
+; Same as test7, but with a throwing call
 define void @test8(i64* %a, i64* %b, i64* %c) {
 ; CHECK-LABEL: @test8(
-; CHECK-NEXT:    [[A2:%.*]] = getelementptr i64, i64* [[A:%.*]], i32 1
-; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i64* [[A]] to <2 x i64>*
-; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x i64>, <2 x i64>* [[TMP1]], align 4
+; CHECK-NEXT:    [[V1:%.*]] = load i64, i64* [[A:%.*]], align 4
 ; CHECK-NEXT:    store i64 0, i64* [[A]], align 4
-; CHECK-NEXT:    [[TMP3:%.*]] = call i64 @may_throw() #[[ATTR4:[0-9]+]]
+; CHECK-NEXT:    [[TMP1:%.*]] = call i64 @may_throw() #[[ATTR4:[0-9]+]]
+; CHECK-NEXT:    [[A2:%.*]] = getelementptr i64, i64* [[A]], i32 1
+; CHECK-NEXT:    [[V2:%.*]] = load i64, i64* [[A2]], align 4
 ; CHECK-NEXT:    [[CA2:%.*]] = getelementptr i64, i64* [[C:%.*]], i32 1
-; CHECK-NEXT:    [[TMP4:%.*]] = bitcast i64* [[C]] to <2 x i64>*
-; CHECK-NEXT:    [[TMP5:%.*]] = load <2 x i64>, <2 x i64>* [[TMP4]], align 4
-; CHECK-NEXT:    [[TMP6:%.*]] = add <2 x i64> [[TMP2]], [[TMP5]]
+; CHECK-NEXT:    [[TMP2:%.*]] = bitcast i64* [[C]] to <2 x i64>*
+; CHECK-NEXT:    [[TMP3:%.*]] = load <2 x i64>, <2 x i64>* [[TMP2]], align 4
+; CHECK-NEXT:    [[TMP4:%.*]] = insertelement <2 x i64> poison, i64 [[V1]], i32 0
+; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <2 x i64> [[TMP4]], i64 [[V2]], i32 1
+; CHECK-NEXT:    [[TMP6:%.*]] = add <2 x i64> [[TMP5]], [[TMP3]]
 ; CHECK-NEXT:    [[B2:%.*]] = getelementptr i64, i64* [[B:%.*]], i32 1
 ; CHECK-NEXT:    [[TMP7:%.*]] = bitcast i64* [[B]] to <2 x i64>*
 ; CHECK-NEXT:    store <2 x i64> [[TMP6]], <2 x i64>* [[TMP7]], align 4
@@ -328,23 +336,24 @@ define void @test9(i64* %a, i64* %b, i64* %c) {
 }
 
 ; A variant of test7 which shows the same problem with a non-load instruction
-; FIXME: This is wrong, we're hoisting a faulting udiv above an infinite loop.
 define void @test10(i64* %a, i64* %b, i64* %c) {
 ; CHECK-LABEL: @test10(
-; CHECK-NEXT:    [[A2:%.*]] = getelementptr i64, i64* [[A:%.*]], i32 1
-; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i64* [[A]] to <2 x i64>*
-; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x i64>, <2 x i64>* [[TMP1]], align 4
-; CHECK-NEXT:    [[TMP3:%.*]] = udiv <2 x i64> <i64 200, i64 200>, [[TMP2]]
-; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <2 x i64> [[TMP3]], i32 0
-; CHECK-NEXT:    store i64 [[TMP4]], i64* [[A]], align 4
-; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @may_inf_loop_ro()
+; CHECK-NEXT:    [[V1:%.*]] = load i64, i64* [[A:%.*]], align 4
+; CHECK-NEXT:    [[A2:%.*]] = getelementptr i64, i64* [[A]], i32 1
+; CHECK-NEXT:    [[V2:%.*]] = load i64, i64* [[A2]], align 4
+; CHECK-NEXT:    [[U1:%.*]] = udiv i64 200, [[V1]]
+; CHECK-NEXT:    store i64 [[U1]], i64* [[A]], align 4
+; CHECK-NEXT:    [[TMP1:%.*]] = call i64 @may_inf_loop_ro()
+; CHECK-NEXT:    [[U2:%.*]] = udiv i64 200, [[V2]]
 ; CHECK-NEXT:    [[CA2:%.*]] = getelementptr i64, i64* [[C:%.*]], i32 1
-; CHECK-NEXT:    [[TMP6:%.*]] = bitcast i64* [[C]] to <2 x i64>*
-; CHECK-NEXT:    [[TMP7:%.*]] = load <2 x i64>, <2 x i64>* [[TMP6]], align 4
-; CHECK-NEXT:    [[TMP8:%.*]] = add <2 x i64> [[TMP3]], [[TMP7]]
+; CHECK-NEXT:    [[TMP2:%.*]] = bitcast i64* [[C]] to <2 x i64>*
+; CHECK-NEXT:    [[TMP3:%.*]] = load <2 x i64>, <2 x i64>* [[TMP2]], align 4
+; CHECK-NEXT:    [[TMP4:%.*]] = insertelement <2 x i64> poison, i64 [[U1]], i32 0
+; CHECK-NEXT:    [[TMP5:%.*]] = insertelement <2 x i64> [[TMP4]], i64 [[U2]], i32 1
+; CHECK-NEXT:    [[TMP6:%.*]] = add <2 x i64> [[TMP5]], [[TMP3]]
 ; CHECK-NEXT:    [[B2:%.*]] = getelementptr i64, i64* [[B:%.*]], i32 1
-; CHECK-NEXT:    [[TMP9:%.*]] = bitcast i64* [[B]] to <2 x i64>*
-; CHECK-NEXT:    store <2 x i64> [[TMP8]], <2 x i64>* [[TMP9]], align 4
+; CHECK-NEXT:    [[TMP7:%.*]] = bitcast i64* [[B]] to <2 x i64>*
+; CHECK-NEXT:    store <2 x i64> [[TMP6]], <2 x i64>* [[TMP7]], align 4
 ; CHECK-NEXT:    ret void
 ;
   %v1 = load i64, i64* %a


        

