[llvm] r326925 - [Pipeliner] Fixed node order issue related to zero latency edges

Wed Mar 7 10:53:36 PST 2018

Author: jwroorda
Date: Wed Mar  7 10:53:36 2018
New Revision: 326925

URL: http://llvm.org/viewvc/llvm-project?rev=326925&view=rev
Log:
[Pipeliner] Fixed node order issue related to zero latency edges

Summary:
A desired property of the node order in Swing Modulo Scheduling is
that for nodes outside circuits the following holds: none of them is
scheduled after both a successor and a predecessor. We call
node orders that meet this property valid.

Although invalid node orders do not lead to the generation of incorrect
code, they can cause the pipeliner not being able to find a pipelined schedule
for arbitrary II. The reason is that after scheduling the successor and the
predecessor of a node, no room may be left to schedule the node itself.

For data flow graphs with 0-latency edges, the node ordering algorithm
of Swing Modulo Scheduling can generate such undesired invalid node orders.
This patch fixes that.

In the remainder of this commit message, I will give an example
demonstrating the issue, explain the fix, and explain how the the fix is tested.

Consider, as an example, the following data flow graph with all
edge latencies 0 and all edges pointing downward.

```
   n0
  /  \
n1    n3
  \  /
   n2
    |
   n4
```

Consider the implemented node order algorithm in top-down mode. In that mode,
the algorithm orders the nodes based on greatest Height and in case of equal
Height on lowest Movability. Finally, in case of equal Height and
Movability, given two nodes with an edge between them, the algorithm prefers
the source-node.

In the graph, for every node, the Height and Movability are equal to 0.
As will be explained below, the algorithm can generate the order n0, n1, n2, n3, n4.
So, node n3 is scheduled after its predecessor n0 and after its successor n2.

The reason that the algorithm can put node n2 in the order before node n3,
even though they have an edge between them in which node n3 is the source,
is the following: Suppose the algorithm has constructed the partial node
order n0, n1. Then, the nodes left to be ordered are nodes n2, n3, and n4. Suppose
that the while-loop in the implemented algorithm considers the nodes in
the order n4, n3, n2. The algorithm will start with node n4, and look for
more preferable nodes. First, node n4 will be compared with node n3. As the nodes
have equal Height and Movability and have no edge between them, the algorithm
will stick with node n4. Then node n4 is compared with node n2. Again the
Height and Movability are equal. But, this time, there is an edge between
the two nodes, and the algorithm will prefer the source node n2.
As there are no nodes left to compare, the algorithm will add node n2 to
the node order, yielding the partial node order n0, n1, n2. In this way node n2
arrives in the node-order before node n3.

To solve this, this patch introduces the ZeroLatencyHeight (ZLH) property
for nodes. It is defined as the maximum unweighted length of a path from the
given node to an arbitrary node in which each edge has latency 0.
So, ZLH(n0)=3, ZLH(n1)=ZLH(n3)=2, ZLH(n2)=1, and ZLH(n4)=0

In this patch, the preference for a greater ZeroLatencyHeight
is added in the top-down mode of the node ordering algorithm, after the
preference for a greater Height, and before the preference for a
lower Movability.

Therefore, the two allowed node-orders are n0, n1, n3, n2, n4 and n0, n3, n1, n2, n4.
Both of them are valid node orders.

In the same way, the bottom-up mode of the node ordering algorithm is adapted
by introducing the ZeroLatencyDepth property for nodes.

The patch is tested by adding extra checks to the following existing
lit-tests:
test/CodeGen/Hexagon/SUnit-boundary-prob.ll
test/CodeGen/Hexagon/frame-offset-overflow.ll
test/CodeGen/Hexagon/vect/vect-shuffle.ll

Before this patch, the pipeliner failed to pipeline the loops in these tests
due to invalid node-orders. After the patch, the pipeliner successfully
pipelines all these loops.

Reviewers: bcahoon

Reviewed By: bcahoon

Subscribers: Ayal, mgrang, llvm-commits

Differential Revision: https://reviews.llvm.org/D43620

Modified:
    llvm/trunk/lib/CodeGen/MachinePipeliner.cpp
    llvm/trunk/test/CodeGen/Hexagon/SUnit-boundary-prob.ll
    llvm/trunk/test/CodeGen/Hexagon/frame-offset-overflow.ll
    llvm/trunk/test/CodeGen/Hexagon/swp-vmult.ll
    llvm/trunk/test/CodeGen/Hexagon/vect/vect-shuffle.ll

Modified: llvm/trunk/lib/CodeGen/MachinePipeliner.cpp
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/MachinePipeliner.cpp?rev=326925&r1=326924&r2=326925&view=diff
==============================================================================

--- llvm/trunk/lib/CodeGen/MachinePipeliner.cpp (original)
+++ llvm/trunk/lib/CodeGen/MachinePipeliner.cpp Wed Mar  7 10:53:36 2018
@@ -125,6 +125,7 @@ using namespace llvm;
 
 STATISTIC(NumTrytoPipeline, "Number of loops that we attempt to pipeline");
 STATISTIC(NumPipelined, "Number of loops software pipelined");
+STATISTIC(NumNodeOrderIssues, "Number of node order issues found");
 
 /// A command line option to turn software pipelining on or off.
 static cl::opt<bool> EnableSWP("enable-pipeliner", cl::Hidden, cl::init(true),
@@ -241,6 +242,8 @@ class SwingSchedulerDAG : public Schedul
   struct NodeInfo {
     int ASAP = 0;
     int ALAP = 0;
+    int ZeroLatencyDepth = 0;
+    int ZeroLatencyHeight = 0;
 
     NodeInfo() = default;
   };
@@ -320,9 +323,21 @@ public:
   /// The depth, in the dependence graph, for a node.
   int getDepth(SUnit *Node) { return Node->getDepth(); }
 
+  /// The maximum unweighted length of a path from an arbitrary node to the
+  /// given node in which each edge has latency 0
+  int getZeroLatencyDepth(SUnit *Node) {
+    return ScheduleInfo[Node->NodeNum].ZeroLatencyDepth;
+  }
+
   /// The height, in the dependence graph, for a node.
   int getHeight(SUnit *Node) { return Node->getHeight(); }
 
+  /// The maximum unweighted length of a path from the given node to an
+  /// arbitrary node in which each edge has latency 0
+  int getZeroLatencyHeight(SUnit *Node) {
+    return ScheduleInfo[Node->NodeNum].ZeroLatencyHeight;
+  }
+
   /// Return true if the dependence is a back-edge in the data dependence graph.
   /// Since the DAG doesn't contain cycles, we represent a cycle in the graph
   /// using an anti dependence from a Phi to an instruction.
@@ -404,6 +419,7 @@ private:
   void addConnectedNodes(SUnit *SU, NodeSet &NewSet,
                          SetVector<SUnit *> &NodesAdded);
   void computeNodeOrder(NodeSetType &NodeSets);
+  void checkValidNodeOrder(const NodeSetType &Circuits) const;
   bool schedulePipeline(SMSchedule &Schedule);
   void generatePipelinedLoop(SMSchedule &Schedule);
   void generateProlog(SMSchedule &Schedule, unsigned LastStage,
@@ -863,6 +879,7 @@ void SwingSchedulerDAG::schedule() {
 
   NodeSetType NodeSets;
   findCircuits(NodeSets);
+  NodeSetType Circuits = NodeSets;
 
   // Calculate the MII.
   unsigned ResMII = calculateResMII();
@@ -916,6 +933,9 @@ void SwingSchedulerDAG::schedule() {
 
   computeNodeOrder(NodeSets);
 
+  // check for node order issues
+  checkValidNodeOrder(Circuits);
+
   SMSchedule Schedule(Pass.MF);
   Scheduled = schedulePipeline(Schedule);
 
@@ -1568,42 +1588,52 @@ void SwingSchedulerDAG::computeNodeFunct
   });
 
   int maxASAP = 0;
-  // Compute ASAP.
+  // Compute ASAP and ZeroLatencyDepth.
   for (ScheduleDAGTopologicalSort::const_iterator I = Topo.begin(),
                                                   E = Topo.end();
        I != E; ++I) {
     int asap = 0;
+    int zeroLatencyDepth = 0;
     SUnit *SU = &SUnits[*I];
     for (SUnit::const_pred_iterator IP = SU->Preds.begin(),
                                     EP = SU->Preds.end();
          IP != EP; ++IP) {
+      SUnit *pred = IP->getSUnit();
+      if (getLatency(SU, *IP) == 0)
+        zeroLatencyDepth =
+            std::max(zeroLatencyDepth, getZeroLatencyDepth(pred) + 1);
       if (ignoreDependence(*IP, true))
         continue;
-      SUnit *pred = IP->getSUnit();
       asap = std::max(asap, (int)(getASAP(pred) + getLatency(SU, *IP) -
                                   getDistance(pred, SU, *IP) * MII));
     }
     maxASAP = std::max(maxASAP, asap);
     ScheduleInfo[*I].ASAP = asap;
+    ScheduleInfo[*I].ZeroLatencyDepth = zeroLatencyDepth;
   }
 
-  // Compute ALAP and MOV.
+  // Compute ALAP, ZeroLatencyHeight, and MOV.
   for (ScheduleDAGTopologicalSort::const_reverse_iterator I = Topo.rbegin(),
                                                           E = Topo.rend();
        I != E; ++I) {
     int alap = maxASAP;
+    int zeroLatencyHeight = 0;
     SUnit *SU = &SUnits[*I];
     for (SUnit::const_succ_iterator IS = SU->Succs.begin(),
                                     ES = SU->Succs.end();
          IS != ES; ++IS) {
+      SUnit *succ = IS->getSUnit();
+      if (getLatency(SU, *IS) == 0)
+        zeroLatencyHeight =
+            std::max(zeroLatencyHeight, getZeroLatencyHeight(succ) + 1);
       if (ignoreDependence(*IS, true))
         continue;
-      SUnit *succ = IS->getSUnit();
       alap = std::min(alap, (int)(getALAP(succ) - getLatency(SU, *IS) +
                                   getDistance(SU, succ, *IS) * MII));
     }
 
     ScheduleInfo[*I].ALAP = alap;
+    ScheduleInfo[*I].ZeroLatencyHeight = zeroLatencyHeight;
   }
 
   // After computing the node functions, compute the summary for each node set.
@@ -1618,6 +1648,8 @@ void SwingSchedulerDAG::computeNodeFunct
       dbgs() << "\t   MOV  = " << getMOV(&SUnits[i]) << "\n";
       dbgs() << "\t   D    = " << getDepth(&SUnits[i]) << "\n";
       dbgs() << "\t   H    = " << getHeight(&SUnits[i]) << "\n";
+      dbgs() << "\t   ZLD  = " << getZeroLatencyDepth(&SUnits[i]) << "\n";
+      dbgs() << "\t   ZLH  = " << getZeroLatencyHeight(&SUnits[i]) << "\n";
     }
   });
 }
@@ -1986,14 +2018,6 @@ void SwingSchedulerDAG::removeDuplicateN
     }
 }
 
-/// Return true if Inst1 defines a value that is used in Inst2.
-static bool hasDataDependence(SUnit *Inst1, SUnit *Inst2) {
-  for (auto &SI : Inst1->Succs)
-    if (SI.getSUnit() == Inst2 && SI.getKind() == SDep::Data)
-      return true;
-  return false;
-}
-
 /// Compute an ordered list of the dependence graph nodes, which
 /// indicates the order that the nodes will be scheduled.  This is a
 /// two-level algorithm. First, a partial order is created, which
@@ -2040,18 +2064,20 @@ void SwingSchedulerDAG::computeNodeOrder
     while (!R.empty()) {
       if (Order == TopDown) {
         // Choose the node with the maximum height.  If more than one, choose
-        // the node with the lowest MOV. If still more than one, check if there
-        // is a dependence between the instructions.
+        // the node with the maximum ZeroLatencyHeight. If still more than one,
+        // choose the node with the lowest MOV.
         while (!R.empty()) {
           SUnit *maxHeight = nullptr;
           for (SUnit *I : R) {
             if (maxHeight == nullptr || getHeight(I) > getHeight(maxHeight))
               maxHeight = I;
             else if (getHeight(I) == getHeight(maxHeight) &&
-                     getMOV(I) < getMOV(maxHeight) &&
-                     !hasDataDependence(maxHeight, I))
+                     getZeroLatencyHeight(I) > getZeroLatencyHeight(maxHeight))
               maxHeight = I;
-            else if (hasDataDependence(I, maxHeight))
+            else if (getHeight(I) == getHeight(maxHeight) &&
+                     getZeroLatencyHeight(I) ==
+                         getZeroLatencyHeight(maxHeight) &&
+                     getMOV(I) < getMOV(maxHeight))
               maxHeight = I;
           }
           NodeOrder.insert(maxHeight);
@@ -2084,18 +2110,19 @@ void SwingSchedulerDAG::computeNodeOrder
           R.insert(N.begin(), N.end());
       } else {
         // Choose the node with the maximum depth.  If more than one, choose
-        // the node with the lowest MOV. If there is still more than one, check
-        // for a dependence between the instructions.
+        // the node with the maximum ZeroLatencyDepth. If still more than one,
+        // choose the node with the lowest MOV.
         while (!R.empty()) {
           SUnit *maxDepth = nullptr;
           for (SUnit *I : R) {
             if (maxDepth == nullptr || getDepth(I) > getDepth(maxDepth))
               maxDepth = I;
             else if (getDepth(I) == getDepth(maxDepth) &&
-                     getMOV(I) < getMOV(maxDepth) &&
-                     !hasDataDependence(I, maxDepth))
+                     getZeroLatencyDepth(I) > getZeroLatencyDepth(maxDepth))
               maxDepth = I;
-            else if (hasDataDependence(maxDepth, I))
+            else if (getDepth(I) == getDepth(maxDepth) &&
+                     getZeroLatencyDepth(I) == getZeroLatencyDepth(maxDepth) &&
+                     getMOV(I) < getMOV(maxDepth))
               maxDepth = I;
           }
           NodeOrder.insert(maxDepth);
@@ -3864,6 +3891,96 @@ bool SMSchedule::isValidSchedule(SwingSc
   return true;
 }
 
+/// A property of the node order in swing-modulo-scheduling is
+/// that for nodes outside circuits the following holds:
+/// none of them is scheduled after both a successor and a
+/// predecessor.
+/// The method below checks whether the property is met.
+/// If not, debug information is printed and statistics information updated.
+/// Note that we do not use an assert statement.
+/// The reason is that although an invalid node oder may prevent
+/// the pipeliner from finding a pipelined schedule for arbitrary II,
+/// it does not lead to the generation of incorrect code.
+void SwingSchedulerDAG::checkValidNodeOrder(const NodeSetType &Circuits) const {
+
+  // a sorted vector that maps each SUnit to its index in the NodeOrder
+  typedef std::pair<SUnit *, unsigned> UnitIndex;
+  std::vector<UnitIndex> Indices(NodeOrder.size(), std::make_pair(nullptr, 0));
+
+  for (unsigned i = 0, s = NodeOrder.size(); i < s; ++i)
+    Indices.push_back(std::make_pair(NodeOrder[i], i));
+
+  auto CompareKey = [](UnitIndex i1, UnitIndex i2) {
+    return std::get<0>(i1) < std::get<0>(i2);
+  };
+
+  // sort, so that we can perform a binary search
+  std::sort(Indices.begin(), Indices.end(), CompareKey);
+
+  bool Valid = true;
+  // for each SUnit in the NodeOrder, check whether
+  // it appears after both a successor and a predecessor
+  // of the SUnit. If this is the case, and the SUnit
+  // is not part of circuit, then the NodeOrder is not
+  // valid.
+  for (unsigned i = 0, s = NodeOrder.size(); i < s; ++i) {
+    SUnit *SU = NodeOrder[i];
+    unsigned Index = i;
+
+    bool PredBefore = false;
+    bool SuccBefore = false;
+
+    SUnit *Succ;
+    SUnit *Pred;
+
+    for (SDep &PredEdge : SU->Preds) {
+      SUnit *PredSU = PredEdge.getSUnit();
+      unsigned PredIndex =
+          std::get<1>(*std::lower_bound(Indices.begin(), Indices.end(),
+                                        std::make_pair(PredSU, 0), CompareKey));
+      if (!PredSU->getInstr()->isPHI() && PredIndex < Index) {
+        PredBefore = true;
+        Pred = PredSU;
+        break;
+      }
+    }
+
+    for (SDep &SuccEdge : SU->Succs) {
+      SUnit *SuccSU = SuccEdge.getSUnit();
+      unsigned SuccIndex =
+          std::get<1>(*std::lower_bound(Indices.begin(), Indices.end(),
+                                        std::make_pair(SuccSU, 0), CompareKey));
+      if (!SuccSU->getInstr()->isPHI() && SuccIndex < Index) {
+        SuccBefore = true;
+        Succ = SuccSU;
+        break;
+      }
+    }
+
+    if (PredBefore && SuccBefore && !SU->getInstr()->isPHI()) {
+      // instructions in circuits are allowed to be scheduled
+      // after both a successor and predecessor.
+      bool InCircuit = std::any_of(
+          Circuits.begin(), Circuits.end(),
+          [SU](const NodeSet &Circuit) { return Circuit.count(SU); });
+      if (InCircuit)
+        DEBUG(dbgs() << "In a circuit, predecessor ";);
+      else {
+        Valid = false;
+        NumNodeOrderIssues++;
+        DEBUG(dbgs() << "Predecessor ";);
+      }
+      DEBUG(dbgs() << Pred->NodeNum << " and successor " << Succ->NodeNum
+                   << " are scheduled before node " << SU->NodeNum << "\n";);
+    }
+  }
+
+  DEBUG({
+    if (!Valid)
+      dbgs() << "Invalid node order found!\n";
+  });
+}
+
 /// Attempt to fix the degenerate cases when the instruction serialization
 /// causes the register lifetimes to overlap. For example,
 ///   p' = store_pi(p, b)

Modified: llvm/trunk/test/CodeGen/Hexagon/SUnit-boundary-prob.ll
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/Hexagon/SUnit-boundary-prob.ll?rev=326925&r1=326924&r2=326925&view=diff
==============================================================================
--- llvm/trunk/test/CodeGen/Hexagon/SUnit-boundary-prob.ll (original)
+++ llvm/trunk/test/CodeGen/Hexagon/SUnit-boundary-prob.ll Wed Mar  7 10:53:36 2018
@@ -1,8 +1,12 @@
-; RUN: llc -march=hexagon -O2 -mcpu=hexagonv60  < %s | FileCheck %s
+; REQUIRES: asserts
+; RUN: llc -march=hexagon -O2 -mcpu=hexagonv60 --stats -o - 2>&1  < %s | FileCheck %s
 ; This was aborting while processing SUnits.
 
 ; CHECK: vmem
 
+; CHECK-NOT: Number of node order issues found
+; CHECK: Number of loops software pipelined
+; CHECK-NOT: Number of node order issues found
 source_filename = "bugpoint-output-bdb0052.bc"
 target datalayout = "e-m:e-p:32:32:32-a:0-n16:32-i64:64:64-i32:32:32-i16:16:16-i1:8:8-f32:32:32-f64:64:64-v32:32:32-v64:64:64-v512:512:512-v1024:1024:1024-v2048:2048:2048"
 target triple = "hexagon-unknown--elf"

Modified: llvm/trunk/test/CodeGen/Hexagon/frame-offset-overflow.ll
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/Hexagon/frame-offset-overflow.ll?rev=326925&r1=326924&r2=326925&view=diff
==============================================================================
--- llvm/trunk/test/CodeGen/Hexagon/frame-offset-overflow.ll (original)
+++ llvm/trunk/test/CodeGen/Hexagon/frame-offset-overflow.ll Wed Mar  7 10:53:36 2018
@@ -1,9 +1,14 @@
-; RUN: llc -march=hexagon < %s | FileCheck %s
+; REQUIRES: asserts
+; RUN: llc -march=hexagon --stats -o - 2>&1 < %s | FileCheck %s
 
-; In reality, check that the compilation succeeded and that some code was
-; generated.
+; Check that the compilation succeeded and that some code was generated.
 ; CHECK: vadd
 
+; Check that the loop is pipelined and that a valid node order is used.
+; CHECK-NOT: Number of node order issues found
+; CHECK: Number of loops software pipelined
+; CHECK-NOT: Number of node order issues found
+
 target triple = "hexagon"
 
 define void @fred(i16* noalias nocapture readonly %p0, i32 %p1, i32 %p2, i16* noalias nocapture %p3, i32 %p4) local_unnamed_addr #1 {

Modified: llvm/trunk/test/CodeGen/Hexagon/swp-vmult.ll
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/Hexagon/swp-vmult.ll?rev=326925&r1=326924&r2=326925&view=diff
==============================================================================
--- llvm/trunk/test/CodeGen/Hexagon/swp-vmult.ll (original)
+++ llvm/trunk/test/CodeGen/Hexagon/swp-vmult.ll Wed Mar  7 10:53:36 2018
@@ -4,8 +4,8 @@
 ; Multiply and accumulate
 ; CHECK: mpyi([[REG0:r([0-9]+)]],[[REG1:r([0-9]+)]])
 ; CHECK-NEXT: add(r{{[0-9]+}},#4)
-; CHECK-NEXT: [[REG0]] = memw(r{{[0-9]+}}+r{{[0-9]+}}<<#0)
-; CHECK-NEXT: [[REG1]] = memw(r{{[0-9]+}}+r{{[0-9]+}}<<#0)
+; CHECK-DAG: [[REG1]] = memw(r{{[0-9]+}}+r{{[0-9]+}}<<#0)
+; CHECK-DAG: [[REG0]] = memw(r{{[0-9]+}}+r{{[0-9]+}}<<#0)
 ; CHECK-NEXT: endloop0
 
 define i32 @foo(i32* %a, i32* %b, i32 %n) {

Modified: llvm/trunk/test/CodeGen/Hexagon/vect/vect-shuffle.ll
URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/Hexagon/vect/vect-shuffle.ll?rev=326925&r1=326924&r2=326925&view=diff
==============================================================================
--- llvm/trunk/test/CodeGen/Hexagon/vect/vect-shuffle.ll (original)
+++ llvm/trunk/test/CodeGen/Hexagon/vect/vect-shuffle.ll Wed Mar  7 10:53:36 2018
@@ -1,8 +1,12 @@
-; RUN: llc -march=hexagon -mcpu=hexagonv5 -disable-hsdr < %s | FileCheck %s
+; REQUIRES: asserts
+; RUN: llc -march=hexagon -mcpu=hexagonv5 -disable-hsdr --stats -o - 2>&1 < %s | FileCheck %s
 
 ; Check that store is post-incremented.
 ; CHECK-NOT: extractu(r{{[0-9]+}},#32,
 ; CHECK-NOT: insert
+; CHECK-NOT: Number of node order issues found
+; CHECK: Number of loops software pipelined
+; CHECK-NOT: Number of node order issues found
 target datalayout = "e-p:32:32:32-i64:64:64-i32:32:32-i16:16:16-i1:32:32-f64:64:64-f32:32:32-v64:64:64-v32:32:32-a0:0-n16:32"
 target triple = "hexagon"