[PATCH] D11885: AMDGPU/SI: Add SI Machine Scheduler

Nicolai Hähnle via llvm-commits llvm-commits at lists.llvm.org
Mon Dec 21 18:53:56 PST 2015


nhaehnle added a subscriber: nhaehnle.
nhaehnle added a comment.

I've tried to understand what this scheduler does. As a general comment, I think it would be useful for the review if you decided upon a single variant for each of the steps of the scheduler. Having different variants makes things harder to follow.

I do like the idea of explicitly keeping track of VMEM instructions, since they have both the highest latency and the best latency-mitigation tool. I have focused on the blocks so far, and I have to say I'm not convinced that they are the right way to go. Some high-level thoughts on the topic:

1. As a general approach that would fit better into the MachineSchedStrategy framework as I understand it, how about bottom-up scheduling in which, after you schedule the instruction that consumes the result of a VMEM load, you keep track of how many other instructions (or better: cycles) you have already scheduled to cover the latency, and use that in decision making.

1b) There will be dependency chains between VMEMs (dependent texture loads in graphics, more general patterns in compute). For such dependency chains, it seems like a good idea to do an analysis prepass (as part of the scheduler) which collects information such as how many DAG nodes lie strictly between two dependent VMEM loads. This kind of information can be useful: consider a dependent texture load with a single dependency step in a fragment shader that looks like this: (interp) -> image -> (long calculation) -> image -> (short calculation) -> exp. The fact that one of these calculations is short and the other long can be obtained by a DAG traversal similar to your coloring, and it can be used to decide that the second image instruction should not be scheduled too early in program order (that is, too late in bottom-up scheduling order).

2. I've noticed while playing around that the blocks are of very different sizes. This alone made me a bit suspicious.

3. The blocks are a form of very early decision making, and as a general rule, very early decision making tends to lead to bad results in optimization, unless it is done very well.

None of this is necessarily the final nail in the coffin for the blocks idea. I'd just like to see a better justification why scheduling at an instruction level is really insufficient. (I do think that some bigger-picture early analysis of the DAG to take latency into account is likely to be very useful.)


================
Comment at: lib/Target/AMDGPU/AMDGPUTargetMachine.cpp:25
@@ -24,2 +24,3 @@
 #include "SIInstrInfo.h"
+#include "SIMachineScheduler.h"
 #include "llvm/Analysis/Passes.h"
----------------
To keep recompile times lower, it would be good to get rid of this #include; AFAICT the only dependency is createSIMachineScheduler, which can be moved to SIMachineScheduler.cpp (and declared in AMDGPU.h).

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.cpp:38
@@ +37,3 @@
+// you are interested in to finish.
+// . The less the register pressure, the best load latencies are hidden
+//
----------------
What precisely do you mean by this last point?

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.cpp:42-45
@@ +41,6 @@
+// have few dependencies) makes the generic scheduler have some unpredictable
+// behaviours. For example when register pressure becomes high, it can either
+// manage to prevent register pressure from going too high, or it can
+// increase register pressure even more than if it hadn't taken register
+// pressure into account.
+//
----------------
This is hard to understand. Do you have an example of what you mean exactly, and an explanation of why it happens in the generic scheduler?

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.cpp:128
@@ +127,3 @@
+
+const char *getReasonStr(SIScheduleCandReason Reason) {
+  switch (Reason) {
----------------
This function should be static.

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.cpp:473
@@ +472,3 @@
+  for (SDep& Succ : SU->Succs) {
+    releaseSucc(SU, &Succ, InOrOutBlock);
+  }
----------------
Since releaseSucc() is only used here, the logic of whether to actually release or not should be moved out of it and into this loop. It is confusing to have a method releaseXxx which does not actually always release Xxx.

The same actually holds for the releaseSuccessors method as a whole. This would be helped a lot if the parameter were an enum with descriptive names, so that the meaning of the parameter is clear at call sites.

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.cpp:717-753
@@ +716,39 @@
+
+  for (unsigned i = 0, e = DAGSize; i != e; ++i) {
+    SUnit *SU = &DAG->SUnits[DAG->TopDownIndex2SU[i]];
+    std::set<unsigned> SUColors;
+
+    // Already given.
+    if (CurrentColoring[SU->NodeNum]) {
+      CurrentTopDownReservedDependencyColoring[SU->NodeNum] =
+        CurrentColoring[SU->NodeNum];
+      continue;
+    }
+
+   for (SDep& PredDep : SU->Preds) {
+      SUnit *Pred = PredDep.getSUnit();
+      if (PredDep.isWeak() || Pred->NodeNum >= DAGSize)
+        continue;
+      if (CurrentTopDownReservedDependencyColoring[Pred->NodeNum] > 0)
+        SUColors.insert(CurrentTopDownReservedDependencyColoring[Pred->NodeNum]);
+    }
+    // Color 0 by default.
+    if (SUColors.empty())
+      continue;
+    // Same color than parents.
+    if (SUColors.size() == 1 && *SUColors.begin() > DAGSize)
+      CurrentTopDownReservedDependencyColoring[SU->NodeNum] =
+        *SUColors.begin();
+    else {
+      std::map<std::set<unsigned>, unsigned>::iterator Pos =
+        ColorCombinations.find(SUColors);
+      if (Pos != ColorCombinations.end()) {
+          CurrentTopDownReservedDependencyColoring[SU->NodeNum] = Pos->second;
+      } else {
+        CurrentTopDownReservedDependencyColoring[SU->NodeNum] =
+          NextNonReservedID;
+        ColorCombinations[SUColors] = NextNonReservedID++;
+      }
+    }
+  }
+
----------------
This algorithm will give different colors to SUnits with different dependencies, but it does not always give the same color to SUnits with the same dependencies. Consider:

High latency instructions A, B, C, followed by:
 X1 = op A, B (has predecessors A, B, gets new color N)
 X2 = op X1, C (has predecessor colors N, C, gets new color N + 1)
 Y1 = op A, C (has predecessors A, C, gets new color N + 2)
 Y2 = op Y1, B (has predecessor colors N + 2, B, gets new color N + 3)
X2 and Y2 have the same set of high latency predecessors, but are colored differently.

Not sure whether this is critical, but it's something to keep in mind.

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.cpp:757
@@ +756,3 @@
+
+  // Same than before, but BottomUp.
+
----------------
Grammar: same *as* before

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.cpp:800
@@ +799,3 @@
+  unsigned DAGSize = DAG->SUnits.size();
+  std::map<std::set<unsigned>, unsigned> ColorCombinations;
+
----------------
Use std::pair instead of std::set here.

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.cpp:827
@@ +826,3 @@
+
+void SIScheduleBlockCreator::colorEndsAccordingToDependencies() {
+  unsigned DAGSize = DAG->SUnits.size();
----------------
What is this method supposed to do? Which nodes are affected by it?

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.cpp:942
@@ +941,3 @@
+
+void SIScheduleBlockCreator::colorMergeIfPossibleNextGroupOnlyForReserved() {
+  unsigned DAGSize = DAG->SUnits.size();
----------------
This method and the two above it are extremely similar. It should be possible to merge them somehow.

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.h:190
@@ +189,3 @@
+
+  void NodeScheduled(SUnit *SU);
+  void tryCandidateTopDown(SISchedCandidate &Cand, SISchedCandidate &TryCand);
----------------
Style: first letter of method name should be lower case

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.h:207
@@ +206,3 @@
+    LatenciesAlone,
+    LatenciesGroupped,
+    LatenciesAlonePlusConsecutive
----------------
Spelling: Grouped

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.h:226-230
@@ +225,7 @@
+
+  // 0 -> Color not given.
+  // 1 to SUnits.size() -> Reserved group (you should only add elements to them).
+  // Above -> Other groups.
+  int NextReservedID;
+  int NextNonReservedID;
+  std::vector<int> CurrentColoring;
----------------
I do not understand the semantics of reserved vs. non-reserved. Why the distinction? What does it mean?

================
Comment at: lib/Target/AMDGPU/SIRegisterInfo.cpp:669-684
@@ +668,18 @@
+
+unsigned SIRegisterInfo::getWaveFrontsForUsage(AMDGPUSubtarget::Generation gen,
+                                               unsigned SGPRsUsed,
+                                               unsigned VGPRsUsed) const {
+  unsigned i;
+  const unsigned *SGPRsForWaveFrontsForGen =
+    gen >= AMDGPUSubtarget::VOLCANIC_ISLANDS ?
+      &SGPRsForWaveFrontsVI[0] : &SGPRsForWaveFronts[0];
+
+  for (i = 9; i > 0; --i) {
+    if (SGPRsForWaveFrontsForGen[i] >= SGPRsUsed &&
+        VGPRsForWaveFronts[i] >= VGPRsUsed)
+      break;
+  }
+
+  return i + 1;
+}
+
----------------
I believe this method is unused.

================
Comment at: lib/Target/AMDGPU/SIRegisterInfo.h:28-30
@@ -27,1 +27,5 @@
 private:
+  unsigned SGPRsForWaveFronts[10];
+  unsigned SGPRsForWaveFrontsVI[10];
+  unsigned VGPRsForWaveFronts[10];
+  unsigned SGPR32SetID;
----------------
Perhaps a constant or enum could be used here instead of the magic 10.


http://reviews.llvm.org/D11885




