[PATCH] D11885: AMDGPU/SI: Add SI Machine Scheduler

Tue Dec 22 01:52:51 PST 2015

axeldavy added a comment.

I agree with you doing with blocks is not optimal.
However it is one way to make the problem simpler to the scheduler and achieve better results.

Ideally some analysis could be used, as you say, to make some good decisions for the scheduling.
However I couldn't find any such good analysis pass yet, and how it could be used.

Grouping the instructions together is not enforced by this scheduler. In fact, one can have a variant that does one instruction == one block,
however so far it gives worse results, sign that this helps the second part of the schedule doing better decisions.

The thing that gave me the idea of blocks is that when you look at the original TGSI you can easily in your mind make some sub-blocks of instruction, and place the high latency instructions futher away form the users and come up with a schedule. I hoped that pass would rebuild such blocks easy to manipulate.

One thing hard during the schedule is when you have ready for schedule a lot of different instructions that all increase register usage.
Which instructions should you schedule now, in order to be able to contain register usage and decrease its usage soon ?
One solution would be to look at a lot of possible schedule order for the k next iterations and take the best one, but that's expensive.
The problem is very likely NP-hard.

Doing with blocks is a way to make it simpler for the scheduler: a relatively good decision has already been made for what are the k next instructions after each one.

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.cpp:38
@@ +37,3 @@
+// you are interested in to finish.
+// . The less the register pressure, the best load latencies are hidden
+//
----------------
nhaehnle wrote:
> What precisely do you mean by this last point?
This was a try to explain that less registers used => latencies are better hidden (better wavecounts)

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.cpp:42-45
@@ +41,6 @@
+// have few dependencies) makes the generic scheduler have some unpredictable
+// behaviours. For example when register pressure becomes high, it can either
+// manage to prevent register pressure from going too high, or it can
+// increase register pressure even more than if it hadn't taken register
+// pressure into account.
+//
----------------
nhaehnle wrote:
> This is hard to understand. Do you have some example of what you mean exactly, and an explanation of why it happens in the generic scheduler?
Imagine you have 100 instructions v_mov REG, CONSTANT, and one VMEM instruction loading two registers.
If you run out of registers and the default scheduler decides to try reduce register usage,
it will schedule all the v_mov instructions first before the VMEM instruction, because it doesn't increase reg usage as much.

If everything else depends on that VMEM, that results in hugh vgpr usage increase.

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.cpp:473
@@ +472,3 @@
+  for (SDep& Succ : SU->Succs) {
+    releaseSucc(SU, &Succ, InOrOutBlock);
+  }
----------------
nhaehnle wrote:
> Since releaseSucc() is only used here, the logic of whether to actually release or not should be moved out of it and into this loop. It is confusing to have a method releaseXxx which does not actually always release Xxx.
> 
> The same actually holds for the releaseSuccessors method as a whole. This would be helped a lot if the parameter were an enum if descriptive names, such that the meaning of the parameter is clear at call sites.
I don't get the second part of the comment ? Should I replace the boolean by an enum ?

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.cpp:827
@@ +826,3 @@
+
+void SIScheduleBlockCreator::colorEndsAccordingToDependencies() {
+  unsigned DAGSize = DAG->SUnits.size();
----------------
nhaehnle wrote:
> What is this method supposed to do? Which nodes are affected by it?
This is for nodes which have 
CurrentBottomUpReservedDependencyColoring[SU->NodeNum] == 
CurrentTopDownReservedDependencyColoring[SU->NodeNum] == 0

for example if we have (A high lat instruction)
A = op X2
X3 = op A X1

X1 doesn't have any high latency instruction dependency in both ways.

Without this function, all instructions in X1 case are put in the same block.
The pass puts them in different blocks, according to the set of dependencies in the already given colors.

================
Comment at: lib/Target/AMDGPU/SIMachineScheduler.h:226-230
@@ +225,7 @@
+
+  // 0 -> Color not given.
+  // 1 to SUnits.size() -> Reserved group (you should only add elements to them).
+  // Above -> Other groups.
+  int NextReservedID;
+  int NextNonReservedID;
+  std::vector<int> CurrentColoring;
----------------
nhaehnle wrote:
> I do not understand the semantics of reserved vs. non-reserved. Why the distinction? What does it mean?
Reserved: You can only add elements
Non-Reserved: you can change the color of all the elements if you wish.

If you have better names, don't hesitate...

http://reviews.llvm.org/D11885