[PATCH] D50633: [AMDGPU] Add new Mode Register pass

Mon Nov 5 03:51:44 PST 2018

nhaehnle added a comment.

As an overall algorithmic remark: I like the organization of the pass into phases, because it provides a path forward to an additional optimization.

The current approach could be described as a straightforward fixed-point iteration over a lattice describing modes at each instruction that uses a mode. This is good and a perfectly fine approach.

However, consider the case of a loop that consistently uses a specific mode setting inside the loop, while code before the loop uses a different mode setting. The current code will introduce an s_setreg_imm32_b32 inside the loop, since the predecessor state at the beginning of the loop header is unknown. Instead, we could add an s_setreg_imm32_b32 at the end of the loop predecessor block.

The organization into separate passes allows a later modification of pass 2 & 3 into something more advanced that can take this possibility into account. I'm not saying that this optimization should necessarily be done with this change, but I just want people to be aware of it; also, it serves as an additional explanation for one of my inline comments :)

================
Comment at: lib/Target/AMDGPU/SIModeRegister.cpp:160-180
+// Determine the Mode register setting required for this instruction.
+// Instructions which don't use the Mode register return a null Status.
+// Note this currently only deals with instructions that use the floating point
+// double precision setting.
+Status SIModeRegister::getInstructionMode(MachineInstr &MI,
+                                          const SIInstrInfo *TII) {
+  if (TII->usesFPDPRounding(MI)) {
----------------
timcorringham wrote:
> arsenm wrote:
> > arsenm wrote:
> > > timcorringham wrote:
> > > > arsenm wrote:
> > > > > I'm not really comfortable inserting something semantically required at this point. Can you do this when the instructions are selected instead?
> > > > The problem with doing it during instruction selection is that we end up with many more mode register writes than are strictly required. As the mode register is not modelled as a register there isn't any way to track the values without a pass to do it, I suppose it is similar to adding nops or waitcnts, which are also done by specific passes.
> > > This is a semantic property and I think it really belongs in instruction selection. What is the problem with optimizing those out here? What actually changes?
> > Or even earlier, a property of the emitted operation
> An initial attempt at this functionality did insert necessary mode register writes at instruction selection, but that resulted in many more changes than were necessary as the state of the register isn't known at that point. In order to avoid some of  the changes all instructions that use the mode register would have to be updated to ensure the mode was appropriate - which was considered too invasive. A pass to remove unnecessary setregs would be possible, but would be very similar to this pass, but would still require changes to many instructions to insert the setregs, and would also provide extra overhead for all intervening passes.  There would have to be some dependence introduced between the mode register and the instructions that use it to ensure any rescheduling didn't break the code.
> Overall we thought that this approach was the best compromise. It solves the immediate problem with minimal overhead, and can be extended in a staged manner fairly easily. 
I tend to agree with Tim here. If we first emit setregs everywhere just to remove most of them later, the pass that removes them will look pretty much identical to this pass, while in the meantime slowing down code generation elsewhere, and possibly even pessimizing things (e.g. restricting machine scheduling).

================
Comment at: lib/Target/AMDGPU/SIModeRegister.cpp:90-92
+  // The Status that represents the mode register settings required on entry to
+  // this block. Calculated in Phase 1.
+  Status Require;
----------------
Technically not required on entry, but required at the FirstInsertionPoint.

================
Comment at: lib/Target/AMDGPU/SIModeRegister.cpp:119
+
+  std::vector<BlockData *> BlockInfo;
+  std::queue<MachineBasicBlock *> Phase2List;
----------------
Use smart pointers (unique_ptr should do the trick).

================
Comment at: lib/Target/AMDGPU/SIModeRegister.cpp:237-257
+      } else if (InsertionPoint) {
+        // We need a setreg at the InsertionPoint, but we defer the first in
+        // each block to Phase 3
+        if (RequirePending) {
+          NewInfo->Require = NewInfo->Change;
+          RequirePending = false;
+        } else {
----------------
I feel like this logic is more convoluted than necessary, and possibly even wrong / overly conservative in some cases because of that.

For example, why are you setting InsertionPoint also in the case where FirstInsertionPoint is set? Why are IPChange and NewInfo->Change never cleared? My intuition is that the logic should just be a delayed-update/insert pattern like this:
```
if (!NewInfo->FirstInsertionPoint) {
  NewInfo->FirstInsertionPoint = &MI;
} else if (!IPChange.isCompatible(InstrMode)) {
  if (InsertionPoint)
    insertSetReg(IPChange);
  else
    NewInfo->Require = IPChange;
  merge IPChange into NewInfo->Change and reset IPChange
  InsertionPoint = &MI;
}
merge InstrMode into IPChange
```
Then at the end of the basic block:
```
if (!InsertionPoint)
  NewInfo->Require = IPChange;
```
The RequirePending flag should simply be unnecessary.

================
Comment at: lib/Target/AMDGPU/SIModeRegister.cpp:272-276
+      if (InsertionPoint) {
+        // We need to insert a setreg at the InsertionPoint
+        insertSetreg(MBB, InsertionPoint, TII, IPChange.delta(NewInfo->Change));
+        InsertionPoint = nullptr;
+      }
----------------
It would be more intuitive to guard this by IPChange being "non-empty". In the InsertionPoint case, insert the SETREG; otherwise, set NewInfo->Require. In both cases, IPChange should be reset.

The overall logic can then be described as:
* NewInfo->Change describes the current status of the mode registers as we know it
* IPChange describes the pending mode changes that need to be applied at InsertionPoint (if non-null) or NewInfo->Require (otherwise)

================
Comment at: lib/Target/AMDGPU/SIModeRegister.cpp:296
+  BlockInfo[MBB.getNumber()] = NewInfo;
+  Phase2List.push(&MBB);
+}
----------------
Remove this from here, it isn't conceptually part of phase 1. Initialize that list as part of external driver code.

Repository:
  rL LLVM

https://reviews.llvm.org/D50633