[PATCH] D30038: [ADMGPU] SDWA peephole optimization pass.

Fri Feb 17 03:17:05 PST 2017

SamWot added inline comments.

================
Comment at: lib/Target/AMDGPU/AMDGPUTargetMachine.cpp:672-675
+  if (EnableSDWAPeephole) {
+    addPass(&SIPeepholeSDWAID);
+    addPass(&DeadMachineInstructionElimID);
+  }
----------------
arsenm wrote:
> This seems late to me, and also introduces a 3rd run of DeadMachineInstructionElim. I would expect to run this right after SIFoldOperands. You have code dealing with VOPC/vcc required uses, but that is of limited help right after the pre-RA run. I consider it a bug that we see any of those at all at this point. 
SDWA peephole extensively use results of SIShrinkInstructions pass. I considered placing peephole after SiFoldOperands but at that point we still have all instructions in 64-bit encoding - but they can't be encoded as SDWA. So SDWA pass would in many ways repeat work done by SiShrinkInstruction. Possibly we can move both shrink instructions and SDWA peephole earlier.

================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:59
+
+  typedef SmallVector<std::shared_ptr<SDWAOperand>, 4> SDWAOperandsVector;
+
----------------
arsenm wrote:
> Should not need shared_ptr, unique_ptr is enough. The struct is also small enough to store values of
I can't store by value because it would lead to type slicing.

================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:174
+
+bool isSameBB(const MachineInstr *FirstMI, const MachineInstr *SecondMI) {
+  assert(FirstMI && SecondMI);
----------------
arsenm wrote:
> This function shouldn't be necessary. I think all you want to do is check the blocks have the same parent? You shouldn't be seeing situations where you're seeing instructions from multiple functions
Yes, this is overkill but if I check toher thing here then why not to check that functions are also same, just for confidence?

================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:179
+  assert(FirstBB && SecondBB);
+  const MachineFunction *FirstFN = FirstBB->getParent();
+  const MachineFunction *SecondFN = SecondBB->getParent();
----------------
rampitec wrote:
> Looks like you only need FirstBB == SecondBB.
This should work but I don't want to campare pointers. If internal structure of MachineFunction changes later then this would not work. Also basic block number is exactly what we need to check if this is same basic block - then why not to use it?
What I should really fix is that  `FirstMI->getParent()->getNumber();` should be `FirstBB->getNumber()`.

================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:221
+    for (MachineOperand Def: ParentMI->defs()) {
+      for (MachineInstr &PotentialMI: MRI->use_instructions(Def.getReg())) {
+        // Check if this instructions are in same basic block
----------------
arsenm wrote:
> I think this is too aggressive. I would expect to restrict the number of uses to fold. In a typical case if the shift has more than one use, you are increasing code size. I would also expect a legality check here, because if there is an unfoldable user, there's no point in folding any of them.
I agree that we should limit number of uses. This is what I wanted to say with "Introduce mnemonics to limit when SDWA patterns should apply".
One thing that I don't know is if one instruction uses register twice would it appear in `use_instructions()` twice ore single time.

================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:466
+  if (!ST.hasSDWA() ||
+      !AMDGPU::isVI(ST)) { // FIXME: We don't support SDWA anywhere other than VI
+    return false;
----------------
rampitec wrote:
> arsenm wrote:
> > This check is unnecessary?
> ```
> >= VOLCANIC_ISLANDS?
> ```
We don't yet support SDWA on gfx9. 

================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:479
+    SDWAOperands.clear();
+    matchSDWAOperands(MBB);
+
----------------
arsenm wrote:
> A better name for this might be collectSDWABitExtracts? I also think you should try folding the sources into the uses in the same loop. Once you see one of these defs, it's use is going to be later, so you don't need to collect every possible use in the block ahead of time.
> 
> You could also look at the sources for SDWA convertible instructions instead of looking at this from the other direction.
This method collects not only BitExtracts but also BitInserts.

Splitting processes of finding SDWA patterns and folding sources allows to generalize finding patterns not only for source operands but also for destination operands. With that we can fold whole bunch of operands into instruction instead of folding instructions. Also I think this split makes pass much more clear and maintainable.

https://reviews.llvm.org/D30038