[PATCH] D30038: [AMDGPU] SDWA peephole optimization pass.

Matt Arsenault via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Feb 16 10:26:52 PST 2017


arsenm added inline comments.


================
Comment at: lib/Target/AMDGPU/AMDGPUTargetMachine.cpp:672-675
+  if (EnableSDWAPeephole) {
+    addPass(&SIPeepholeSDWAID);
+    addPass(&DeadMachineInstructionElimID);
+  }
----------------
This seems late to me, and also introduces a 3rd run of DeadMachineInstructionElim. I would expect to run this right after SIFoldOperands. You have code dealing with VOPC/vcc required uses, but that is of limited help right after the pre-RA run. I consider it a bug that we see any of those at all at this point. 


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:23
+
+#include <unordered_map>
+
----------------
Should be included last. 


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:59
+
+  typedef SmallVector<std::shared_ptr<SDWAOperand>, 4> SDWAOperandsVector;
+
----------------
Should not need shared_ptr; unique_ptr is enough. The struct is also small enough to store by value.


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:137
+
+inline raw_ostream& operator<<(raw_ostream &OS, const SdwaSel &Sel) {
+  switch(Sel) {
----------------
The usual style is static functions rather than an anonymous namespace.


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:163
+  OS << "SDWA src:" << *Src.getTargetOperand()
+     << " src_sel:" << Src.getSrcSel() << "\n";
+  return OS;
----------------
Single quotes


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:170
+     << " dst_sel:" << Dst.getDstSel()
+     << " dst_unused:" << Dst.getDstUnused() << "\n";
+  return OS;
----------------
ditto


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:174
+
+bool isSameBB(const MachineInstr *FirstMI, const MachineInstr *SecondMI) {
+  assert(FirstMI && SecondMI);
----------------
This function shouldn't be necessary. I think all you want to do is check that the blocks have the same parent? You shouldn't be seeing instructions from multiple functions.


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:205
+
+MachineRegisterInfo *SDWAOperand::getMRI() const {
+  if (MachineInstr *MI = getParentInst()) {
----------------
I think this should be removed. There should be no cases where MRI is unavailable, so the null checks here and in its callers are unnecessary. It would be better to just pass around the one MRI reference.


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:221
+    for (MachineOperand Def: ParentMI->defs()) {
+      for (MachineInstr &PotentialMI: MRI->use_instructions(Def.getReg())) {
+        // Check if these instructions are in the same basic block
----------------
I think this is too aggressive. I would expect to restrict the number of uses to fold: in the typical case, if the shift has more than one use, folding increases code size. I would also expect a legality check here, because if there is an unfoldable user, there's no point in folding any of them.


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:235
+  // target operand. Set corresponding src_sel
+  for (MachineOperand &Operand: MI.explicit_uses()) {
+    if (!Operand.isReg() || Operand.getReg() != getSourceOperand()->getReg()) {
----------------
I think you're going through too much trouble to loop over explicit_uses and then figuring out the operand index. In similar places we just directly query src0/src1


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:305-307
+  // Remove the original instruction because it would conflict with our new
+  // instruction's register definition
+  getParentInst()->eraseFromParent();
----------------
You could create a new virtual register and replace all uses with it. If you are going through the trouble of deleting the leftover instructions yourself, there's no need to run dead-MI elimination again.
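A hedged sketch of that suggestion against the MachineRegisterInfo API (not compilable standalone; assumes `MRI`, `SDWAInst`, and `ParentMI` are in scope, and that the def is operand 0):

```cpp
// Give the SDWA result its own virtual register, rewrite every use of
// the old def to it, then delete the now-dead original instruction
// directly -- no extra DeadMachineInstructionElim run required.
unsigned OldReg = ParentMI->getOperand(0).getReg();
unsigned NewReg = MRI->createVirtualRegister(MRI->getRegClass(OldReg));
SDWAInst->getOperand(0).setReg(NewReg);
MRI->replaceRegWith(OldReg, NewReg);
ParentMI->eraseFromParent();
```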


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:312-314
+  // FIXME: It is possible to simplify these patterns because this pass runs
+  // after SIShrinkInstructions => we can get rid of _e64 instructions and we
+  // don't have to check for VGPRs
----------------
This reasoning seems backwards to me. If you run before SIShrinkInstructions you shouldn't need to worry about the e32 case (which is part of why we try to keep things in VOP3 form until necessary)


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:330-333
+      MachineOperand *Dst =  TII->getNamedOperand(MI, AMDGPU::OpName::vdst);
+      if (!Dst->isReg() || !TRI->isVGPR(*MRI, Dst->getReg()))
+        break;
+
----------------
You don't need to check this. If the dst were not a VGPR the instruction would be broken and would fail the verifier.


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:342
+    case AMDGPU::V_LSHLREV_B32_e32:
+    case AMDGPU::V_LSHLREV_B32_e64: {
+      // from: v_lshlrev_b32_e32 v1, 16, v0
----------------
For a next-round patch I would expect an `and x, bitmask` to select the low half as well.


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:345-347
+      MachineOperand *Dst = TII->getNamedOperand(MI, AMDGPU::OpName::vdst);
+      if (!Dst->isReg() || !TRI->isVGPR(*MRI, Dst->getReg()))
+        break;
----------------
Ditto


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:369
+
+bool SIPeepholeSDWA::isConvertibleToSDWA(MachineInstr &MI) {
+  // Check if this instruction can be converted to SDWA:
----------------
I would expect to consolidate this with convertToSDWA and return false if it can't be converted


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:407-410
+  assert(
+    Src0 &&
+    AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::src0) != -1 &&
+    AMDGPU::getNamedOperandIdx(SDWAOpcode, AMDGPU::OpName::src0_modifiers) != -1);
----------------
I think these asserts are overly aggressive


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:451
+  // Apply all SDWA operand patterns
+  for (std::shared_ptr<SDWAOperand> Operand: SDWAOperands) {
+    Operand->convertToSDWA(*SDWAInst, TII);
----------------
This should be a const reference; copying a shared_ptr is pretty expensive.


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:455-456
+
+  dbgs() << "Convert instruction:" << MI;
+  dbgs() << "Into:" << *SDWAInst << '\n';
+  ++NumSDWAInstructionsPeepholed;
----------------
This should be a single dbgs() statement, and it also needs to be wrapped in DEBUG().


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:466
+  if (!ST.hasSDWA() ||
+      !AMDGPU::isVI(ST)) { // FIXME: We don't support SDWA anywhere other than VI
+    return false;
----------------
This check is unnecessary?


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:476
+
+  // FIXME: For now we only combine instructions in one basic block
+  for (MachineBasicBlock &MBB: MF) {
----------------
This doesn't seem like a problem to me. I would expect combinable patterns to be sunk down earlier


================
Comment at: lib/Target/AMDGPU/SIPeepholeSDWA.cpp:479
+    SDWAOperands.clear();
+    matchSDWAOperands(MBB);
+
----------------
A better name for this might be collectSDWABitExtracts? I also think you should try folding the sources into the uses in the same loop. Once you see one of these defs, its uses are going to come later, so you don't need to collect every possible use in the block ahead of time.

You could also look at the sources for SDWA convertible instructions instead of looking at this from the other direction.


================
Comment at: test/CodeGen/AMDGPU/sdwa-peephole.ll:1
+; RUN: llc -march=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -check-prefix=NOSDWA -check-prefix=FUNC %s
+; RUN: llc -march=amdgcn -mcpu=fiji --amdgpu-sdwa-peephole -verify-machineinstrs < %s | FileCheck -check-prefix=SDWA -check-prefix=FUNC %s
----------------
You can replace FUNC with GCN; this will never be shared with R600.


================
Comment at: test/CodeGen/AMDGPU/sdwa-peephole.ll:59
+
+define void @vector2x16(<2 x i16> addrspace(1)* %out, <2 x i16> addrspace(1)* %ina, <2 x i16> addrspace(1)* %inb) {
+entry:
----------------
The naming convention in other tests is to include the operation in the name, in this case mul, and then suffix it with the LLVM type, _v2i16.


================
Comment at: test/CodeGen/AMDGPU/sdwa-peephole.ll:66
+  ret void
+}
----------------
This needs a lot more tests. I would like to see a range of tests using half as well. These tests also aren't stressing some of the illegal cases you check for, like SGPR operands. There should also be multiple-use tests, with some foldable and unfoldable uses.

There's also a lot of code for checking parent blocks, but these tests all only have one block.


https://reviews.llvm.org/D30038




