[PATCH] D81085: [AMDGPU/MemOpsCluster] Implement new heuristic for computing max mem ops cluster size

Wed Jun 3 04:54:03 PDT 2020

hsmhsm created this revision.
hsmhsm added reviewers: foad, rampitec, arsenm, vpykhtin.
Herald added subscribers: llvm-commits, kerbowa, hiraditya, t-tye, Anastasia, tpr, dstuttard, yaxunl, nhaehnle, wdng, jvesely, kzhuravl.
Herald added a project: LLVM.
hsmhsm edited the summary of this revision.
hsmhsm edited the summary of this revision.
hsmhsm edited the summary of this revision.
hsmhsm edited the summary of this revision.

This patch is created so that reviewers can review the implementation of new heuristic for computing max mem ops cluster size, and reserve their judgment. Here are key points:

(1) This patch results in following 12 unit test regressions

- LLVM :: CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.div.fmas.ll
- LLVM :: CodeGen/AMDGPU/amdhsa-trap-num-sgprs.ll
- LLVM :: CodeGen/AMDGPU/global-saddr.ll
- LLVM :: CodeGen/AMDGPU/insert_vector_elt.ll
- LLVM :: CodeGen/AMDGPU/kernel-args.ll
- LLVM :: CodeGen/AMDGPU/memory_clause.ll
- LLVM :: CodeGen/AMDGPU/promote-constOffset-to-imm.ll
- LLVM :: CodeGen/AMDGPU/salu-to-valu.ll
- LLVM :: CodeGen/AMDGPU/sgpr-control-flow.ll
- LLVM :: CodeGen/AMDGPU/shift-i128.ll
- LLVM :: CodeGen/AMDGPU/store-weird-sizes.ll
- LLVM :: CodeGen/AMDGPU/trunc-store-i64.ll

(2) OpenCL SHOC benchmark test - `DGEMM-T` shows 4.6% regression.

(3) All other SHOC benchmark benchmark tests looks fine.

(4) However this patch is required to addrress the performance issue raised in the ticket SWDEV-231869.

(5) We need to take go or no-go decision. Hence, created this review patch, so that reviewers can provide their feedback or some ideas about further tuning of heuristic if possible.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D81085

Files:
  llvm/lib/Target/AMDGPU/SIInstrInfo.cpp


Index: llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
===================================================================

--- llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -467,57 +467,11 @@
   if (!memOpsHaveSameBasePtr(FirstLdSt, BaseOps1, SecondLdSt, BaseOps2))
     return false;
 
-  const MachineOperand *FirstDst = nullptr;
-  const MachineOperand *SecondDst = nullptr;
-
-  if ((isMUBUF(FirstLdSt) && isMUBUF(SecondLdSt)) ||
-      (isMTBUF(FirstLdSt) && isMTBUF(SecondLdSt)) ||
-      (isFLAT(FirstLdSt) && isFLAT(SecondLdSt))) {
-    const unsigned MaxGlobalLoadCluster = 7;
-    if (NumLoads > MaxGlobalLoadCluster)
-      return false;
-
-    FirstDst = getNamedOperand(FirstLdSt, AMDGPU::OpName::vdata);
-    if (!FirstDst)
-      FirstDst = getNamedOperand(FirstLdSt, AMDGPU::OpName::vdst);
-    SecondDst = getNamedOperand(SecondLdSt, AMDGPU::OpName::vdata);
-    if (!SecondDst)
-      SecondDst = getNamedOperand(SecondLdSt, AMDGPU::OpName::vdst);
-  } else if (isSMRD(FirstLdSt) && isSMRD(SecondLdSt)) {
-    FirstDst = getNamedOperand(FirstLdSt, AMDGPU::OpName::sdst);
-    SecondDst = getNamedOperand(SecondLdSt, AMDGPU::OpName::sdst);
-  } else if (isDS(FirstLdSt) && isDS(SecondLdSt)) {
-    FirstDst = getNamedOperand(FirstLdSt, AMDGPU::OpName::vdst);
-    SecondDst = getNamedOperand(SecondLdSt, AMDGPU::OpName::vdst);
-  }
-
-  if (!FirstDst || !SecondDst)
-    return false;
-
-  // Try to limit clustering based on the total number of bytes loaded
-  // rather than the number of instructions.  This is done to help reduce
-  // register pressure.  The method used is somewhat inexact, though,
-  // because it assumes that all loads in the cluster will load the
-  // same number of bytes as FirstLdSt.
-
-  // The unit of this value is bytes.
-  // FIXME: This needs finer tuning.
-  unsigned LoadClusterThreshold = 16;
-
-  const MachineRegisterInfo &MRI =
-      FirstLdSt.getParent()->getParent()->getRegInfo();
-
-  const Register Reg = FirstDst->getReg();
-
-  const TargetRegisterClass *DstRC = Register::isVirtualRegister(Reg)
-                                         ? MRI.getRegClass(Reg)
-                                         : RI.getPhysRegClass(Reg);
-
-  // FIXME: NumLoads should not be subtracted 1. This is to match behavior
-  // of clusterNeighboringMemOps which was previosly passing cluster length
-  // less 1. LoadClusterThreshold should be tuned instead.
-  return ((NumLoads - 1) * (RI.getRegSizeInBits(*DstRC) / 8)) <=
-         LoadClusterThreshold;
+  assert((NumLoads > 0) && (NumBytes > 0) && (NumBytes >= NumLoads) &&
+         "Invalid NumLoads/NumBytes values");
+  unsigned AverageBytesPerLoad = NumBytes / NumLoads;
+  unsigned MaxNumLoads = AverageBytesPerLoad <= 4 ? 5 : 4;
+  return NumLoads <= MaxNumLoads;
 }
 
 // FIXME: This behaves strangely. If, for example, you have 32 load + stores,


-------------- next part --------------
A non-text attachment was scrubbed...
Name: D81085.268144.patch
Type: text/x-patch
Size: 2894 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20200603/2ebf267f/attachment.bin>