[llvm] Co-issue packed instructions by unpacking (PR #151704)

Mon Aug 25 11:48:14 PDT 2025

================
@@ -225,6 +254,712 @@ bool GCNPreRAOptimizationsImpl::processReg(Register Reg) {
   return true;
 }
 
+bool GCNPreRAOptimizationsImpl::isUnpackingSupportedInstr(
+    MachineInstr &MI) const {
+  unsigned Opcode = MI.getOpcode();
+  switch (Opcode) {
+  case AMDGPU::V_PK_ADD_F32:
+  case AMDGPU::V_PK_MUL_F32:
+  case AMDGPU::V_PK_MUL_F16:
+  case AMDGPU::V_PK_ADD_F16:
+  case AMDGPU::V_PK_FMA_F32:
+    return true;
+
+  default:
+    return false;
+  }
+}
+
+uint16_t GCNPreRAOptimizationsImpl::mapToUnpackedOpcode(MachineInstr &I) {
+  unsigned Opcode = I.getOpcode();
+  // use 64 bit encoding to allow use of VOP3 instructions.
+  // VOP3 instructions allow VOP3P source modifiers to be translated to VOP3
+  // e32 instructions are VOP2 and don't allow source modifiers
+  switch (Opcode) {
+  case AMDGPU::V_PK_ADD_F32:
+    return AMDGPU::V_ADD_F32_e64;
+  case AMDGPU::V_PK_MUL_F32:
+    return AMDGPU::V_MUL_F32_e64;
+  case AMDGPU::V_PK_ADD_F16:
+    return AMDGPU::V_ADD_F16_e64;
+  case AMDGPU::V_PK_MUL_F16:
+    return AMDGPU::V_MUL_F16_e64;
+  case AMDGPU::V_PK_FMA_F32:
+    return AMDGPU::V_FMA_F32_e64;
+  default:
+    return std::numeric_limits<uint16_t>::max();
+  }
+}
+
+bool GCNPreRAOptimizationsImpl::createListOfPackedInstr(
+    MachineInstr &BeginMI, SetVector<MachineInstr *> &InstrsToUnpack,
+    uint16_t NumMFMACycles) {
+  auto *BB = BeginMI.getParent();
+  auto *MF = BB->getParent();
+  int NumInst = 0;
+
+  auto E = BB->end();
+
+  int TotalCyclesBetweenCandidates = 0;
+  auto SchedModel = TII->getSchedModel();
+  for (auto I = std::next(BeginMI.getIterator()); I != E; ++I) {
+    MachineInstr &Instr = *I;
+    const MCSchedClassDesc *InstrSchedClassDesc =
+        SchedModel.resolveSchedClass(&Instr);
+    TotalCyclesBetweenCandidates +=
----------------
jrbyrnes wrote:

This cycle-count model is inaccurate: it doesn't account for the latency actually observed between instructions. If there are dependencies among the instructions between the MFMA and the unpack candidate, we must account for the full latency of the instruction being depended on.

For example, if a load and a use of that load sit between the MFMA and the unpack candidate, the full latency of the load elapses between the MFMA and the candidate.

Another interesting case is when the unpack candidate itself uses a value produced by the MFMA. In that case the candidate must wait for the MFMA's result regardless.
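
To make the two cases concrete, here is a minimal standalone sketch, not LLVM code. `ToyInstr`, `naiveGap`, and `dependencyAwareGap` are hypothetical names, the model assumes strictly in-order single issue, and the cycle numbers are made up for illustration rather than taken from any hardware. It contrasts a plain per-instruction cycle sum (the shape of the loop quoted above) with an estimate that stalls each instruction until its operands are ready.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical toy instruction; this is NOT LLVM's MachineInstr.
struct ToyInstr {
  unsigned IssueCycles;          // cycles the issue slot is occupied
  unsigned Latency;              // cycles from issue until the result is ready
  std::vector<std::size_t> Deps; // indices of earlier instructions it reads
};

// Naive estimate: sum the issue cycles of the instructions strictly between
// the MFMA (index 0) and the candidate. Dependencies are ignored.
unsigned naiveGap(const std::vector<ToyInstr> &Seq, std::size_t Candidate) {
  unsigned Total = 0;
  for (std::size_t I = 1; I < Candidate; ++I)
    Total += Seq[I].IssueCycles;
  return Total;
}

// Dependency-aware estimate: instructions issue in order, and each one also
// waits until every producer it reads has completed. Returns the number of
// cycles between the end of the MFMA's issue and the candidate's issue.
// Assumes 1 <= Candidate < Seq.size() and that every Deps entry refers to an
// earlier index.
unsigned dependencyAwareGap(const std::vector<ToyInstr> &Seq,
                            std::size_t Candidate) {
  std::vector<unsigned> Start(Candidate + 1, 0);
  unsigned NextIssue = 0; // earliest cycle the issue slot is free
  for (std::size_t I = 0; I <= Candidate; ++I) {
    unsigned S = NextIssue;
    for (std::size_t D : Seq[I].Deps)
      S = std::max(S, Start[D] + Seq[D].Latency); // stall on the producer
    Start[I] = S;
    NextIssue = S + Seq[I].IssueCycles;
  }
  return Start[Candidate] - (Start[0] + Seq[0].IssueCycles);
}
```

With the illustrative sequence MFMA `{1, 16, {}}`, load `{1, 100, {}}`, use of the load `{1, 1, {1}}`, candidate `{1, 1, {}}`, the naive sum sees 2 cycles between the MFMA and the candidate, while the dependency-aware estimate sees 101, because the use stalls on the load. With just MFMA `{1, 16, {}}` followed by a candidate `{1, 1, {0}}` that reads the MFMA result, the naive sum is 0, yet the candidate cannot issue until 15 cycles after the MFMA has issued.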

https://github.com/llvm/llvm-project/pull/151704