[llvm] [AMDGPU] Add DS loop wait optimization infrastructure (1/4) (PR #171942)

Fri Dec 12 02:37:02 PST 2025

================
@@ -2643,6 +2670,85 @@ bool SIInsertWaitcnts::isVMEMOrFlatVMEM(const MachineInstr &MI) const {
   return SIInstrInfo::isVMEM(MI);
 }
 
+//===----------------------------------------------------------------------===//
+// DS Loop Wait Optimization (GFX12+)
+//
+// This optimization relaxes DS wait counts in single-block loops that have
+// many DS loads and WMMA/MFMA instructions (typical GEMM kernels with software
+// pipelining). Instead of waiting for almost all DS loads to complete before
+// each WMMA, we analyze which specific loads feed each WMMA and wait only for
+// those to complete, allowing more overlap between memory and compute.
+//
+// Opportunity arises when the load ordering in the preheader block and
+// the load ordering at the end of the loop body, feeding the loaded data
+// to the next iteration, are not matched well (since their orderings are
+// not co-optimized)
+//===----------------------------------------------------------------------===//
+
+bool SIInsertWaitcnts::isEligibleForDSLoopOpt(MachineLoop *ML,
+                                              LoopDSWaitOptInfo &Info) const {
+  if (!OptimizeDSLoopWaitcnt)
+    return false;
+
+  // Only for GFX12+ where we have a separate counter for LDS.
+  if (!ST->hasExtendedWaitCounts())
+    return false;
----------------
arsenm wrote:

```suggestion
  // Only for GFX12+ where we have a separate counter for LDS.
  assert(ST->hasExtendedWaitCounts());
```

The caller already checked this 

https://github.com/llvm/llvm-project/pull/171942