[llvm] [AMDGPU][SIInsertWaitCnts] Gfx12.5 - Refactor xcnt optimization (PR #164357)

Tue Nov 4 11:31:23 PST 2025

================
@@ -2160,19 +2168,11 @@ bool SIInsertWaitcnts::generateWaitcnt(AMDGPU::Waitcnt Wait,
                       << "Update Instr: " << *It);
   }
 
-  // XCnt may be already consumed by a load wait.
-  if (Wait.XCnt != ~0u) {
-    if (Wait.KmCnt == 0 && !ScoreBrackets.hasPendingEvent(SMEM_GROUP))
-      Wait.XCnt = ~0u;
-
-    if (Wait.LoadCnt == 0 && !ScoreBrackets.hasPendingEvent(VMEM_GROUP))
-      Wait.XCnt = ~0u;
-
-    // Since the translation for VMEM addresses occur in-order, we can skip the
-    // XCnt if the current instruction is of VMEM type and has a memory
-    // dependency with another VMEM instruction in flight.
-    if (isVmemAccess(*It))
-      Wait.XCnt = ~0u;
+  // Since the translation for VMEM addresses occur in-order, we can skip the
+  // XCnt if the current instruction is of VMEM type and has a memory
+  // dependency with another VMEM instruction in flight.
+  if (Wait.XCnt != ~0u && isVmemAccess(*It)) {
+    Wait.XCnt = ~0u;
----------------
RyanRio wrote:

Maybe this could be simplified further by moving it into the updateByEvent pathway...

```
  } else if (T == X_CNT) {
    WaitEventType OtherEvent = E == SMEM_GROUP ? VMEM_GROUP : SMEM_GROUP;
    if (PendingEvents & (1 << OtherEvent)) {
      // Hardware inserts an implicit xcnt between interleaved
      // SMEM and VMEM operations. So there will never be
      // outstanding address translations for both SMEM and
      // VMEM at the same time.
      setScoreLB(T, getScoreUB(T) - 1);
      PendingEvents &= ~(1 << OtherEvent);
    }
    for (const MachineOperand &Op : Inst.all_uses())
      setScoreByOperand(&Inst, Op, T, CurrScore);
  }
```
With that in place, there's really no need for the `Wait.XCnt != ~0u` check.
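
To illustrate the bookkeeping the snippet above relies on, here is a minimal standalone sketch. It is a toy model, not the real `WaitcntBrackets` class: `ToyEvent`, `ToyXcntBracket`, and `outstanding()` are hypothetical names, and it assumes the upper bound is bumped before the lower bound is adjusted, so that `getScoreUB(T) - 1` refers to the score just before the current instruction.

```cpp
#include <cassert>

// Hypothetical stand-ins for the two X_CNT event groups; the real pass
// uses WaitEventType.
enum ToyEvent : unsigned { SMEM_GROUP = 0, VMEM_GROUP = 1 };

// Toy model of the X_CNT score bracket only (assumption: one counter,
// two event groups, no per-register scores).
struct ToyXcntBracket {
  unsigned ScoreLB = 0;       // scores at or below this are complete
  unsigned ScoreUB = 0;       // score of the most recently issued op
  unsigned PendingEvents = 0; // bitmask indexed by ToyEvent

  bool hasPendingEvent(ToyEvent E) const {
    return (PendingEvents & (1u << E)) != 0;
  }

  void updateByEvent(ToyEvent E) {
    unsigned CurrScore = ++ScoreUB;
    ToyEvent Other = (E == SMEM_GROUP) ? VMEM_GROUP : SMEM_GROUP;
    if (hasPendingEvent(Other)) {
      // Model the implicit xcnt wait between interleaved SMEM and VMEM:
      // every translation issued before this op is treated as done.
      ScoreLB = CurrScore - 1;
      PendingEvents &= ~(1u << Other);
    }
    PendingEvents |= 1u << E;
  }

  // Number of translations still considered in flight.
  unsigned outstanding() const { return ScoreUB - ScoreLB; }
};
```

In this model, issuing two SMEM operations followed by one VMEM operation leaves only the VMEM translation outstanding and clears the SMEM pending bit, which is why a later wait would not need to special-case the other group.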


https://github.com/llvm/llvm-project/pull/164357