[llvm] [AMDGPU] Optimize LDS DMA soft waitcnt (PR #138802)

Tue Jun 17 20:35:50 PDT 2025

================
@@ -1278,6 +1278,23 @@ bool WaitcntGeneratorPreGFX12::applyPreexistingWaitcnt(
     if (Opcode == AMDGPU::S_WAITCNT) {
       unsigned IEnc = II.getOperand(0).getImm();
       AMDGPU::Waitcnt OldWait = AMDGPU::decodeWaitcnt(IV, IEnc);
+
+      // These pseudo waitcnt instructions are only needed to synchronize DS
+      // operations with direct LDS loads that use vmcnt. We can safely relax
+      // them when no outstanding direct LDS loads exist, even if other vmcnt
+      // events are pending.
+      if (II.getOpcode() == AMDGPU::S_WAITCNT_DIRECT_LDS_LOAD_soft &&
----------------
ssahasra wrote:

Why does the opcode need to specifically identify direct loads to LDS? We can do this with just S_WAITCNT, right? As far as I can see, the memory model is this: a release fence must wait for any direct loads to LDS if the fence carries the LDS address space. In that case the opcode should be "AMDGPU::S_WAITCNT_LDS". More interestingly, we only need to do this on release fences, not on pure acquire fences.
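
To make the rule above concrete, here is a minimal sketch of the decision as I read it, combined with the "no outstanding direct LDS loads" relaxation from the patch comment. Every name in it (FenceOrdering, AS_LDS, fenceNeedsDirectLdsWait, PendingDirectLdsLoads) is hypothetical and for illustration only; none of them are existing LLVM APIs, and the real check would live in the memory legalizer and SIInsertWaitcnts.

```cpp
#include <cstdint>

// Hypothetical stand-ins for illustration; not LLVM types.
enum class FenceOrdering { Acquire, Release, AcquireRelease, SeqCst };

// Hypothetical bitmask of the address spaces a fence applies to.
enum AddrSpaceMask : uint32_t {
  AS_NONE = 0,
  AS_GLOBAL = 1u << 0,
  AS_LDS = 1u << 1,
};

// Returns true when a fence has to wait for direct loads to LDS:
//  - the fence has release semantics (a pure acquire fence does not qualify),
//  - the fence carries the LDS address space, and
//  - at least one direct load to LDS is still outstanding.
bool fenceNeedsDirectLdsWait(FenceOrdering Ord, uint32_t AddrSpaces,
                             unsigned PendingDirectLdsLoads) {
  // Everything except a pure acquire fence includes release semantics.
  const bool HasRelease = Ord != FenceOrdering::Acquire;
  const bool CoversLds = (AddrSpaces & AS_LDS) != 0;
  return HasRelease && CoversLds && PendingDirectLdsLoads > 0;
}
```

Under this reading, the wait depends only on the fence's ordering and address spaces plus the pending-event state, so the opcode itself would not need to name direct LDS loads.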

https://github.com/llvm/llvm-project/pull/138802