[llvm] [AMDGPU] Optimize LDS DMA soft waitcnt (PR #138802)

Thu Jun 19 01:57:47 PDT 2025

================
@@ -1278,6 +1278,23 @@ bool WaitcntGeneratorPreGFX12::applyPreexistingWaitcnt(
     if (Opcode == AMDGPU::S_WAITCNT) {
       unsigned IEnc = II.getOperand(0).getImm();
       AMDGPU::Waitcnt OldWait = AMDGPU::decodeWaitcnt(IV, IEnc);
+
+      // These pseudo waitcnt instructions are only needed to synchronize DS
+      // operations with direct LDS loads that use vmcnt. We can safely relax
+      // them when no outstanding direct LDS loads exist, even if other vmcnt
+      // events are pending.
+      if (II.getOpcode() == AMDGPU::S_WAITCNT_DIRECT_LDS_LOAD_soft &&
----------------
jayfoad wrote:

> Yes, that's a problem with the current separation of concerns between the memory legalizer and the waitcnt inserter.

I don't understand why you call this a "problem". Maybe it doesn't work in exactly the way you would like, but I think it does work perfectly well.

The legalizer _is_ safe by default, which means it always inserts a zero count wherever a wait of any kind is required. The inserter _can_ safely relax the zero count based on its scores, but this is implemented without the inserter having to know anything about fences/invalidates/etc specifically; instead, everything the inserter needs to know is in the waitcnt instruction itself.

For most waitcnts, no extra information is needed; it's just a regular waitcnt. For the new case in this patch, the extra information is "this is a wait on vmcnt but I'm only interested in VMEM load-to-lds instructions".

https://github.com/llvm/llvm-project/pull/138802