[llvm] [AMDGPU] introduce S_WAITCNT_FENCE_soft emitted by memory legalizer (PR #150167)

Pierre van Houtryve via llvm-commits llvm-commits at lists.llvm.org
Thu Jul 24 03:52:04 PDT 2025


================
@@ -1381,6 +1383,32 @@ bool WaitcntGeneratorPreGFX12::applyPreexistingWaitcnt(
         Modified = true;
       } else
         WaitcntInstr = &II;
+    } else if (Opcode == AMDGPU::S_WAITCNT_FENCE_soft) {
+      // Each direct load to LDS is also a store to LDS, but we do not have a
+      // separate counter for it. Instead these operations increment LOAD_CNT
+      // and need to be waited for at a release fence. So we treat a release
+      // fence as if it depends on any previous LDS DMA stores.
+      unsigned Ordering =
+          TII->getNamedOperand(II, AMDGPU::OpName::Ordering)->getImm();
+      unsigned Scope =
+          TII->getNamedOperand(II, AMDGPU::OpName::Scope)->getImm();
+      unsigned AddrSpace =
+          TII->getNamedOperand(II, AMDGPU::OpName::AddrSpace)->getImm();
+      if (isReleaseOrStronger((AtomicOrdering)Ordering) &&
----------------
Pierre-vh wrote:

Thinking about it, this part bothers me a bit because now InsertWaitCnt has to be aware of atomic orderings and deal with them accordingly. It blurs the separation of concerns between this pass and the MemoryLegalizer.

I know there is a good argument for doing that, but I think this is too generic for what we need at this stage. It's something that needs a lot of planning beforehand (and it's an item on my to-do list, though at a lower priority).
Can we consider adding a simple `s_wait_lds_dma_soft` instead, targeted exactly at this use case, and emit that? I would prefer making the minimum amount of changes now and removing that pseudo later in favor of a generic one, rather than locking us into a specific approach right now.
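To make the suggestion concrete, a rough pseudocode sketch of what the targeted pseudo's handling could look like in `applyPreexistingWaitcnt`, mirroring the diff above. The opcode name `S_WAITCNT_lds_dma_soft` and the exact wait/erase steps are hypothetical, not part of this patch:

```
// Hypothetical sketch only: a purpose-built soft wait for LDS DMA stores.
// No Ordering/Scope/AddrSpace operands to inspect -- the memory legalizer
// has already decided this point must wait on outstanding LDS DMA stores.
} else if (Opcode == AMDGPU::S_WAITCNT_lds_dma_soft) {
  // Direct loads to LDS increment LOAD_CNT, so fold the required wait
  // into the load counter and drop the soft pseudo itself.
  Wait.LoadCnt = 0; // illustrative; real code would consult ScoreBrackets
  II.eraseFromParent();
  Modified = true;
}
```

That keeps all knowledge of atomic orderings inside the MemoryLegalizer, and SIInsertWaitcnts only has to recognize one extra opcode.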

I think what I'm afraid of is that this sets a precedent: over time I suspect we'll rely more and more on this pseudo here and elsewhere (e.g. instead of fixing something properly, we just check for the pseudo elsewhere and hack in a fix there instead). We would end up with the memory model implementation spread over multiple files, which will make it difficult to manage.

https://github.com/llvm/llvm-project/pull/150167
