[llvm] [AMDGPU] introduce S_WAITCNT_FENCE_soft emitted by memory legalizer (PR #150167)

Fri Jul 25 02:53:40 PDT 2025

================
@@ -1381,6 +1383,32 @@ bool WaitcntGeneratorPreGFX12::applyPreexistingWaitcnt(
         Modified = true;
       } else
         WaitcntInstr = &II;
+    } else if (Opcode == AMDGPU::S_WAITCNT_FENCE_soft) {
+      // Each direct load to LDS is also a store to LDS, but we do not have a
+      // separate counter for it. Instead these operations increment LOAD_CNT
+      // and need to be waited for at a release fence. So we treat a release
+      // fence as if it depends on any previous LDS DMA stores.
+      unsigned Ordering =
+          TII->getNamedOperand(II, AMDGPU::OpName::Ordering)->getImm();
+      unsigned Scope =
+          TII->getNamedOperand(II, AMDGPU::OpName::Scope)->getImm();
+      unsigned AddrSpace =
+          TII->getNamedOperand(II, AMDGPU::OpName::AddrSpace)->getImm();
+      if (isReleaseOrStronger((AtomicOrdering)Ordering) &&
----------------
Pierre-vh wrote:

> In all sincerity, could it be possible that there is an analysis paralysis happening here? Are we overthinking this situation?

Yes, definitely. Sorry about that.

I'm going to try to lay out my thoughts in a simpler way, so I don't start contradicting myself again:

- I agree the MemoryLegalizer and InsertWaitCnt are inseparable, but there is still some separation of concern: InsertWaitCnt doesn't look at atomic orderings for example. That can be blurred in the future, but as the owner of the MemoryLegalizer I would like it to be a separate task that's more carefully planned, rather than done to fix a specific issue.
  - Furthermore, I see the legalizer's role as  "implement the memory model in a conservative way" while InsertWaitCnt is "optimize the waitcnts while preserving semantics" (+ insert new waits ofc)
- When I imagined a generic pseudo for waitcnt insertion, I did not imagine something that includes information like the atomic ordering. I imagined something with ad-hoc bits that carry only the information we need, and nothing else.
  - For example, we had a few times when we wondered if the memory legalizer needed to insert waits on vm_vsrc. That can't be conveyed using the AS/Ordering alone. We need something more specific, something where we can feel free to add new flags for any reason we see fit.
- My worry i that if the operation is too generic, and carries info like atomic ordering, it opens the door to implementing some memory model fixes outside the legalizer (e.g. waitcnts for specific fences would now be handled by InsertWaitCnt without the legalizer's knowledge) which I do not want as it's best kept in one place. 
  - I guess it's valid to see it as irrational as I don't have proof that it could/will happen.

So in my opinion, @kerbowa's approach in https://github.com/llvm/llvm-project/pull/138802 fits best.
Yes it's not ideal, but there's a lot of things not ideal with the current way things are laid out right now. I'd rather keep on the same trajectory by adding a specific pseudo, and then refactor it all in one batch, than try something new to fix a specific problem.

Again, sorry for derailing this a bit. The discussion spread over weeks and multiple PRs so I lost context and contradicted myself a few times.

https://github.com/llvm/llvm-project/pull/150167