[llvm] [AMDGPU] introduce S_WAITCNT_FENCE_soft emitted by memory legalizer (PR #150167)

Thu Jul 24 06:25:04 PDT 2025

================
@@ -1381,6 +1383,32 @@ bool WaitcntGeneratorPreGFX12::applyPreexistingWaitcnt(
         Modified = true;
       } else
         WaitcntInstr = &II;
+    } else if (Opcode == AMDGPU::S_WAITCNT_FENCE_soft) {
+      // Each direct load to LDS is also a store to LDS, but we do not have a
+      // separate counter for it. Instead these operations increment LOAD_CNT
+      // and need to be waited for at a release fence. So we treat a release
+      // fence as if it depends on any previous LDS DMA stores.
+      unsigned Ordering =
+          TII->getNamedOperand(II, AMDGPU::OpName::Ordering)->getImm();
+      unsigned Scope =
+          TII->getNamedOperand(II, AMDGPU::OpName::Scope)->getImm();
+      unsigned AddrSpace =
+          TII->getNamedOperand(II, AMDGPU::OpName::AddrSpace)->getImm();
+      if (isReleaseOrStronger((AtomicOrdering)Ordering) &&
----------------
ssahasra wrote:

> I know there is a good argument for doing that, but I think this being too generic for what we need at this stage. It's something that needs a lot of planning beforehand (and it's an item on my to-do list, though lower priority).

Quoting from #147257:

> Something still doesn't feel right with this PR for me, I feel like this isn't the right approach but I struggle to suggest something better.
>
> Longer term we should really just have a single waitcnt pseudo for the MemoryLegalizer that is target-independent, it'd fix issues like these if we had special sentinel values for different things.

In all sincerity, could be possible that there is an analysis paralysis happening here? Are we overthinking this situation? What could be more effective as "a single waitcnt pseudo" than a fence? What is an example of something less generic than a fence?

> Can we consider adding a simple s_wait_lds_dma_soft instead, targeted exactly for this use case, and emit that ?

I consider that too specific. I contend that the distinction between the memory legalizer and the waitcount inserter absolutely needs to be blurred. They cannot exist separately, they implement the memory model together and complement each other in that process. One specific problem with having an S_WAIT_LDS_DMA_soft is that it is needed only on release operations but not on acquire operations.

> I would prefer doing the minimum amount of changes, and then removing that pseudo later in favor of a generic one, than locking us into a specific approach right now.

Do you have any specific examples of potential concerns? This approach is not locking us into anything more than information that is already relevant to the memory model, which orderings, scopes and address spaces. It can't possibly lock us into anything.

> I think what I'm afraid of is that this sets a precedent, and over time I suspect we'll rely more and more on this pseudo here and elsewhere (e.g. instead of fixing something properly, we just check the pseudo elsewhere and hack a fix there instead), and end up with the memory model implementation being spread over multiple files, which will make it difficult to manage.

That is precisely my intention. It is a mistake to think that only the memory legalizer is relevant to the memory model. It can produce "safe cache operations and waits", but it can't produce efficient ones. The real memory model has to be spread across two files, or perhaps we should merge those two files. But I don't see this as a major blocker for what I am proposing here.

https://github.com/llvm/llvm-project/pull/150167