[llvm] [AMDGPU] Fix code sequence for barrier start in GFX10+ CU Mode (PR #160501)

Wed Oct 1 02:28:59 PDT 2025

Pierre-vh wrote:

> > > Do you think we should instead pessimize all workgroup release fences in CU mode so they have a wait on storecnt?
> > 
> > 
> > Is it a pessimization? I don't think so. Isn't the example @perlfu gave offline evidence that if a release fence intends to fence global memory, then a storecnt wait is pretty much unavoidable?
> 
> I agree. It's a bug fix, not a pessimization. On the other hand, the programmer may know that a certain part of the program only cares about synchronization within the workgroup. For such a program, opting out of transitivity is an optimization, which needs a way to be expressed in LLVM IR.

As I understand it, it's fine not to wait because the release only takes effect when the other thread observes the atomic store done as part of the release sequence. This is why, without a barrier, we need to spin (loop) on an acquire load: we know the release hasn't occurred until we load the right value.
So if we take barriers out of the picture, it is fine not to wait, because once the atomic store is seen, all previous stores are seen as well (for CU mode workgroup scope).
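
To illustrate the classic pattern I mean, here is a minimal host-side C++ sketch. It is only an analogy using the C++ memory model, not AMDGPU code, and the names `payload`, `flag`, `producer`, `consumer` and `run_once` are made up for the example. The consumer only relies on the plain store after its acquire load has observed the release store, which is why it has to spin:

```cpp
#include <atomic>
#include <thread>

// Hypothetical names; a host-side analogy, not what the backend emits.
static int payload = 0;            // plain (non-atomic) data being published
static std::atomic<int> flag{0};   // atomic that the release sequence is built on

void producer() {
    payload = 42;                              // ordinary store
    flag.store(1, std::memory_order_release);  // release store publishes it
}

int consumer() {
    // Spin on the acquire load: synchronization only happens once
    // this load observes the value written by the release store.
    while (flag.load(std::memory_order_acquire) != 1) {
        std::this_thread::yield();
    }
    return payload;  // ordered after the producer's plain store
}

int run_once() {
    payload = 0;
    flag.store(0, std::memory_order_relaxed);
    int seen = -1;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

Nothing here requires the producer to wait for its earlier stores before continuing; the ordering is only established at the point where the consumer sees `flag == 1`.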

The problem here is very barrier-specific because we're introducing a model where we synchronize without the classic release/acquire sequences that rely on an atomic store. Instead, we're adding a barrier + fence pairing, and we synchronize when leaving the barrier. That removes the requirement to spin on the acquire when a barrier is present.

https://github.com/llvm/llvm-project/pull/160501