[llvm] [AMDGPU] Optimize LDS DMA soft waitcnt (PR #138802)

Thu Jun 19 01:31:35 PDT 2025

ssahasra wrote:

> Why do we suddenly need to do that ? Is this a tailored change for a specific case? I'd like to see the reasoning in memory model terms as to why global->lds loads should be considered as normal loads and fall under the usual synchronize-with rules (and thus a new wait is needed).

A global->lds load includes a store to LDS, and that location may be read by a subsequent DS_LOAD from a _different wave_. At a release operation (a barrier, a fence, or an atomic store release), we need to ensure that _all prior stores_ have completed before the release, including these direct stores to LDS. We already emit a vmcnt(0) at global and device scopes, but not at workgroup scope. We need to insert a vmcnt wait at workgroup scope as well, but as an optimization, only when the release includes LDS and there are LDS stores still pending before the release. This set of patches models such a vmcnt using a new opcode.
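
To make the condition concrete, here is a rough sketch of the decision described above. It is not the code in the patch: `Scope`, `ReleaseInfo`, `PendingState` and `needsVmcntWait` are hypothetical names invented for illustration, and the scope names simply mirror the wording in this comment.

```cpp
// Hypothetical model of the rule; none of these names exist in LLVM.
enum class Scope { Workgroup, Device, Global };

struct ReleaseInfo {
  Scope scope;     // synchronization scope of the release operation
  bool coversLDS;  // true if the release orders the LDS address space
};

struct PendingState {
  unsigned pendingLdsDmaStores;  // global->lds loads not yet known complete
};

// Returns true if a vmcnt wait must precede the release.
constexpr bool needsVmcntWait(const ReleaseInfo &r, const PendingState &p) {
  // Wider scopes already wait unconditionally. At workgroup scope the
  // wait is only needed when the release includes LDS and there is at
  // least one pending direct store to LDS.
  return r.scope != Scope::Workgroup ||
         (r.coversLDS && p.pendingLdsDmaStores > 0);
}
```

In this model a workgroup-scope release that does not include LDS, or that has no pending LDS DMA, gets no extra wait, which is the optimization being described.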

https://github.com/llvm/llvm-project/pull/138802