[llvm] [AMDGPU] Introduce "amdgpu-sw-lower-lds" pass to lower LDS accesses. (PR #87265)

Thu Aug 22 12:31:15 PDT 2024

b-sumner wrote:

> The runtime is already bounded on how many groups it can dispatch at once; the allocation is tied to the dispatch size.

The runtime does split the dispatch into machine-sized chunks.  If it does have a limit, then it is probably much larger than we want to allocate for.

> 
> I think having the trap door of pure software LDS would enable some useful experiments, such as not depending on any whole-program visibility to lower function-defined local variables. It also reduces the number of parts that need to directly interact in the compiler pipeline. With the current approach I foresee having to fix the same bugs twice: once in the module LDS lowering, and again in the asan version of module LDS lowering.

I don't disagree. But reading global memory for the pointer will be slower. The runtime launching one dispatch at a time to manage the memory will be slower too, and we still need a kernel prolog and epilog for each workgroup to allocate and deallocate its chunk of the global allocation. I also still don't know where we are going to store the per-workgroup allocation chunk index or pointer.
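
To make the prolog/epilog cost concrete, here is a rough host-side C++ model of what each workgroup would have to do. It is only a sketch under assumptions: `ChunkPool`, `acquire`, `release`, and `base` are hypothetical names, not anything in the PR or the runtime, and the linear scan over a free-flag array is just one possible allocation scheme.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: one global allocation split into fixed-size chunks,
// one chunk per resident workgroup. Not the PR's implementation.
class ChunkPool {
public:
  ChunkPool(std::size_t numChunks, std::size_t chunkBytes)
      : chunkBytes_(chunkBytes),
        storage_(numChunks * chunkBytes),
        inUse_(numChunks) {
    // Explicitly mark every chunk free (std::atomic is not
    // value-initialized by default before C++20).
    for (auto &flag : inUse_)
      flag.store(false);
  }

  // "Prolog": claim the first free chunk with an atomic compare-exchange.
  // Returns the chunk index, or -1 if every chunk is taken.
  int acquire() {
    for (std::size_t i = 0; i < inUse_.size(); ++i) {
      bool expected = false;
      if (inUse_[i].compare_exchange_strong(expected, true))
        return static_cast<int>(i);
    }
    return -1;
  }

  // "Epilog": hand the chunk back so a later workgroup can reuse it.
  void release(int index) { inUse_[static_cast<std::size_t>(index)].store(false); }

  // Base address of a chunk. Every emulated LDS access needs this pointer,
  // so the index returned by acquire() has to be kept somewhere per workgroup.
  std::uint8_t *base(int index) {
    return storage_.data() + static_cast<std::size_t>(index) * chunkBytes_;
  }

private:
  std::size_t chunkBytes_;
  std::vector<std::uint8_t> storage_;
  std::vector<std::atomic<bool>> inUse_;
};
```

In this model the value returned by `acquire()` simply lives in a caller's local variable. On the device there is no such obvious home for it, which is the open question above.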

https://github.com/llvm/llvm-project/pull/87265