[llvm] [AMDGPU] Introduce "amdgpu-sw-lower-lds" pass to lower LDS accesses. (PR #87265)

Thu Aug 22 11:16:05 PDT 2024

arsenm wrote:

> OK. Suppose the launch has a million work groups. How much memory should the runtime allocate, and how will workgroup J decode what part of that memory to use? 

The runtime already bounds how many groups it can dispatch at once; the allocation is tied to the dispatch size.
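
As a rough sketch of one possible reading of this (not what the PR or any runtime actually implements), the backing buffer could be sized by the number of groups that can be in flight rather than by the total group count, with each resident group indexing its region by a slot number. Every name below (`SwLdsLayout`, `backingBufferBytes`, `slotOffset`) is hypothetical, and how a group obtains its slot is left open here.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical names throughout; none of these come from the PR or a runtime.
struct SwLdsLayout {
  std::uint64_t bytesPerGroup;   // LDS bytes one workgroup needs
  std::uint64_t residentGroups;  // assumed cap on groups in flight at once
};

// Size the backing buffer by the groups that can be resident at once,
// not by the total number of groups in the launch.
inline std::uint64_t backingBufferBytes(const SwLdsLayout &layout,
                                        std::uint64_t totalGroups) {
  return std::min(totalGroups, layout.residentGroups) * layout.bytesPerGroup;
}

// A resident group addresses its region through the slot it currently holds.
// How the slot is assigned (and recycled) is an assumption left unspecified.
inline std::uint64_t slotOffset(const SwLdsLayout &layout, std::uint64_t slot) {
  return slot * layout.bytesPerGroup;
}
```

Under these assumptions, a launch of a million groups with a cap of 1024 resident groups would need 1024 regions, not a million.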

> It can certainly be done but I'm wondering if we really need to do it now? And how much do we really need an independently working SW LDS?

I think having the trap door of pure software LDS would enable some useful experiments, such as lowering function-defined local variables without depending on any whole-program visibility. It also reduces the number of parts that need to interact directly in the compiler pipeline. With the current approach, I foresee having to fix the same bugs twice: once in the module LDS lowering and once in the asan version of module LDS lowering.

https://github.com/llvm/llvm-project/pull/87265