[PATCH] D91516: [AMDGPU] Replace uses of LDS globals within non-kernel functions by pointers.

Mon Mar 29 01:05:31 PDT 2021

hsmhsm added a comment.

In D91516#2643733 <https://reviews.llvm.org/D91516#2643733>, @JonChesterfield wrote:

> This is much more complicated than I expected.

We *really* need to discuss what is complicated here and what deviates from the internal email discussions.

> Is the large amount of comments largely from a previous patch doing different things that has been hammered into this one?

No, nothing is hammered in from the previous patch. The current patch implements what was planned in the internal email discussion.

> @jdoerfert the transform I think this is intended to do is:
>
> - find a large shared variable used from a function
> - add a new void*, also in shared, pointing to it
> - initialize that void* only in kernels that can call functions that use the large variable
> - replace all uses with

No, the intended plan, which is what is implemented here, is as follows.

(1) Identify the LDS globals (whether large or small) that are used within non-kernel function scope and in global scope.
(2) Create a new LDS global of i16 type for every LDS global identified above. Each i16-typed LDS global acts as a pointer to its corresponding original LDS global.
(3) Push the *use* of the identified LDS globals into kernels by adding instructions within the kernels that store the address of each original LDS global into its respective pointer. This makes sure that the per-kernel LDS allocation for these LDS globals happens correctly.
(4) Within non-kernel functions, replace the *use* of the original LDS globals by their respective pointers.
(5) Keep the global-scope uses of the original LDS globals unchanged. They should now work automatically: a use of each original LDS global (the pointer initialization) is present in all kernels, so the per-kernel LDS allocation covers these globals and the semantics are as expected.
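
To make steps (1) to (5) concrete, here is a minimal sketch on a toy model, not the actual pass. The `Function` class, the `replace_lds_uses` name, the `.ptr` suffix, and the `(pointer, target)` encoding of the initializing store are all hypothetical and exist only for illustration. The sketch also assumes, following the wording of step (5), that the pointer initialization is added to every kernel.

```python
from dataclasses import dataclass, field


@dataclass
class Function:
    """Toy stand-in for an IR function (hypothetical, for illustration only)."""
    name: str
    is_kernel: bool
    uses: list = field(default_factory=list)  # names of globals the body references
    init: list = field(default_factory=list)  # (pointer, target) stores added at entry


def replace_lds_uses(lds_globals, functions, global_scope_uses=()):
    """Rewrite `functions` in place and return {original LDS global: pointer}."""
    # Step 1: collect LDS globals used in non-kernel functions or in global scope.
    candidates = []
    for f in functions:
        if not f.is_kernel:
            candidates.extend(f.uses)
    candidates.extend(global_scope_uses)
    identified = []
    for g in candidates:
        if g in lds_globals and g not in identified:
            identified.append(g)

    # Step 2: one pointer global (i16 in the plan above) per identified LDS global.
    pointers = {g: g + ".ptr" for g in identified}

    # Step 3: every kernel stores the address of each original into its pointer.
    # Assumption: all kernels get the initialization, as step (5) suggests.
    for f in functions:
        if f.is_kernel:
            f.init = [(pointers[g], g) for g in identified]

    # Step 4: non-kernel functions go through the pointer instead of the original.
    for f in functions:
        if not f.is_kernel:
            f.uses = [pointers.get(g, g) for g in f.uses]

    # Step 5: global-scope uses are deliberately left untouched.
    return pointers
```

For example, with a kernel `kern` that uses `a` directly and a non-kernel `helper` that uses `a`, `b`, and a non-LDS global `x`, the sketch leaves `kern`'s own use of `a` alone, adds two initializing stores to `kern`, and rewrites `helper` to reference `a.ptr` and `b.ptr` while `x` stays as it is.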

> That means, on amdgcn, the large variable only costs LDS space in kernels that definitely use it. I don't know how cuda lowers shared accesses from functions, it could plausibly benefit from the same transform.

Let's not worry about how CUDA handles it, since there are a lot of differences here, and focus only on AMDGCN.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D91516/new/

https://reviews.llvm.org/D91516