[PATCH] D103225: [AMDGPU] Replace non-kernel function uses of LDS globals by pointers.

Thu Jun 10 02:31:13 PDT 2021

hsmhsm marked 4 inline comments as done.
hsmhsm added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp:120
+  DenseMap<Function *, SmallPtrSet<GlobalVariable *, 8>> KernelToLDSPointers;
+  DenseMap<Function *, BasicBlock *> KernelToInitBB;
+  DenseMap<Function *, DenseMap<GlobalVariable *, Value *>>
----------------
rampitec wrote:
> Do you really need this new map? You can get a pointer to block from KernelToLDSPointers by getting a parent of the first instruction.
I think we need it. Consider below transformed code. Here we need to track the newly introduced block **st** since this is where we need to insert store instructions. And I am not sure, how can I get it from KernelToLDSPointers.

```
    define protected amdgpu_kernel void @k0() {
    entry:
      %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
      %1 = call i32 @llvm.amdgcn.mbcnt.hi(i32 -1, i32 %0)
      %2 = icmp eq i32 %1, 0
      br i1 %2, label %st, label %ft
    st:
      store i16 ptrtoint ([4 x i32] addrspace(3)* @lds to i16),  i16 addrspace(3)* @lds.ptr, align 2
      br label %ft
    ft:
      call void @llvm.amdgcn.wave.barrier()
      call void @f0()
      ret void
    }
```

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp:150
+    LDSToNonKernels[GV] = AMDGPU::collectNonKernelAccessorsOfLDS(GV);
+    return LDSToNonKernels[GV].empty();
+
----------------
rampitec wrote:
> hsmhsm wrote:
> > rampitec wrote:
> > > If it is empty it probably will not be returned by findVariablesToLower()?
> > That is true.  But, there is a catch. Assume the below case. Here "@lds.1" used within non-kernel function scope, but also it is used as initializer to global @gptr.1. So, we cannot pointer replace it.  In this case, collectNonKernelAccessorsOfLDS() returns empty set.
> > 
> > ```
> > @lds.1 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1
> > 
> > @gptr.1 = addrspace(1) global i64* addrspacecast ([1 x i8] addrspace(3)* @lds.1 to i64*), align 8
> > 
> > define void @f0() {
> >   %bc = bitcast [1 x i8] addrspace(3)* @lds.1 to i8 addrspace(3)*
> >   store i8 1, i8 addrspace(3)* %bc, align 1
> >   ret void
> > }
> > 
> > define amdgpu_kernel void @k0() {  
> >   call void @f0()
> >   ret void
> > }
> > ```
> > That is true.  But, there is a catch. Assume the below case. Here "@lds.1" used within non-kernel function scope, but also it is used as initializer to global @gptr.1. So, we cannot pointer replace it.  In this case, collectNonKernelAccessorsOfLDS() returns empty set.
> > 
> > ```
> > @lds.1 = internal unnamed_addr addrspace(3) global [1 x i8] undef, align 1
> > 
> > @gptr.1 = addrspace(1) global i64* addrspacecast ([1 x i8] addrspace(3)* @lds.1 to i64*), align 8
> > 
> > define void @f0() {
> >   %bc = bitcast [1 x i8] addrspace(3)* @lds.1 to i8 addrspace(3)*
> >   store i8 1, i8 addrspace(3)* %bc, align 1
> >   ret void
> > }
> > 
> > define amdgpu_kernel void @k0() {  
> >   call void @f0()
> >   ret void
> > }
> > ```
> 
> Should not it be fixed by D103431?
The patch https://reviews.llvm.org/D103431 has fixed cases where few globally used lds were not getting lowering. Now, here is the basic funda:

The pass LowerModuleLDS lowers LDS which are used either in global scope or in non-kernel function scope or in both. And it directly packs the LDS within struct which we want to avoid since it wastes memory. 

So, in this pointer-replacement patch, we try to replace uses of LDS by pointers so that LowerModuleLDS pass ends-up packing pointers instead of LDS themselves.

However, we can only pointer replace those LDS which are used within non-kernel function scope, but not the ones which are used in global scope. Now in the above example, assume that we create a pointer to @lds.1 since it is used within non-kernel function scope, and replace its use within function by pointer.  Then LowerModuleLDS pass sees use of @lds.1 (in global scope) and the use of pointer within function. Hence it lowers both lds and its pointer, means it packs both @lds.1 and its pointer within struct, which is waste. 

Hence this pass does not touch the lds which are used both in global and non-kernel function scope, and leaves those lds to getting directly lowered by LowerModuleLDS pass, means directly getting packed within struct. And, I assume such cases are not frequent in practice. 

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp:152
+
+    // FIXME: Any other scenarios which disqualify LDS from replacement?
+  }
----------------
rampitec wrote:
> hsmhsm wrote:
> > rampitec wrote:
> > > JonChesterfield wrote:
> > > > Don't replace it when the variable is used by all (possibly most) kernels that can access any LDS. For example, when the variable is part of a language runtime that has been linked into all kernels.
> > > > 
> > > > Such a variable will still be allocated in all the kernels so putting it behind a pointer wastes two bytes of LDS everywhere and introduces code to do the indirection with zero benefit.
> > > > Don't replace it when the variable is used by all (possibly most) kernels that can access any LDS. For example, when the variable is part of a language runtime that has been linked into all kernels.
> > > > 
> > > > Such a variable will still be allocated in all the kernels so putting it behind a pointer wastes two bytes of LDS everywhere and introduces code to do the indirection with zero benefit.
> > > 
> > > Sound like a good TODO at the very least.
> > It is in my TODO list. It requires some logical thinking, before carefully handling this case. That is the reason, I had not yet replied to Jon's comment.
> You can leave a TODO comment here for now. Patch is already too big, so it will need a separate change anyway.
I have  kept is as TODO and come to it later in a separate patch.

================
Comment at: llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll:15
+; Pointer should be created.
+; CHECK: @lds_used_within_func.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
+
----------------
rampitec wrote:
> This likely should not be i16. You could do it this way:
> ```
> @lds_used_within_func.ptr = internal unnamed_addr addrspace(3) global i8 addrspace(3)* undef, align 2
> ```
> Then use bitcast to i8 addrspace(3)* instead of ptrtoint on store.
> 
> But I still think that creating module lds right here is better.
The main goal of using i16 as pointer (with ptrtoint and GEP) is to save memory as much as possible. Using  "i8 addrspace(3)*" means we are going to allocate memory size for pointer (not sure if it is 4 or 8 bytes, but definitely greater than 2 bytes). Hence the use of i16 type here.

Also, I do not think, it is a good idea to create module lds right here. If that was the case, we could have put this code within module LDS pass itself instead of having separate pass (I did it initially, later we decided that pointer-replacement code should go as separate pass).

This pass is already too big, let's not change the design again and further complicate the stuff which will further delay code submit, and which leads to other serious problems.

================
Comment at: llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll:41
+; CHECK:   %0 = load i16, i16 addrspace(3)* @lds_used_within_function_2.ptr, align 2
+; CHECK:   %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
+; CHECK:   %2 = bitcast i8 addrspace(3)* %1 to [2 x i32] addrspace(3)*
----------------
rampitec wrote:
> hsmhsm wrote:
> > hsmhsm wrote:
> > > rampitec wrote:
> > > > We had problems with null before. Instruction selection will happily allocate it overlapping with first non-null variable. That's how AMDGPULowerModuleLDSPass turned to use struct instead of null. Did you check we are not introducing this again in the final ISA?
> > > The null represent the base address of LDS memory.  My understanding is that, earlier, instruction selection had gone for toss because of the bug where we were not actually allocating memory at all for llvm.amdgcn.module.lds, but we were accessing it at address 0. And, instruction selection was allocating address 0 for some other lds (since this address was not occupied), so there was an issue. 
> > > 
> > > But, nevertheless, let me look into ISA and see if we are fine or any issue because of null.
> > Further, I looked into ISA, it looks good to me. For example consider below input code.
> > 
> > ```
> > @lds.1 = internal addrspace(3) global i32 undef, align 4
> > @lds.2 = internal addrspace(3) global i64 undef, align 4
> > 
> > define internal void @func_uses_lds() {
> > entry:
> >   store i32 7, i32 addrspace(3)* @lds.1
> >   store i64 31, i64 addrspace(3)* @lds.2
> >   ret void
> > }
> > 
> > define protected amdgpu_kernel void @kernel_reaches_lds() {
> > entry:
> >   call void @func_uses_lds()
> >   ret void
> > }
> > ```
> > 
> > output from pointer-replacement is: 
> > 
> > ```
> > @lds.1 = internal addrspace(3) global i32 undef, align 4
> > @lds.2 = internal addrspace(3) global i64 undef, align 4
> > @lds.1.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
> > @lds.2.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2
> > 
> > define internal void @func_uses_lds() {
> > entry:
> >   %0 = load i16, i16 addrspace(3)* @lds.2.ptr, align 2
> >   %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
> >   %2 = bitcast i8 addrspace(3)* %1 to i64 addrspace(3)*
> >   %3 = load i16, i16 addrspace(3)* @lds.1.ptr, align 2
> >   %4 = getelementptr i8, i8 addrspace(3)* null, i16 %3
> >   %5 = bitcast i8 addrspace(3)* %4 to i32 addrspace(3)*
> >   store i32 7, i32 addrspace(3)* %5, align 4
> >   store i64 31, i64 addrspace(3)* %2, align 4
> >   ret void
> > }
> > 
> > define protected amdgpu_kernel void @kernel_reaches_lds() {
> > entry:
> >   %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
> >   %1 = icmp eq i32 %0, 0
> >   br i1 %1, label %2, label %3
> > 
> > 2:                                                ; preds = %entry
> >   store i16 ptrtoint (i64 addrspace(3)* @lds.2 to i16), i16 addrspace(3)* @lds.2.ptr, align 2
> >   store i16 ptrtoint (i32 addrspace(3)* @lds.1 to i16), i16 addrspace(3)* @lds.1.ptr, align 2
> >   br label %3
> > 
> > 3:                                                ; preds = %entry, %2
> >   call void @llvm.amdgcn.wave.barrier()
> >   call void @func_uses_lds()
> >   ret void
> > }
> > ```
> > 
> > output from module (kernel) lds lowering is:
> > 
> > ```
> > %llvm.amdgcn.module.lds.t = type { i16, i16 }
> > %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t = type { i64, i32 }
> > 
> > @llvm.amdgcn.module.lds = internal addrspace(3) global %llvm.amdgcn.module.lds.t undef, align 2
> > @llvm.compiler.used = appending global [1 x i8*] [i8* addrspacecast (i8 addrspace(3)* bitcast (%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds to i8 addrspace(3)*) to i8*)], section "llvm.metadata"
> > @llvm.amdgcn.kernel.kernel_reaches_lds.lds = internal addrspace(3) global %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t undef, align 8
> > 
> > define internal void @func_uses_lds() {
> > entry:
> >   %0 = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 2
> >   %1 = getelementptr i8, i8 addrspace(3)* null, i16 %0
> >   %2 = bitcast i8 addrspace(3)* %1 to i64 addrspace(3)*
> >   %3 = load i16, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 2
> >   %4 = getelementptr i8, i8 addrspace(3)* null, i16 %3
> >   %5 = bitcast i8 addrspace(3)* %4 to i32 addrspace(3)*
> >   store i32 7, i32 addrspace(3)* %5, align 4
> >   store i64 31, i64 addrspace(3)* %2, align 4
> >   ret void
> > }
> > 
> > define protected amdgpu_kernel void @kernel_reaches_lds() {
> > entry:
> >   call void @llvm.donothing() [ "ExplicitUse"(%llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds) ]
> >   %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
> >   %1 = icmp eq i32 %0, 0
> >   br i1 %1, label %2, label %5
> > 
> > 2:                                                ; preds = %entry
> >   %3 = ptrtoint i64 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.kernel_reaches_lds.lds.t, %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t addrspace(3)* @llvm.amdgcn.kernel.kernel_reaches_lds.lds, i32 0, i32 0) to i16
> >   store i16 %3, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 1), align 2
> >   %4 = ptrtoint i32 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.kernel.kernel_reaches_lds.lds.t, %llvm.amdgcn.kernel.kernel_reaches_lds.lds.t addrspace(3)* @llvm.amdgcn.kernel.kernel_reaches_lds.lds, i32 0, i32 1) to i16
> >   store i16 %4, i16 addrspace(3)* getelementptr inbounds (%llvm.amdgcn.module.lds.t, %llvm.amdgcn.module.lds.t addrspace(3)* @llvm.amdgcn.module.lds, i32 0, i32 0), align 2
> >   br label %5
> > 
> > 5:                                                ; preds = %entry, %2
> >   call void @llvm.amdgcn.wave.barrier()
> >   call void @func_uses_lds()
> >   ret void
> > }
> > ```
> > 
> > what it fundamentally means is: 
> > 
> > 1.  address 0 holds the address of @lds.1, we need to load address of @lds.1 from address 0, and then we need to store value 7 to @lds.1.
> > 2.  address 0 + 2 (offset 2) holds the address of @lds.2, we need to load address of @lds.2 from address 0 + 2, and then we need to store value 31 to @lds.2. That is what happening in the below ISA which is produced for above input ir.
> > 
> > ```
> > func_uses_lds:                          ; @func_uses_lds
> > ; %bb.0:                                ; %entry
> >         s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
> >         v_mov_b32_e32 v1, 0
> >         ds_read_i16 v2, v1
> >         ds_read_i16 v3, v1 offset:2
> >         v_mov_b32_e32 v4, 7
> >         v_mov_b32_e32 v0, 31
> >         s_waitcnt lgkmcnt(1)
> >         ds_write_b32 v2, v4
> >         s_waitcnt lgkmcnt(1)
> >         ds_write_b64 v3, v[0:1]
> >         s_waitcnt lgkmcnt(0)
> >         s_setpc_b64 s[30:31]
> > ```
> So the idea is llvm.amdgcn.module.lds will only contain converted pointers and since it is forced to be allocated at address zero these pointers will fall into the allocated space?
> 
> We need to be super careful here because if anything else will be injected into the llvm.amdgcn.module.lds by the module lds pass it will all break.
> 
> In fact there seems to be no work for module lds after this for the module, only for kernels. You probably can just create module structure right here. Then module lds should skip run on module (but not on kernels) if that variable already exists. It will also untangle dependency between passes. In addition you will resolve issue with null as you would just use that structure and Matt's concern about ptrtoint.
As I mentioned in one my earlier comments - we still need llvm.amdgcn.module.lds. We cannot remove it for now. 

The main reason for above is - for a given use of LDS within some non-kernel function, we need to make sure that appropriate memory for LDS is being accessed based on which kernel is called that function. We have not yet solved this issue. The current available solution is this llvm.amdgcn.module.lds. 

But, we should attempt to avoid wasting of memory by llvm.amdgcn.module.lds, hence this pointer replacement pass. 

What this pass does is below:

(1) create an i16 pointer "lds.ptr" to "lds" where "lds" is used in some non-kernel function.
(2) Within kernel:
          lds.ptr = ptrtoint(lds)
(3) Within non-kernel where lds is used:
          lds.addr = GEP(null, lds.ptr)   // where null is the base address of the LDS memory, and we use GEP here to avoid inttoptr
          replace use of "lds" by "lds.addr" after apropriately bitcasting.

What it means to LowerModuleLDS pass is below:

(1) The use of "lds" is moved to kernel (pointer initialization), hence it is properly allocated within kernel as a part of kernel specific struct, and more important it is allocated only if there is call from kernel to function. Otherwise there is no this pointer initialization, and hence no usage, and hence no allocation.

(2)  It sees the use of "lds.ptr" within non-kernel function, and hence it lowers this ptr by putting it within module struct.

Otherwise, we still allocate memory for llvm.amdgcn.module.lds within every kernel. But, now we are not wasting too much memory.

In anycase, within every kernel, llvm.amdgcn.module.lds is allocated at address 0. all the other vars (belonging to kernel) are allocated at some non-zero addresses.

In this way, I do not see any problem here, unless I am missing any corner, but serious case for conflcting with address 0.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103225/new/

https://reviews.llvm.org/D103225