[PATCH] D102401: [AMDGPU] Allocate LDS globals in sorted order of their alignment and size.

Mon May 17 20:59:31 PDT 2021

hsmhsm added a comment.

In D102401#2764143 <https://reviews.llvm.org/D102401#2764143>, @arsenm wrote:

> In D102401#2763925 <https://reviews.llvm.org/D102401#2763925>, @hsmhsm wrote:
>
>> But, do we have any immediate available neat solution(s) which will make sure that all the LDS globals which are used within a kernel  is replaced by single struct (or something else) instance?
>
> That's essentially what AMDGPULowerModuleLDSPass does, and could be extended to have a kernel specific table

Ok, assume that we will extend AMDGPULowerModuleLDSPass to have a kernel specific table. What I am not getting is - how to use this table in order to replace LDS globals by their offset counter-parts within non-kernel functions. Passing kernel-id as an argument is ruled out, since the presence of indirect calls will really break this since function pointers can be used in so many varieties of ways.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp:61
+    const auto *I = dyn_cast<Instruction>(UU);
+    assert(I && "Expected an instruction");
+    const Function *F = I->getFunction();
----------------
rampitec wrote:
> hsmhsm wrote:
> > rampitec wrote:
> > > hsmhsm wrote:
> > > > rampitec wrote:
> > > > > It can probably be Operator.
> > > > I did not understand this comment.
> > > llvm::Operator.
> > llvm::Operator is an abstracrtion over Instruction and ConstExpr. An llvm::Operator should be either an Instruction OR ConstExpr, and hence explicit handling of llvm::Operator is not required.
> No, it is a separate subclass of User: https://llvm.org/doxygen/classllvm_1_1User.html
But, if I look into the implemenation of Operator at - https://llvm.org/doxygen/Operator_8h_source.html it clearly tells me that an instance of Operator class will never ever be created, and it is really a place holder for Instrunction and ConstExpr.

```
   /// Return the opcode for this Instruction or ConstantExpr.
   unsigned getOpcode() const {
     if (const Instruction *I = dyn_cast<Instruction>(this))
       return I->getOpcode();
     return cast<ConstantExpr>(this)->getOpcode();
   }
```

================
Comment at: llvm/lib/Target/AMDGPU/Utils/AMDGPULDSUtils.cpp:115
+  SmallVector<const GlobalVariable *, 8> StaticLDSGlobals =
+      collectStaticLDSGlobals(&M);
+
----------------
rampitec wrote:
> hsmhsm wrote:
> > rampitec wrote:
> > > Maybe just pass a filter lambda there?
> > I personally think that it will be over complicated - collectStaticLDSGlobals() is called by both findVariablesToLower() and getSortedUsedLDSGlobals().  Further filtering of LDS globals is required for findVariablesToLower(), but not for getSortedUsedLDSGlobals().
> I do not like the idea of allocating memory, then copying.
Then you mean to say, let's return vector, instead of passing it as reference arg? then we cannot use SmallVectorImpl<> as return type, so, we need to go back to returning SmallVector<> itself.

================
Comment at: llvm/test/CodeGen/AMDGPU/lds-allocation.ll:35
+
+; SDAG-MIR-LABEL: bb.0 (%ir-block.0):
+; SDAG-MIR: [[MOV_1:%[0-9]+:sreg_32]] = S_MOV_B32 0
----------------
rampitec wrote:
> The test checking mir is not any better than a test checking instructions. In fact it is worse because it is using unstable mir interface. It still does not check for an actual layout in any transparent way, it checks the offsets. You can check the offsets in the produces ISA.
> 
> It does not help that checks are autogenerated, it obscures relevant checks. For that sort of test you need to check the offsets, nothing more. It also depends on an assumption that stores will not be rescheduled and there is no such guarantee in this case. To w/a it you can store different values (e.g. 1, 2, 3, 4 etc), check that in the ISA, and mark stores volatile.
> 
> One thing it does not check is that we do not allocate more LDS than used. It does not check the situation with multiple kernels using intersecting LDS variable sets.
> 
> I was looking at a test which will really check memory layout, like that is possible with AMDGPULowerModuleLDSPass. In fact I am probably second to Matt we can try to use AMDGPULowerModuleLDSPass for kernels too, not to multiply entities.
I think, all of you are kind of suggesting to extend AMDGPULowerModuleLDSPass. But, I am not sure, how it is easily extendible, moreever, it is being invoked before promoteAlloca pass at the moment. We really need to discuss it and come to a common conclusion, before proceeding further.

For me, it looks like - there is a ball, half of which is painted white, and other half black, few are looking at white part, and others looking at black part, and not able to come to a common conclusion.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D102401/new/

https://reviews.llvm.org/D102401