[PATCH] D102401: [AMDGPU] Allocate LDS globals in sorted order of their alignment and size.

Wed May 19 10:32:00 PDT 2021

rampitec added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUMachineFunction.cpp:61
+    const auto *I = dyn_cast<Instruction>(UU);
+    assert(I && "Expected an instruction");
+    const Function *F = I->getFunction();
----------------
hsmhsm wrote:
> rampitec wrote:
> > hsmhsm wrote:
> > > rampitec wrote:
> > > > hsmhsm wrote:
> > > > > rampitec wrote:
> > > > > > hsmhsm wrote:
> > > > > > > rampitec wrote:
> > > > > > > > It can probably be Operator.
> > > > > > > I did not understand this comment.
> > > > > > llvm::Operator.
> > > > > llvm::Operator is an abstracrtion over Instruction and ConstExpr. An llvm::Operator should be either an Instruction OR ConstExpr, and hence explicit handling of llvm::Operator is not required.
> > > > No, it is a separate subclass of User: https://llvm.org/doxygen/classllvm_1_1User.html
> > > But, if I look into the implemenation of Operator at - https://llvm.org/doxygen/Operator_8h_source.html it clearly tells me that an instance of Operator class will never ever be created, and it is really a place holder for Instrunction and ConstExpr.
> > > 
> > > ```
> > >    /// Return the opcode for this Instruction or ConstantExpr.
> > >    unsigned getOpcode() const {
> > >      if (const Instruction *I = dyn_cast<Instruction>(this))
> > >        return I->getOpcode();
> > >      return cast<ConstantExpr>(this)->getOpcode();
> > >    }
> > > ```
> > Example:
> > 
> > ```
> > %sunkaddr4 = getelementptr inbounds i8, i8* bitcast ({ %union, [2000 x i8] }* @global_pointer to i8*), i64 %sunkaddr3
> > ```
> If I understand it correctly, @global_pointer is used within a constant expression - bitcast, and this constant expression is used within an instruction - getelementptr.
> 
> And this case is neatly handled in the code. I am not sure how relevant it to the "Operator" that you brought up. I still believe, there is no need to check for "Operator".
OK, I was unable to exploit it, it indeed cast to Constant.

================
Comment at: llvm/test/CodeGen/AMDGPU/GlobalISel/lds-allocation.ll:35
+; HSA-LABEL: {{^}}fix_alignment_0:
+; HSA:        v_mov_b32_e32 v0, 1
+; HSA:        v_mov_b32_e32 v1, 30
----------------
One remaining problem with these tests now they depend on the register allocation and actual register names. It will be difficult to maintain.

================
Comment at: llvm/test/CodeGen/AMDGPU/lds-allocation.ll:35
+
+; SDAG-MIR-LABEL: bb.0 (%ir-block.0):
+; SDAG-MIR: [[MOV_1:%[0-9]+:sreg_32]] = S_MOV_B32 0
----------------
hsmhsm wrote:
> rampitec wrote:
> > hsmhsm wrote:
> > > rampitec wrote:
> > > > The test checking mir is not any better than a test checking instructions. In fact it is worse because it is using unstable mir interface. It still does not check for an actual layout in any transparent way, it checks the offsets. You can check the offsets in the produces ISA.
> > > > 
> > > > It does not help that checks are autogenerated, it obscures relevant checks. For that sort of test you need to check the offsets, nothing more. It also depends on an assumption that stores will not be rescheduled and there is no such guarantee in this case. To w/a it you can store different values (e.g. 1, 2, 3, 4 etc), check that in the ISA, and mark stores volatile.
> > > > 
> > > > One thing it does not check is that we do not allocate more LDS than used. It does not check the situation with multiple kernels using intersecting LDS variable sets.
> > > > 
> > > > I was looking at a test which will really check memory layout, like that is possible with AMDGPULowerModuleLDSPass. In fact I am probably second to Matt we can try to use AMDGPULowerModuleLDSPass for kernels too, not to multiply entities.
> > > I think, all of you are kind of suggesting to extend AMDGPULowerModuleLDSPass. But, I am not sure, how it is easily extendible, moreever, it is being invoked before promoteAlloca pass at the moment. We really need to discuss it and come to a common conclusion, before proceeding further.
> > > 
> > > For me, it looks like - there is a ball, half of which is painted white, and other half black, few are looking at white part, and others looking at black part, and not able to come to a common conclusion.
> > PromoteAlloca is probably not an issue, it will just add a new allocation past that struct. It should be already properly aligned as we are creating it ourselves. The current waste of LDS by the AMDGPULowerModuleLDSPass is an issue though. Ideally it would be nice to come up with a single solution for functions and kernels which I guess we are continue discussing...
> I still do not understand, if we can really handle it within AMDGPULowerModuleLDSPass.  Expecting a single struct variable and only single struct variable within every kernel is too much ideal expectation at this point. And not sure, why you guys are insistiing about it.
> 
> Strictly speaking, this pass does not update align of LDS globals? It only fix alignment and allocate memory accordingly. So, the lit tests should check for allocated boundary and total size allocated. And that is what I did here. If you are not happy with MIR result, then, I can check it on final assembly.
> 
> To answer some of your earlier comments:
> 
> >> The test checking mir is not any better than a test checking instructions. In fact it is worse because it is using unstable mir interface. It still does not check for an actual layout in any transparent way, it checks the offsets. You can check the offsets in the produces ISA.
> 
> I can change it to check instructions instead of mir. Based on the sorting algorithm implemented - we expect which LDS global should be allocated on which assdresses, and how much LDS should be allocated in total.  That is what I did here.
> 
> >> It does not help that checks are autogenerated, it obscures relevant checks. For that sort of test you need to check the offsets, nothing more. It also depends on an assumption that stores will not be rescheduled and there is no such guarantee in this case. To w/a it you can store different values (e.g. 1, 2, 3, 4 etc), check that in the ISA, and mark stores volatile.
> 
> >> Let me see, what is missing here.
> 
> >> One thing it does not check is that we do not allocate more LDS than used. 
> 
> I am checking for it as in  "SDAG-HSA: amdhsa_group_segment_fixed_size 31"
> 
> >> It does not check the situation with multiple kernels using intersecting LDS variable sets.
> 
> I will try to improve the test cases..
> 
> >> I was looking at a test which will really check memory layout like that is possible with AMDGPULowerModuleLDSPass
> 
>   Checking which lds global should be allocated on which memory is not memory layout? Could you please give an example of what are you expecting?
> 
> >> In fact I am probably second to Matt we can try to use AMDGPULowerModuleLDSPass for kernels too, not to multiply entities.
> 
> I doubt if we can quickly achieve it considering all the mess happening right now.
> >> I was looking at a test which will really check memory layout like that is possible with AMDGPULowerModuleLDSPass
> 
>   Checking which lds global should be allocated on which memory is not memory layout? Could you please give an example of what are you expecting?

I was thinking about a test just showing an IR globals after the transformation, but I guess since it is still just the lowering the IR is not updated and such test is not possible.

> >> In fact I am probably second to Matt we can try to use AMDGPULowerModuleLDSPass for kernels too, not to multiply entities.

> I doubt if we can quickly achieve it considering all the mess happening right now.

Given it does not solve minimal per-kernel allocation as we discussed yes, we cannot just use it right away.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D102401/new/

https://reviews.llvm.org/D102401