[PATCH] D91516: [AMDGPU] Support for device scope shared variables

Mahesha S via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Nov 17 23:20:35 PST 2020


hsmhsm added a comment.

In D91516#2398665 <https://reviews.llvm.org/D91516#2398665>, @hsmhsm wrote:

> In D91516#2397323 <https://reviews.llvm.org/D91516#2397323>, @arsenm wrote:
>
>> This is a bit different than the most recent proposal which I thought avoided the need to pass multiple arguments per kernel and allowed supporting indirect calls. I thought this was going to produce a table in constant memory containing the offsets which would be indexed instead.
>
> We had not arrived at a general consensus on which approach to take. The suggested proposals were:
>
> (1)  Function-argument-driven approach
> (2)  Table-driven approach:
>
>   (2.1)  Table within global memory
>   (2.2)  Table within shared memory
>   (2.3)  Table within constant memory
>
> Since this is a rarely used feature, and as Sam suggested in one of his early emails while discussing it, I have chosen approach (1) because it is the comparatively simpler one; I will try the others only if this approach does not work in practice, whether due to performance issues or other valid reasons.
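To make the choice concrete, here is a rough, self-contained C++ analogue of what approach (1) does (the names `LDSBlock`, `bump`, and `kernel` are illustrative, not from the patch): a callee stops referring to a module-scope LDS global directly, and instead receives the base of a single clubbed LDS block that the kernel owns, threaded through as an extra argument.

```cpp
#include <cassert>

// Before the pass (conceptually):
//   __shared__ int Counter;            // module-scope LDS global
//   __device__ void bump() { Counter++; }
//
// After the pass: the kernel owns one clubbed block for all LDS
// globals used by its transitive callees, and passes its base down.
struct LDSBlock {
  int Counter; // hypothetical offset 0 in the clubbed block
};

// The callee now takes the LDS base pointer instead of touching a global.
static void bump(LDSBlock *LDS) { LDS->Counter++; }

// The "kernel" allocates the block once and passes the pointer down
// the (direct) call chain.
static int kernel() {
  LDSBlock Block{0};
  bump(&Block);
  bump(&Block);
  return Block.Counter;
}
```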



>> Needs test with indirect calls and stored function addresses.
>
> Indirect calls are something I completely missed. Thanks for pointing that out. Let me take a look at whether the current implementation handles indirect calls too.

My further analysis tells me that indirect function calls are a bit tricky to handle irrespective of which approach we take, and they require further thinking. For example, we first need to traverse the call graph from each kernel to its end-callees (where the LDS globals are defined) in order to collect all the LDS globals that need to be clubbed together within that kernel (see line number 229, pairUpKernelWithCalleeList()), and as far as I can see the call graph does not capture information about indirect calls.

So, as I updated in the summary, this implementation will not handle indirect function calls. I personally think that this implementation is currently okay to satisfy the needs of the RCCL team, since indirect function calls are yet to be properly supported in the AMDGPU back-end anyway. If you think otherwise, then we need to fundamentally rethink the whole implementation.



================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:163
+    auto OldCallieList = KI->second;
+    std::set<Function *> NewCallieList;
+    for (auto *OldCallie : OldCallieList) {
----------------
hsmhsm wrote:
> arsenm wrote:
> > Spelling, Callee
> agree.
done


================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:330-332
+unsigned getFixedSizeOfTypeInBytes(Module &M, Type *Ty) {
+  return getFixedSizeOfTypeInBits(M, Ty) / 8;
+}
----------------
hsmhsm wrote:
> arsenm wrote:
> > getTypeStoreSize
> agree
done
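For the record, the reason `getTypeStoreSize` is preferable to dividing the bit size by 8 is that the division truncates for types whose width is not a byte multiple (e.g. i1 or i17). A minimal sketch of the two computations (these helpers are illustrative; `DataLayout::getTypeStoreSize` is the real API):

```cpp
#include <cassert>

// Truncating division, as in the original helper: wrong for bit widths
// that are not a multiple of 8 (e.g. i1 -> 0 bytes).
static unsigned naiveSizeInBytes(unsigned Bits) { return Bits / 8; }

// What getTypeStoreSize effectively computes for integer types:
// round the bit width up to whole bytes.
static unsigned storeSizeInBytes(unsigned Bits) { return (Bits + 7) / 8; }
```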


================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:411
+
+  switch (I->getOpcode()) {
+  case Instruction::GetElementPtr: {
----------------
hsmhsm wrote:
> arsenm wrote:
> > I don't see why you need to specially handle all user types. You aren't changing the type, so you should be able to do a simple replaceAllUsesWith
> Let me take a relook at it.
done
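The point here, that a type-preserving replacement needs no per-opcode handling, can be modelled with a minimal use-list (this is an illustrative stand-in, not LLVM's actual Value/Use classes): `replaceAllUsesWith` rewires every use slot uniformly, so the kind of instruction doing the using never matters.

```cpp
#include <cassert>
#include <vector>

// Minimal stand-in for Value/Use: each user holds a pointer slot, and
// replaceAllUsesWith rewires every slot regardless of what kind of
// instruction the user is -- no switch over opcodes needed.
struct Value {
  std::vector<Value **> Uses; // slots in users that point at this value

  void replaceAllUsesWith(Value *Repl) {
    for (Value **Slot : Uses) {
      *Slot = Repl;
      Repl->Uses.push_back(Slot);
    }
    Uses.clear();
  }
};
```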


================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:632-635
+    auto *CI = dyn_cast<CallInst>(U);
+#ifndef NDEBUG
+    assert(CI && "Valid call instruction expected");
+#endif
----------------
hsmhsm wrote:
> arsenm wrote:
> > Just use cast<>. This will also fail on cases where you are storing the address of the function, so I don't think this is what you want.
> Let me take a relook at it.
As I updated in the summary, we have certain blockers that we need to handle for indirect calls, and this implementation does not handle indirect calls. So, this check is still valid from the point of view of the current implementation.
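Arsenm's failure mode can be illustrated with a self-contained stand-in for the user hierarchy (using `dynamic_cast` in place of LLVM's `dyn_cast`; the class names mimic LLVM but this is not its real API): when a function's address is merely stored, the user is not a call instruction, so the "every user is a call" assumption yields null and the assert fires.

```cpp
#include <cassert>

// Minimal stand-in for the User hierarchy: a function's users can be
// direct calls, or instructions that merely store its address (the
// seed of a later indirect call).
struct User { virtual ~User() = default; };
struct CallInst : User {};
struct StoreInst : User {};

// The pattern in the patch: assume the user is a call. For a StoreInst
// user (stored function address) this returns null, so an
// assert(CI && ...) on the result fires.
static CallInst *asCall(User *U) { return dynamic_cast<CallInst *>(U); }
```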


================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:930-934
+static Type *getLDSBaseType(Type *LDSTy) {
+  if (auto *ArrTy = dyn_cast<ArrayType>(LDSTy))
+    return getLDSBaseType(ArrTy->getElementType());
+  return LDSTy;
+}
----------------
hsmhsm wrote:
> arsenm wrote:
> > I don't understand what the "base type" means or why we would care
> "base type" means the base type of an aggregate type. I somehow assumed that information about the base type of an aggregate type was required. But let me take a relook at it.
done
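For clarity on what "base type" meant here, the recursion simply peels array wrappers until a scalar element type remains (e.g. `[4 x [8 x i32]] -> i32`). A self-contained analogue with a toy type model (illustrative, not LLVM's Type hierarchy):

```cpp
#include <cassert>

// Toy type model: Elem is non-null for array types, null for scalars.
struct Ty {
  const Ty *Elem = nullptr;
};

// Analogue of getLDSBaseType: strip array wrappers down to the scalar.
static const Ty *baseType(const Ty *T) {
  while (T->Elem)
    T = T->Elem;
  return T;
}
```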


================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:984-985
+  // floating-point types.
+  if (!LDSBaseType->isIntegerTy() && !LDSBaseType->isFloatingPointTy())
+    llvm_unreachable("Not Implemented.");
+
----------------
hsmhsm wrote:
> arsenm wrote:
> > Memory types don't matter, there should be no need to consider it at all
> As mentioned above, let me take a relook at it.
done


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D91516/new/

https://reviews.llvm.org/D91516



More information about the llvm-commits mailing list