[PATCH] D91516: [AMDGPU] Support for device scope shared variables

Thu Jan 7 09:27:16 PST 2021

arsenm added a comment.

This is ignoring the alignments of the variables. This needs some tests to stress padding edge cases, and also probably should try sorting to minimizing padding.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:59-60
+//      computed as in above item C.
+//   F. Within each kernel, type cast the corresponding single big shared memory
+//      layout to `char*`, and pass this type-casted pointer and the kernel
+//      number as new function arguments along the call graph path(s) so that
----------------
No reason to mention C types

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:70
+//
+// Is one really exist?
+//
----------------
Don't understand this comment

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:197-198
+INITIALIZE_PASS(AMDGPUDeviceScopeSharedVariable,
+                "implement-amdgpu-device-scope-shared-variable",
+                "Implement AMDPGU Device Scope Shared Variable",
+                false /*only look at the cfg*/, false /*analysis pass*/)
----------------
I don't like including "implement" in the name, and most of the backend doesn't refer to this as shared memory. How about "amdgpu-lower-local-local-memory-globals"

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:198
+                "implement-amdgpu-device-scope-shared-variable",
+                "Implement AMDPGU Device Scope Shared Variable",
+                false /*only look at the cfg*/, false /*analysis pass*/)
----------------
Spelling AMDPGU

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:201-207
+static unsigned getTypeStoreSizeInBits(Module &M, Type *Ty) {
+  return M.getDataLayout().getTypeSizeInBits(Ty).getFixedSize();
+}
+
+static unsigned getTypeStoreSizeInBytes(Module &M, Type *Ty) {
+  return getTypeStoreSizeInBits(M, Ty) / 8;
+}
----------------
These functions only make this harder to read. You're losing value vs. just directly using the datalayout functions

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:242
+  // actually defined.
+#ifndef NDEBUG
+  assert(I->getParent()->getParent() == F &&
----------------
ifndef is redundant

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:250
+  // of `LDS` by `LDSAccessInst`.
+  Instruction *NewI = I->clone();
+  unsigned Ind = 0;
----------------
This function is too mechanically complicated. There shouldn't be any need to clone any instructions. Only individual values need replacing

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:281
+    // Cast away const-ness from `U`.
+    User *UU = const_cast<User *>(U);
+
----------------
Should not const_cast

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:292
+    } else
+      llvm_unreachable("Not Implemented."); // TODO: What else is missing?
+  }
----------------
Should be plain Constant, not just ConstantExpr

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:304-306
+  // Suffix the names of the instructions with unique integer values.
+  static int Suffix = 0;
+  ++Suffix;
----------------
This happens for you automatically, you don't need to manage this

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:327
+                      KIndex, LIndex};
+  auto *GEPInst = GetElementPtrInst::CreateInBounds(
+      LDSOffsetTable->getValueType(), LDSOffsetTable, Indices,
----------------
Should use IRBuilder instead of the more cumbersome instruction constructors

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:332
+  // Insert LOAD instruction which loads `offset` value from LDS offset table.
+  auto *LInst = new LoadInst(GEPInst->getType()->getPointerElementType(),
+                             GEPInst, Twine("dssv.load.") + Twine(Suffix),
----------------
Don't use the pointee element type

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:355-359
+static void updateFunctionAssociatedWithLDS(
+    Module &M, GlobalVariable *LDS,
+    ValueMap<GlobalVariable *, Function *> &LDSToFunction,
+    std::map<GlobalVariable *, uint64_t> &LDSToID,
+    GlobalVariable *LDSOffsetTable) {
----------------
Just inline this function?

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:428-432
+static void
+updateNonCallInsts(ValueMap<Function *, Function *> &OldCalleeToNewCallee,
+                   SetVector<Instruction *> &NonCallInsts) {
+  replaceNonCallInsts(OldCalleeToNewCallee, NonCallInsts);
+}
----------------
Pointless function

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:454-457
+#ifndef NDEBUG
+  assert(MD && "Metadata about indirect call targets expected\n");
+#endif
+
----------------
This is not OK. The metadata is purely optional

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:469-470
+static void copyDataFromOldToNewCallSite(CallInst *OldCI, CallInst *NewCI) {
+  // TODO: Why copyMetadata() not copying meta data. I see metadat associated
+  // with CI, but it is not copied to NewCI. CI->hasMetadata() is false, why?
+  NewCI->copyMetadata(*OldCI);
----------------
Comment doesn't make sense

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:480
+  auto *BasePtrTy = Type::getInt8PtrTy(M.getContext(), AMDGPUAS::LOCAL_ADDRESS);
+  auto *KernNumTy = Type::getInt64Ty(M.getContext());
+
----------------
i32 should be more than sufficient

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:484-485
+  auto *FnTy = Callee->getFunctionType();
+  for (auto PI = FnTy->param_begin(), PE = FnTy->param_end(); PI != PE; ++PI)
+    NewParams.push_back(*PI);
+
----------------
I think you can do this in the constructor like NewParams(param_begin(), param_end())

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:496
+
+static void
+getNewArgumentList(Module &M, std::map<Function *, uint64_t> &KernelToID,
----------------
Comment this function

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:507-508
+    auto KernelNum = KernelToID[Caller];
+    KernNumArg = Constant::getIntegerValue(Type::getInt64Ty(M.getContext()),
+                                           APInt(64, KernelNum));
+  } else {
----------------
ConstantInt::get or IRBuilder::getInt64 is simpler

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:521-525
+static void replaceIndirectCallSites(
+    Module &M, std::map<Function *, uint64_t> &KernelToID,
+    std::map<Function *, Instruction *> &KernelToBasePtrInst,
+    ValueMap<Function *, Function *> &OldCalleeToNewCallee,
+    ValueMap<CallInst *, Function *> &IndirectCallSiteToCallee) {
----------------
A lot of these functions have too many parameters and would benefit from moving these to be pass class members

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:571
+static bool
+isApplicableInst(Instruction *I,
+                 ValueMap<Function *, Function *> &NewCalleeToOldCallee) {
----------------
Function name and purpose is unclear

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:573
+                 ValueMap<Function *, Function *> &NewCalleeToOldCallee) {
+  // We are only interested in the inctructions within kernel or within new
+  // cloned functions.
----------------
Spelling inctructions

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:575
+  // cloned functions.
+#ifndef NDEBUG
+  assert(I && "Valid instruction expected\n");
----------------
Redundant ifndef

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:624
+
+    auto *CI = dyn_cast<CallInst>(V);
+    if (CI && CI->isIndirectCall()) {
----------------
CallBase, should also catch invokes

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:650-654
+static void
+replaceDirectCallSites(Module &M, std::map<Function *, uint64_t> &KernelToID,
+                       std::map<Function *, Instruction *> &KernelToBasePtrInst,
+                       Function *NewCallee,
+                       SetVector<CallInst *> &DirectCallSites) {
----------------
I don't think there should be a distinction between direct and indirect call sites. Both should behave the same way

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:940
+  // One is LDS layout base pointer which is of type `char*`, and other one is
+  // kernel number which is of type `Int64`.
+  //
----------------
Shouldn't refer to C types. Also is it really necessary to use i64 for the kernel number? 32-bits should be enough?

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1051
+      M, Arr2DTy, false, GlobalValue::InternalLinkage, Const2D,
+      Twine("__LDSGlobalsOffsetTable__"), nullptr,
+      GlobalVariable::NotThreadLocal, AMDGPUAS::CONSTANT_ADDRESS);
----------------
Should make sure this uses a reserved name with periods/prefix. How about llvm.amdgcn.lds.globals.offset.table

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1057
+  LDSOffsetTable->setAlignment(
+      MaybeAlign(M.getDataLayout().getPreferredAlign(LDSOffsetTable)));
+
----------------
MaybeAlign should be unnecessary (or at least should be Align)

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1077
+  NewLDS->setUnnamedAddr(GlobalValue::UnnamedAddr::Global);
+  NewLDS->setAlignment(MaybeAlign(M.getDataLayout().getPreferredAlign(NewLDS)));
+
----------------
This needs to be the alignment implied by the individual variables

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1083
+static void
+computeTotalLDSSizeInBytes(ValueMap<GlobalVariable *, uint64_t> &LDSToSize,
+                           SetVector<GlobalVariable *> &KernelLDSList,
----------------
I don't see why we need a map of the variable to the size, you can always trivially compute it from the value

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1086
+                           std::map<GlobalVariable *, uint64_t> &LDSToOffset,
+                           uint64_t &TotalLDSSizeInBytes) {
+  // For the current kernel, compute the total size of all LDS globals, and also
----------------
Making this an out argument is confusing, just use the return value

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1087-1094
+  // For the current kernel, compute the total size of all LDS globals, and also
+  // offsets associated with them within in the new LDS global.
+  TotalLDSSizeInBytes = 0;
+  for (auto *LDS : KernelLDSList) {
+    LDSToOffset[LDS] = TotalLDSSizeInBytes;
+    TotalLDSSizeInBytes += LDSToSize[LDS];
+  }
----------------
This seems to be totally ignoring the alignment of the individual variables

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1096
+
+static void constructKernelSpecificLDSLayouts(
+    Module &M, SetVector<Function *> &Kernels,
----------------
Document function

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1167
+
+  // Create a single contigeous LDS global layout for each kernel.
+  std::map<Function *, std::map<GlobalVariable *, uint64_t>> KernelToLDSOffset;
----------------
Spelling contigeous

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1203-1204
+
+static void
+getLDSGlobalSizeInBytes(Module &M, GlobalVariable *LDSGlobal,
+                        ValueMap<GlobalVariable *, uint64_t> &LDSToSize) {
----------------
I think having this as a separate function just makes this harder to follow

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1206
+                        ValueMap<GlobalVariable *, uint64_t> &LDSToSize) {
+  LDSToSize[LDSGlobal] = getTypeStoreSizeInBytes(M, LDSGlobal->getValueType());
+}
----------------
This needs to be the the type allocation size

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1243-1247
+static void filterDeviceScopeLDSGlobals(
+    SetVector<GlobalVariable *> &LDSGlobals,
+    ValueMap<GlobalVariable *, Function *> &LDSToFunction) {
+  // Filter out all LDS globals which are defined within kernels since we don`t
+  // need to handle them.
----------------
This doesn't make sense, the same variable may appear in a kernel and in a function.

Also this refers to "device scope" which is not a concept in the backend

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1417-1420
+    if (auto *I = dyn_cast<Instruction>(U)) {
+      auto *F = I->getParent()->getParent();
+      if (!F)
+        continue;
----------------
What happens if an LDS variable is used in a constant initializer?

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1422-1424
+      // We are only interested in device scope shared variables.
+      if (F->getCallingConv() != CallingConv::AMDGPU_KERNEL)
+        LDSToFunction[LDSGlobal] = F;
----------------
I don't understand this. The whole point of the pass is to handle non-kernel uses

"Device scope" is inappropriate to talk about in the backend, all the functions are device scope. 

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1486
+
+  // Perform all the necessary processing required.
+  processDeviceScopeSharedVariables(M, Kernels, LDSGlobals, LDSToFunction,
----------------
Pointless comment

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1505
+                      << ", skipping handling device of scope shared variables"
+                      << "\n");
+    return false;
----------------
Can merge this \n with the previous string

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1510
+  // Collect all the amdgpu kernels defined within the current module.
+  SetVector<Function *> Kernels;
+  for (auto &F : M.functions())
----------------
Don't see why this is a SetVector, the ordering shouldn't matter

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1511
+  SetVector<Function *> Kernels;
+  for (auto &F : M.functions())
+    if ((F.getCallingConv() == CallingConv::AMDGPU_KERNEL) &&
----------------
Braces

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1512
+  for (auto &F : M.functions())
+    if ((F.getCallingConv() == CallingConv::AMDGPU_KERNEL) &&
+        !F.isDeclaration())
----------------
Probably should handle all entry function CCs

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1519
+                      << ", skipping handling of device scope shared variables"
+                      << "\n");
+    return false;
----------------
Can merge this \n with the previous string

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1527-1529
+  LLVM_DEBUG(dbgs() << "===== Handling device scope shared variables in the "
+                       "module "
+                    << M.getName() << "\n");
----------------
Don't need this, the pass banner will be redundant

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1531-1533
+  // TODO: We only want to handle HIP kernels, and no kernels from from other
+  // programming languages, like OpenCL, OpenMP, etc. Do we need to add a
+  // condition here for it, and skip running the pass for non-HIP kernels?
----------------
Language doesn't matter, only the IR. This comment should be removed

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1534-1539
+  if (skipModule(M)) {
+    LLVM_DEBUG(dbgs() << "Skipping handling of device scope shared variables "
+                         "in the module "
+                      << M.getName() << "\n");
+    return false;
+  }
----------------
This is required for correctness, this can't be skipped

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1543-1545
+  LLVM_DEBUG(dbgs() << "===== Done with handling device scope shared variables "
+                       "in the module "
+                    << M.getName() << "\n");
----------------
Don't need this debug printing

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h:43
   static bool EnableFixedFunctionABI;
+  static bool EnableDeviceScopeSharedVariable;

----------------
I don't think this needs to be exposed here, there's no reason other places would need to inspect this

================
Comment at: llvm/test/CodeGen/AMDGPU/device-scope-lds-test-deep-function-calls.ll:13
+
+; OFFSET-TABLE: @__LDSGlobalsOffsetTable__
+
----------------
This should check the whole definition

================
Comment at: llvm/test/CodeGen/AMDGPU/device-scope-lds-test-indirect-call.ll:89
+; NEW-PARAM: i64 %1
+define internal void @_Z24function_two_with_no_ldsPiS_(i32* %i_arg, i32* %o_arg) unnamed_addr {
+; GCN-LABEL: entry:
----------------
Can you rename these functions to eliminate the C++ mangling? I'm guessing you derived this from an original C++ testcase, but you can make the IR test more human readable

================
Comment at: llvm/test/CodeGen/AMDGPU/device-scope-lds-test-indirect-call.ll:98
+
+; NEW-PARAM: i8 addrspace(3)* %0
+; NEW-PARAM: i64 %1
----------------
Need some -LABEL checks

================
Comment at: llvm/test/CodeGen/AMDGPU/device-scope-lds-test-lds-with-different-data-types.ll:109-114
+  %0 = addrspacecast i8 addrspace(1)* %ci_arg.coerce to i8*
+  %1 = addrspacecast i8 addrspace(1)* %co_arg.coerce to i8*
+  %2 = addrspacecast i32 addrspace(1)* %ii_arg.coerce to i32*
+  %3 = addrspacecast i32 addrspace(1)* %io_arg.coerce to i32*
+  %4 = addrspacecast float addrspace(1)* %fi_arg.coerce to float*
+  %5 = addrspacecast float addrspace(1)* %fo_arg.coerce to float*
----------------
Should use named values. Also we don't really need this addrspacecast noise at all. You can replace the target function with the original global address space

================
Comment at: llvm/test/CodeGen/AMDGPU/device-scope-lds-test-two-lds-arguments.ll:181
+}
+
+declare i32 @llvm.amdgcn.workitem.id.x()
----------------
Need some tests with some more exotic users, including
  # atomicrmw
  # cmpxchg
  # LDS intrinsics
  # Store of LDS pointer value
  # Store of LDS pointer value to itself
  # LDS global used inside a nested constant expression
  # Recursive LDS constant expression
  # Same LDS variable appears in a kernel and a called function

Also, need tests with 0 sized LDS variables

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D91516/new/

https://reviews.llvm.org/D91516