[PATCH] D91516: [AMDGPU] Support for device scope shared variables
Matt Arsenault via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Jan 7 09:27:16 PST 2021
arsenm added a comment.
This is ignoring the alignments of the variables. This needs some tests to stress padding edge cases, and also probably should try sorting to minimizing padding.
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:59-60
+// computed as in above item C.
+// F. Within each kernel, type cast the corresponding single big shared memory
+// layout to `char*`, and pass this type-casted pointer and the kernel
+// number as new function arguments along the call graph path(s) so that
----------------
No reason to mention C types
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:70
+//
+// Is one really exist?
+//
----------------
Don't understand this comment
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:197-198
+INITIALIZE_PASS(AMDGPUDeviceScopeSharedVariable,
+ "implement-amdgpu-device-scope-shared-variable",
+ "Implement AMDPGU Device Scope Shared Variable",
+ false /*only look at the cfg*/, false /*analysis pass*/)
----------------
I don't like including "implement" in the name, and most of the backend doesn't refer to this as shared memory. How about "amdgpu-lower-local-local-memory-globals"
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:198
+ "implement-amdgpu-device-scope-shared-variable",
+ "Implement AMDPGU Device Scope Shared Variable",
+ false /*only look at the cfg*/, false /*analysis pass*/)
----------------
Spelling AMDPGU
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:201-207
+static unsigned getTypeStoreSizeInBits(Module &M, Type *Ty) {
+ return M.getDataLayout().getTypeSizeInBits(Ty).getFixedSize();
+}
+
+static unsigned getTypeStoreSizeInBytes(Module &M, Type *Ty) {
+ return getTypeStoreSizeInBits(M, Ty) / 8;
+}
----------------
These functions only make this harder to read. You're losing value vs. just directly using the datalayout functions
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:242
+ // actually defined.
+#ifndef NDEBUG
+ assert(I->getParent()->getParent() == F &&
----------------
ifndef is redundant
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:250
+ // of `LDS` by `LDSAccessInst`.
+ Instruction *NewI = I->clone();
+ unsigned Ind = 0;
----------------
This function is too mechanically complicated. There shouldn't be any need to clone any instructions. Only individual values need replacing
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:281
+ // Cast away const-ness from `U`.
+ User *UU = const_cast<User *>(U);
+
----------------
Should not const_cast
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:292
+ } else
+ llvm_unreachable("Not Implemented."); // TODO: What else is missing?
+ }
----------------
Should be plain Constant, not just ConstantExpr
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:304-306
+ // Suffix the names of the instructions with unique integer values.
+ static int Suffix = 0;
+ ++Suffix;
----------------
This happens for you automatically, you don't need to manage this
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:327
+ KIndex, LIndex};
+ auto *GEPInst = GetElementPtrInst::CreateInBounds(
+ LDSOffsetTable->getValueType(), LDSOffsetTable, Indices,
----------------
Should use IRBuilder instead of the more cumbersome instruction constructors
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:332
+ // Insert LOAD instruction which loads `offset` value from LDS offset table.
+ auto *LInst = new LoadInst(GEPInst->getType()->getPointerElementType(),
+ GEPInst, Twine("dssv.load.") + Twine(Suffix),
----------------
Don't use the pointee element type
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:355-359
+static void updateFunctionAssociatedWithLDS(
+ Module &M, GlobalVariable *LDS,
+ ValueMap<GlobalVariable *, Function *> &LDSToFunction,
+ std::map<GlobalVariable *, uint64_t> &LDSToID,
+ GlobalVariable *LDSOffsetTable) {
----------------
Just inline this function?
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:428-432
+static void
+updateNonCallInsts(ValueMap<Function *, Function *> &OldCalleeToNewCallee,
+ SetVector<Instruction *> &NonCallInsts) {
+ replaceNonCallInsts(OldCalleeToNewCallee, NonCallInsts);
+}
----------------
Pointless function
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:454-457
+#ifndef NDEBUG
+ assert(MD && "Metadata about indirect call targets expected\n");
+#endif
+
----------------
This is not OK. The metadata is purely optional
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:469-470
+static void copyDataFromOldToNewCallSite(CallInst *OldCI, CallInst *NewCI) {
+ // TODO: Why copyMetadata() not copying meta data. I see metadat associated
+ // with CI, but it is not copied to NewCI. CI->hasMetadata() is false, why?
+ NewCI->copyMetadata(*OldCI);
----------------
Comment doesn't make sense
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:480
+ auto *BasePtrTy = Type::getInt8PtrTy(M.getContext(), AMDGPUAS::LOCAL_ADDRESS);
+ auto *KernNumTy = Type::getInt64Ty(M.getContext());
+
----------------
i32 should be more than sufficient
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:484-485
+ auto *FnTy = Callee->getFunctionType();
+ for (auto PI = FnTy->param_begin(), PE = FnTy->param_end(); PI != PE; ++PI)
+ NewParams.push_back(*PI);
+
----------------
I think you can do this in the constructor like NewParams(param_begin(), param_end())
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:496
+
+static void
+getNewArgumentList(Module &M, std::map<Function *, uint64_t> &KernelToID,
----------------
Comment this function
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:507-508
+ auto KernelNum = KernelToID[Caller];
+ KernNumArg = Constant::getIntegerValue(Type::getInt64Ty(M.getContext()),
+ APInt(64, KernelNum));
+ } else {
----------------
ConstantInt::get or IRBuilder::getInt64 is simpler
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:521-525
+static void replaceIndirectCallSites(
+ Module &M, std::map<Function *, uint64_t> &KernelToID,
+ std::map<Function *, Instruction *> &KernelToBasePtrInst,
+ ValueMap<Function *, Function *> &OldCalleeToNewCallee,
+ ValueMap<CallInst *, Function *> &IndirectCallSiteToCallee) {
----------------
A lot of these functions have too many parameters and would benefit from moving these to be pass class members
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:571
+static bool
+isApplicableInst(Instruction *I,
+ ValueMap<Function *, Function *> &NewCalleeToOldCallee) {
----------------
Function name and purpose is unclear
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:573
+ ValueMap<Function *, Function *> &NewCalleeToOldCallee) {
+ // We are only interested in the inctructions within kernel or within new
+ // cloned functions.
----------------
Spelling inctructions
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:575
+ // cloned functions.
+#ifndef NDEBUG
+ assert(I && "Valid instruction expected\n");
----------------
Redundant ifndef
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:624
+
+ auto *CI = dyn_cast<CallInst>(V);
+ if (CI && CI->isIndirectCall()) {
----------------
CallBase, should also catch invokes
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:650-654
+static void
+replaceDirectCallSites(Module &M, std::map<Function *, uint64_t> &KernelToID,
+ std::map<Function *, Instruction *> &KernelToBasePtrInst,
+ Function *NewCallee,
+ SetVector<CallInst *> &DirectCallSites) {
----------------
I don't think there should be a distinction between direct and indirect call sites. Both should behave the same way
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:940
+ // One is LDS layout base pointer which is of type `char*`, and other one is
+ // kernel number which is of type `Int64`.
+ //
----------------
Shouldn't refer to C types. Also is it really necessary to use i64 for the kernel number? 32-bits should be enough?
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1051
+ M, Arr2DTy, false, GlobalValue::InternalLinkage, Const2D,
+ Twine("__LDSGlobalsOffsetTable__"), nullptr,
+ GlobalVariable::NotThreadLocal, AMDGPUAS::CONSTANT_ADDRESS);
----------------
Should make sure this uses a reserved name with periods/prefix. How about llvm.amdgcn.lds.globals.offset.table
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1057
+ LDSOffsetTable->setAlignment(
+ MaybeAlign(M.getDataLayout().getPreferredAlign(LDSOffsetTable)));
+
----------------
MaybeAlign should be unnecessary (or at least should be Align)
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1077
+ NewLDS->setUnnamedAddr(GlobalValue::UnnamedAddr::Global);
+ NewLDS->setAlignment(MaybeAlign(M.getDataLayout().getPreferredAlign(NewLDS)));
+
----------------
This needs to be the alignment implied by the individual variables
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1083
+static void
+computeTotalLDSSizeInBytes(ValueMap<GlobalVariable *, uint64_t> &LDSToSize,
+ SetVector<GlobalVariable *> &KernelLDSList,
----------------
I don't see why we need a map of the variable to the size, you can always trivially compute it from the value
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1086
+ std::map<GlobalVariable *, uint64_t> &LDSToOffset,
+ uint64_t &TotalLDSSizeInBytes) {
+ // For the current kernel, compute the total size of all LDS globals, and also
----------------
Making this an out argument is confusing, just use the return value
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1087-1094
+ // For the current kernel, compute the total size of all LDS globals, and also
+ // offsets associated with them within in the new LDS global.
+ TotalLDSSizeInBytes = 0;
+ for (auto *LDS : KernelLDSList) {
+ LDSToOffset[LDS] = TotalLDSSizeInBytes;
+ TotalLDSSizeInBytes += LDSToSize[LDS];
+ }
----------------
This seems to be totally ignoring the alignment of the individual variables
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1096
+
+static void constructKernelSpecificLDSLayouts(
+ Module &M, SetVector<Function *> &Kernels,
----------------
Document function
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1167
+
+ // Create a single contigeous LDS global layout for each kernel.
+ std::map<Function *, std::map<GlobalVariable *, uint64_t>> KernelToLDSOffset;
----------------
Spelling contigeous
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1203-1204
+
+static void
+getLDSGlobalSizeInBytes(Module &M, GlobalVariable *LDSGlobal,
+ ValueMap<GlobalVariable *, uint64_t> &LDSToSize) {
----------------
I think having this as a separate function just makes this harder to follow
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1206
+ ValueMap<GlobalVariable *, uint64_t> &LDSToSize) {
+ LDSToSize[LDSGlobal] = getTypeStoreSizeInBytes(M, LDSGlobal->getValueType());
+}
----------------
This needs to be the the type allocation size
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1243-1247
+static void filterDeviceScopeLDSGlobals(
+ SetVector<GlobalVariable *> &LDSGlobals,
+ ValueMap<GlobalVariable *, Function *> &LDSToFunction) {
+ // Filter out all LDS globals which are defined within kernels since we don`t
+ // need to handle them.
----------------
This doesn't make sense, the same variable may appear in a kernel and in a function.
Also this refers to "device scope" which is not a concept in the backend
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1417-1420
+ if (auto *I = dyn_cast<Instruction>(U)) {
+ auto *F = I->getParent()->getParent();
+ if (!F)
+ continue;
----------------
What happens if an LDS variable is used in a constant initializer?
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1422-1424
+ // We are only interested in device scope shared variables.
+ if (F->getCallingConv() != CallingConv::AMDGPU_KERNEL)
+ LDSToFunction[LDSGlobal] = F;
----------------
I don't understand this. The whole point of the pass is to handle non-kernel uses
"Device scope" is inappropriate to talk about in the backend, all the functions are device scope.
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1486
+
+ // Perform all the necessary processing required.
+ processDeviceScopeSharedVariables(M, Kernels, LDSGlobals, LDSToFunction,
----------------
Pointless comment
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1505
+ << ", skipping handling device of scope shared variables"
+ << "\n");
+ return false;
----------------
Can merge this \n with the previous string
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1510
+ // Collect all the amdgpu kernels defined within the current module.
+ SetVector<Function *> Kernels;
+ for (auto &F : M.functions())
----------------
Don't see why this is a SetVector, the ordering shouldn't matter
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1511
+ SetVector<Function *> Kernels;
+ for (auto &F : M.functions())
+ if ((F.getCallingConv() == CallingConv::AMDGPU_KERNEL) &&
----------------
Braces
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1512
+ for (auto &F : M.functions())
+ if ((F.getCallingConv() == CallingConv::AMDGPU_KERNEL) &&
+ !F.isDeclaration())
----------------
Probably should handle all entry function CCs
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1519
+ << ", skipping handling of device scope shared variables"
+ << "\n");
+ return false;
----------------
Can merge this \n with the previous string
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1527-1529
+ LLVM_DEBUG(dbgs() << "===== Handling device scope shared variables in the "
+ "module "
+ << M.getName() << "\n");
----------------
Don't need this, the pass banner will be redundant
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1531-1533
+ // TODO: We only want to handle HIP kernels, and no kernels from from other
+ // programming languages, like OpenCL, OpenMP, etc. Do we need to add a
+ // condition here for it, and skip running the pass for non-HIP kernels?
----------------
Language doesn't matter, only the IR. This comment should be removed
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1534-1539
+ if (skipModule(M)) {
+ LLVM_DEBUG(dbgs() << "Skipping handling of device scope shared variables "
+ "in the module "
+ << M.getName() << "\n");
+ return false;
+ }
----------------
This is required for correctness, this can't be skipped
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUDeviceScopeSharedVariable.cpp:1543-1545
+ LLVM_DEBUG(dbgs() << "===== Done with handling device scope shared variables "
+ "in the module "
+ << M.getName() << "\n");
----------------
Don't need this debug printing
================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.h:43
static bool EnableFixedFunctionABI;
+ static bool EnableDeviceScopeSharedVariable;
----------------
I don't think this needs to be exposed here, there's no reason other places would need to inspect this
================
Comment at: llvm/test/CodeGen/AMDGPU/device-scope-lds-test-deep-function-calls.ll:13
+
+; OFFSET-TABLE: @__LDSGlobalsOffsetTable__
+
----------------
This should check the whole definition
================
Comment at: llvm/test/CodeGen/AMDGPU/device-scope-lds-test-indirect-call.ll:89
+; NEW-PARAM: i64 %1
+define internal void @_Z24function_two_with_no_ldsPiS_(i32* %i_arg, i32* %o_arg) unnamed_addr {
+; GCN-LABEL: entry:
----------------
Can you rename these functions to eliminate the C++ mangling? I'm guessing you derived this from an original C++ testcase, but you can make the IR test more human readable
================
Comment at: llvm/test/CodeGen/AMDGPU/device-scope-lds-test-indirect-call.ll:98
+
+; NEW-PARAM: i8 addrspace(3)* %0
+; NEW-PARAM: i64 %1
----------------
Need some -LABEL checks
================
Comment at: llvm/test/CodeGen/AMDGPU/device-scope-lds-test-lds-with-different-data-types.ll:109-114
+ %0 = addrspacecast i8 addrspace(1)* %ci_arg.coerce to i8*
+ %1 = addrspacecast i8 addrspace(1)* %co_arg.coerce to i8*
+ %2 = addrspacecast i32 addrspace(1)* %ii_arg.coerce to i32*
+ %3 = addrspacecast i32 addrspace(1)* %io_arg.coerce to i32*
+ %4 = addrspacecast float addrspace(1)* %fi_arg.coerce to float*
+ %5 = addrspacecast float addrspace(1)* %fo_arg.coerce to float*
----------------
Should use named values. Also we don't really need this addrspacecast noise at all. You can replace the target function with the original global address space
================
Comment at: llvm/test/CodeGen/AMDGPU/device-scope-lds-test-two-lds-arguments.ll:181
+}
+
+declare i32 @llvm.amdgcn.workitem.id.x()
----------------
Need some tests with some more exotic users, including
# atomicrmw
# cmpxchg
# LDS intrinsics
# Store of LDS pointer value
# Store of LDS pointer value to itself
# LDS global used inside a nested constant expression
# Recursive LDS constant expression
# Same LDS variable appears in a kernel and a called function
Also, need tests with 0 sized LDS variables
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D91516/new/
https://reviews.llvm.org/D91516
More information about the llvm-commits
mailing list