[PATCH] D138141: [amdgpu] Reimplement LDS lowering

Thu Dec 1 06:02:41 PST 2022

JonChesterfield added a comment.

@arsenm thanks, all addressed. Constantexpr were a recurring pain for this pass because their uniqueness interferes with the callgraph walking so are unconditionally rewritten into instructions at the top of run().

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp:589
+        if (!I) {
+          continue;
+        }
----------------
arsenm wrote:
> The constant expression case matters? 
> 
> In particular instruction uses of addrspacecasted globals come up a lot. The constant initializer case is harder 
Constant expressions involving LDS variables are unconditionally expanded before this point by `eliminateConstantExprUsesOfLDSFromAllInstructions`.

Primarily because one constantexpr can be reachable from multiple kernels through instructions which are only reachable from distinct kernels, so eliminating them up front makes the analysis more precise as well as the rest of this pass much simpler.

Constant initializer is indeed harder and is presently out of scope - LDS variables have undef as their initialiser, and non-LDS variables have program scope which is broadly resistant to per-kernel-execution initialisation with the address of LDS variables.

An exception for that can be carved out for invariant LDS addresses (e.g. if we want an attribute to say a given global is always at offset 42, whatever kernel is involved), but we don't have a concept for that implemented yet beyond the emergent feature of the back end calculating offsets into a per-kernel struct.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp:477
+
+    return {direct_map_kernel, indirect_map_kernel};
+  }
----------------
jmmartinez wrote:
> I might be wrong, but I think the copy-elision is done for the newly created pair, but not for the arguments used for create it.
> 
> I'd rather do:
> ```
> return {std::move(direct_map_kernel), std::move(indirect_map_kernel)};
> ```
> 
> Without move: https://godbolt.org/z/f5Ghr8M1o
> With move: https://godbolt.org/z/WeG7Y8nY8
sure

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp:491
+  static Constant *getAddressesOfVariablesInKernel(
+      LLVMContext &Ctx, std::vector<GlobalVariable *> Variables,
+      DenseMap<GlobalVariable *, Constant *> &LDSVarsToConstantGEP) {
----------------
jmmartinez wrote:
> foad wrote:
> > jmmartinez wrote:
> > > Is there a missing const& for the Variables argument ?
> > Isn't `std::vector<GlobalVariable *> const &` is just a long way of saying `ArrayRef<GlobalVariable *>`?
> +1
arrayref is good, thanks

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp:517
+      Module &M, std::vector<GlobalVariable *> const &Variables,
+      std::vector<Function *> kernels,
+      DenseMap<Function *, LDSVariableReplacement> &KernelToReplacement) {
----------------
jmmartinez wrote:
> Could you use a const& for kernels ?
got some mileage out of ArrayRef for these arguments, thanks

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp:735-738
+      DenseSet<Function *> const &HybridModuleRootKernels =
+          HybridModuleRoot
+              ? LDSToKernelsThatNeedToAccessItIndirectly[HybridModuleRoot]
+              : DenseSet<Function *>{};
----------------
jmmartinez wrote:
> In the case HybridModuleRoot is null, the reference HybridModuleRootKernels will point to a temporary that has already been destroyed.
Today I learned temporary lifetime extension doesn't work through the ternary operator. Sad times. Also using nullptr as the key into the map wins me a segfault, so gone with ugly workaround pending a better idea.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp:1154
+  void replaceLDSVariablesWithStructVec(
+      Module &M, std::vector<GlobalVariable *> const &LDSVarsToTransformArg,
+      LDSVariableReplacement Replacement, PredicateTy Predicate) {
----------------
jmmartinez wrote:
> What do you think about taking an r-value reference for LDSVarsToTransformArg. That would avoid the copy into the local variable LDSVarsToTransform.
this was refactoring noise, inlined replaceLDSVariablesWithStructVec into replaceLDSVariablesWithStruct to remove

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp:795-796
           } else {
-            // However if we are certain this kernel cannot call a function that
-            // requires module LDS, annotate the kernel so the backend can elide
-            // the allocation without repeating callgraph walks.
-            Func.addFnAttr("amdgpu-elide-module-lds");
+            report_fatal_error("Cannot lower LDS to kernel access as it is "
+                               "reachable from multiple kernels");
           }
----------------
arsenm wrote:
> Probably should promote this to a DiagnosticInfoUnsupported
i think this is OK as is - it's an undocumented command line flag, where "kernel" is commented as only applicable in narrow cases - and the report_fatal_error path is otherwise only reachable on a bug in this pass

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp:14
 //
-// Local variables with constant annotation or non-undef initializer are passed
+// The programming model is global variables, or equivalently function local
+// static variables, accessible from kernels or other functions. For uses from
----------------
arsenm wrote:
> I somehow missed this giant comment in the first pass. Perhaps this should move to AMDGPUUsage 
reluctant to document this in user facing fashion - there's a credible risk of it being improved in the future, and none of the flags here are intended to be used on a per application basis

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp:640
+        Value *replacement =
+            Builder.CreateIntToPtr(extended, GV->getType(), GV->getName());
+
----------------
arsenm wrote:
> JonChesterfield wrote:
> > arsenm wrote:
> > > Can you add an assert somewhere that the maximum LDS size fits in 16-bits? I'm also wondering if we should just make LDS pointers 16-bit with 4 byte alignment, or add a 16-bit address space for this
> > The backend currently errors on that in user-facing fashion. A check here will be slightly less accurate than in the back end because this runs before promote alloca puts more things into LDS and I'm not sure we have more information to provide a better report either - what sort of error message do you have in mind?
> > 
> > I suppose the current code path would truncate the offsets to i16 before the back end crashes, but that seems harmless. We'd have slightly cleaner tests with a native sixteen bit type but there's not much in it.
> I don't mean here, I mean somewhere. Promote alloca also shouldn't be pushing the usage up into something unsupportable
changed it to i32 for now, can always back it back down to i16 with more error reporting later if we want to. Stopped short of dropping inttoptr as it's difficult to test with D131246 applied

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138141/new/

https://reviews.llvm.org/D138141