[PATCH] D91590: [NVPTX] Efficently support dynamic index on CUDA kernel aggregate parameters.

Wed Nov 18 00:15:00 PST 2020

hliao added a comment.

As mentioned earlier, that's very experimental support. Even though the SASS looks reasonable, it still needs verifying on real systems. For non-kernel functions, it seems we share the path. So that we should do a similar thing. The current approach fixes that in the codegen phase by adding back the `alloca` to match the parameter space semantic. Once that alloca is dynamically indexed, it won't be promoted in SROA. Only `instcomb` eliminates that `alloca` when it is only modified once by copying from a constant memory. As `instcomb` won't break certain patterns prepared in the codegen preparation, it won't run in the backend. That dynamically indexed `alloca` won't be removed.

================
Comment at: clang/test/CodeGenCUDA/kernel-args.cu:13-14
 // AMDGCN: define amdgpu_kernel void @_Z6kernel1A(%struct.A addrspace(4)* byref(%struct.A) align 8 %{{.+}})
-// NVPTX: define void @_Z6kernel1A(%struct.A* byval(%struct.A) align 8 %x)
+// NVPTX: define void @_Z6kernel1A(%struct.A addrspace(101)* byref(%struct.A) align 8 %0)
 __global__ void kernel(A x) {
 }
----------------
tra wrote:
> Is the idea here to rely on PTX to store the value in param space (so we do actually pass the parameter by value)  and represent it on IR level as a reference to an an externally-provided storage with the value.
> So:
> - C++ passes argument by value
> - IR knows that PTX will store it somewhere in param space and uses `byref`
> - we still generate PTX which has parameter passed by value, but now we can access it directly via a reference to param-space value.
> 
> Presumably for parameters we do want to modify, we'll need to fall back to having a local copy.
> 
> So far so good. However, now we may have a problem distinguishing between C++-level arguments passed by value vs by reference -- they all will look like `byref` on IR level. That is, unless you rely on `addrspace(101)` to indicate that it's actually a `byval` in disguise. 
> 
> It looks plausible as long as we can guarantee that we never modify it. Neither in the current function nor in any of the callees, if we pass it by reference. 
> 
> I'm not particularly familiar with AA machinery. I'd appreciate if you could elaborate on how you see it all work end-to-end.
> 
It does the same thing as `nvptx-lower-args` does but applies that earlier in the frontend. The upside is that IR is optimized by all the middle-end opts. `instcomb` will remove that dynamically indexed `alloca` if it's only modified by copying from constant memory. AA teaches the compiler that parameter space has the property of constantness. Even though we run SROA after `nvptx-lower-args`, but we general won't run `instcomb` in the backend as it potentially breaks certain patterns prepared in the codegen preparation phase.

`byref` (newly added) in LLVM IR is different from by-reference in C++. The later is translated into a pointer. `byref` in LLVM IR says that content of that pointer should not be modified in the function body. It won't be ambiguous from the IR side.

It's still possible for the backend to do similar stuff. Once that `byval` argument has `readonly`, that `alloca` could be skipped.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D91590/new/

https://reviews.llvm.org/D91590