[llvm] [NVPTX] Copy kernel arguments as byte array (PR #110356)

Wed Oct 2 11:35:19 PDT 2024

================
@@ -623,13 +623,33 @@ void NVPTXLowerArgs::handleByValParam(const NVPTXTargetMachine &TM,
     Value *ArgInParam = new AddrSpaceCastInst(
         Arg, PointerType::get(Arg->getContext(), ADDRESS_SPACE_PARAM),
         Arg->getName(), FirstInst);
+    // Create an opaque type of same size as StructType but without padding
+    // holes as this could have been a union.
+    const auto StructBytes = *AllocA->getAllocationSize(DL);
+    SmallVector<Type *, 5> ChunkTypes;
+    if (StructBytes >= 16) {
+        Type *IntType = Type::getInt64Ty(Func->getContext());
+        Type *ChunkType = VectorType::get(IntType, 2, false);
+        Type *OpaqueType = StructBytes < 32 ? ChunkType :
+                           ArrayType::get(ChunkType, StructBytes / 16);
+        ChunkTypes.push_back(OpaqueType);
+    }
+    for (const auto ChunkBytes: {8, 4, 2, 1}) {
+      if (StructBytes & ChunkBytes) {
+          Type *ChunkType = Type::getIntNTy(Func->getContext(), 8 * ChunkBytes);
+          ChunkTypes.push_back(ChunkType);
+      }
+    }
+    Type * OpaqueType = ChunkTypes.size() == 1 ? ChunkTypes[0] :
+                        StructType::create(ChunkTypes);
----------------
Artem-B wrote:

I don't think there's any benefit in using types larger than `i32`. As long as we preserve the alignment info, the loads/stores should end up vectorized. `ld/st.v4.b32` is as good as it gets, since under the hood both it and `v2.b64` end up using the same 128-bit loads/stores into four 32-bit registers.
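
To make the suggestion concrete, here is a rough sketch of how the chunking could look if nothing wider than `i32` were used. It is standalone C++ that only computes the chunk type names, not LLVM code, and `chunkTypesI32` is a hypothetical helper that does not exist in the patch or in LLVM. It assumes the same "largest chunk first, then a power-of-two tail" scheme as the diff above, and it leaves out the alignment bookkeeping entirely.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of the patch): lists the chunk types that
// cover `bytes` bytes without padding, using no integer wider than i32.
std::vector<std::string> chunkTypesI32(uint64_t bytes) {
  std::vector<std::string> chunks;
  if (bytes >= 4) {
    // A single i32, or an array of i32 when more than one fits.
    chunks.push_back(bytes < 8
                         ? std::string("i32")
                         : "[" + std::to_string(bytes / 4) + " x i32]");
  }
  // The remaining 0-3 bytes are covered by at most one i16 and one i8.
  for (unsigned chunkBytes : {2u, 1u}) {
    if (bytes & chunkBytes)
      chunks.push_back("i" + std::to_string(8 * chunkBytes));
  }
  return chunks;
}
```

For a 28-byte argument this yields a single `[7 x i32]` chunk, and for a 7-byte argument it yields `i32`, `i16`, `i8`. Whether the resulting accesses are actually merged into `v4.b32` would still depend on the alignment attached to the loads/stores.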

https://github.com/llvm/llvm-project/pull/110356