[llvm] Enable .ptr .global .align attributes for kernel attributes for CUDA (PR #114874)

Wed Nov 6 09:42:50 PST 2024

================
@@ -1600,29 +1600,37 @@ void NVPTXAsmPrinter::emitFunctionParamList(const Function *F, raw_ostream &O) {
 
       if (isKernelFunc) {
         if (PTy) {
-          // Special handling for pointer arguments to kernel
           O << "\t.param .u" << PTySizeInBits << " ";
 
-          if (static_cast<NVPTXTargetMachine &>(TM).getDrvInterface() !=
-              NVPTX::CUDA) {
-            int addrSpace = PTy->getAddressSpace();
-            switch (addrSpace) {
-            default:
-              O << ".ptr ";
-              break;
-            case ADDRESS_SPACE_CONST:
-              O << ".ptr .const ";
-              break;
-            case ADDRESS_SPACE_SHARED:
-              O << ".ptr .shared ";
-              break;
-            case ADDRESS_SPACE_GLOBAL:
-              O << ".ptr .global ";
-              break;
-            }
-            Align ParamAlign = I->getParamAlign().valueOrOne();
-            O << ".align " << ParamAlign.value() << " ";
+          int addrSpace = PTy->getAddressSpace();
+          const bool IsCUDA =
+              static_cast<NVPTXTargetMachine &>(TM).getDrvInterface() ==
+              NVPTX::CUDA;
+
+          O << ".ptr ";
+          switch (addrSpace) {
+          default:
+            // Special handling for pointer arguments to kernel
+            // CUDA kernels assume that pointers are in global address space
+            // See:
+            // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parameter-state-space
+            if (IsCUDA)
+              O << " .global ";
+            break;
+          case ADDRESS_SPACE_CONST:
+            O << " .const ";
+            break;
+          case ADDRESS_SPACE_SHARED:
+            O << " .shared ";
+            break;
+          case ADDRESS_SPACE_GLOBAL:
+            O << " .global ";
+            break;
           }
+
+          Align ParamAlign = I->getParamAlign().valueOrOne();
+          if (ParamAlign != 1 || !IsCUDA)
----------------
LewisCrawford wrote:

I agree that makes more sense. I wasn't sure whether there was a good reason for avoiding emitting .align 1 in the original patch, so I kept that behaviour intact.

I've changed the code to emit all explicitly specified alignments on CUDA now.

Do we still need to preserve the CUDA vs OpenCL behavioural difference of emitting .align 1 for CL by default, and having no .align specifier for CUDA by default if no align is specified too, or can they both be changed to just not emit the alignment if it is not specified?

https://github.com/llvm/llvm-project/pull/114874