[llvm] Enable .ptr .global .align attributes for kernel attributes for CUDA (PR #114874)

Wed Nov 6 09:43:03 PST 2024

================
@@ -1600,29 +1600,37 @@ void NVPTXAsmPrinter::emitFunctionParamList(const Function *F, raw_ostream &O) {
 
       if (isKernelFunc) {
         if (PTy) {
-          // Special handling for pointer arguments to kernel
           O << "\t.param .u" << PTySizeInBits << " ";
 
-          if (static_cast<NVPTXTargetMachine &>(TM).getDrvInterface() !=
-              NVPTX::CUDA) {
-            int addrSpace = PTy->getAddressSpace();
-            switch (addrSpace) {
-            default:
-              O << ".ptr ";
-              break;
-            case ADDRESS_SPACE_CONST:
-              O << ".ptr .const ";
-              break;
-            case ADDRESS_SPACE_SHARED:
-              O << ".ptr .shared ";
-              break;
-            case ADDRESS_SPACE_GLOBAL:
-              O << ".ptr .global ";
-              break;
-            }
-            Align ParamAlign = I->getParamAlign().valueOrOne();
-            O << ".align " << ParamAlign.value() << " ";
+          int addrSpace = PTy->getAddressSpace();
+          const bool IsCUDA =
+              static_cast<NVPTXTargetMachine &>(TM).getDrvInterface() ==
+              NVPTX::CUDA;
+
+          O << ".ptr ";
+          switch (addrSpace) {
+          default:
+            // Special handling for pointer arguments to kernel
+            // CUDA kernels assume that pointers are in global address space
+            // See:
+            // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parameter-state-space
+            if (IsCUDA)
+              O << " .global ";
----------------
LewisCrawford wrote:

The original patch which gave us perf improvements did so by assuming all pointers used as kernel args are .global. 

However, I have been struggling to reproduce that case locally myself, and as far as I can tell from the original discussions, it might only be the missing .align info that was needed anyway (as the address space can be inferred from things like `cvta.to.global.u64 ` and `ld.global.nc` already).

I agree that it is much cleaner to just honour the LLVM address space, rather than making undocumented and potentially invalid assumptions. If there turn out to be strong perf reasons for assuming .global for generic pointers later on, we should probably document this implicit assumption better somewhere.

I've now also added support for emitting .local here, as the PTX spec says this is technically allowed here too: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#kernel-parameter-attribute-ptr

https://github.com/llvm/llvm-project/pull/114874