[llvm] [NVPTX] Add cta_group support to TMA G2S intrinsics (PR #143178)

Mon Jun 9 14:40:06 PDT 2025

================
@@ -1034,18 +1034,22 @@ source tensor is preserved at the destination. The dimension of the
 tensor data ranges from 1d to 5d with the coordinates specified
 by the ``i32 %d0 ... i32 %d4`` arguments.
 
-* The last two arguments to these intrinsics are boolean flags
-  indicating support for cache_hint and/or multicast modifiers.
-  These flag arguments must be compile-time constants. The backend
-  looks through these flags and lowers the intrinsics appropriately.
+* The last three arguments to these intrinsics are boolean flags
+  indicating support for multicast, cache_hint and cta_group::2
+  modifiers. These flag arguments must be compile-time constants.
+  The backend looks through these flags and lowers the intrinsics
+  appropriately.
----------------
Artem-B wrote:

OK. I don't have a better solution.

The proliferation of instruction variants in NVPTX, and the fact that the compiler can't really do much with most of them, makes me think that what we actually need is more flexible inline assembly. AFAICT, the only benefit intrinsics have over inline asm right now is that they let us specify attributes, which give the compiler some hints about the instruction's behavior. If there were a way to let users provide those attributes themselves, we would not need these intrinsics at all. Users could then generate whatever instructions they need, with appropriate hints for the compiler. I guess we could theoretically achieve that via a target-specific clobber list provided to inline asm.
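
For context, here is a minimal sketch of what the clobber list can already express today with GCC/Clang-style extended inline asm. It is written in host C++ rather than PTX so that it builds on any target, and the asm bodies are deliberately empty. The helper names `opaque` and `compiler_barrier` are made up for illustration, and the target-specific clobbers suggested above do not exist; this only shows the existing mechanism they would extend.

```cpp
// Assumes a GCC- or Clang-compatible compiler (extended asm syntax).

// The "+r" constraint tells the compiler that the asm reads and writes x
// in a register, so it cannot constant-fold through this call even though
// the asm body emits no instructions.
static inline int opaque(int x) {
    asm volatile("" : "+r"(x));
    return x;
}

// The "memory" clobber tells the compiler that the asm may read or write
// arbitrary memory, so loads and stores are not reordered across it.
// This is the kind of coarse behavioral hint a clobber list carries today.
static inline void compiler_barrier() {
    asm volatile("" ::: "memory");
}
```

Both hints are all-or-nothing: there is currently no way to say, for example, that an asm statement only touches memory reachable through its pointer arguments, which is the sort of finer-grained information intrinsic attributes provide.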

With users in control of the exotic instructions, LLVM would then only have to deal with the intrinsics where the compiler can do something meaningful.

This is just me thinking out loud. I haven't looked at the details.

https://github.com/llvm/llvm-project/pull/143178