[llvm] [NVPTX] Add NVPTX intrinsics for TMA copies (PR #95289)

Thu Jun 13 12:53:46 PDT 2024

================
@@ -1448,6 +1448,26 @@ defm int_nvvm_cp_async_ca_shared_global_8 : CP_ASYNC_SHARED_GLOBAL<"8", "ca">;
 defm int_nvvm_cp_async_ca_shared_global_16 : CP_ASYNC_SHARED_GLOBAL<"16", "ca">;
 defm int_nvvm_cp_async_cg_shared_global_16 : CP_ASYNC_SHARED_GLOBAL<"16", "cg">;
 
+// TODO(apaszke): Multicast TMA loads
----------------
durga4github wrote:

[Only for the TMA intrinsics part]:
The TMA intrinsic changes look good to me.

We have the intrinsics implemented, with tests, for cp.async.bulk.tensor (including the multicast, cache-hint, and im2col variants). I can submit those intrinsic changes by early next week, if we all agree and that timeline is acceptable. @apaszke, @jlebar, please let me know what you think.

https://github.com/llvm/llvm-project/pull/95289