[llvm] [LLVM][NVPTX] Add codegen support for tcgen05.{ld, st} instructions (PR #126740)
via llvm-commits
llvm-commits at lists.llvm.org
Tue Feb 11 07:03:33 PST 2025
llvmbot wrote:
@llvm/pr-subscribers-llvm-ir
Author: Pradeep Kumar (schwarzschild-radius)
<details>
<summary>Changes</summary>
This commit adds codegen support for the tcgen05.{ld, st} instructions, along with lit tests (tcgen05-ld.ll, tcgen05-st.ll) and intrinsics documentation in NVPTXUsage.rst.
---
Patch is 175.10 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/126740.diff
9 Files Affected:
- (modified) llvm/docs/NVPTXUsage.rst (+95)
- (modified) llvm/include/llvm/IR/Intrinsics.td (+1)
- (modified) llvm/include/llvm/IR/IntrinsicsNVVM.td (+65)
- (modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp (+302)
- (modified) llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.h (+2)
- (modified) llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp (+416)
- (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+86)
- (added) llvm/test/CodeGen/NVPTX/tcgen05-ld.ll (+335)
- (added) llvm/test/CodeGen/NVPTX/tcgen05-st.ll (+981)
``````````diff
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 1680b1143353701..d5aed5b3e904ebd 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -1175,6 +1175,101 @@ For more information, refer to the PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensorcore-5th-generation-instructions-tcgen05-fence>`_.
+'``llvm.nvvm.tcgen05.ld.*``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare <n x i32> @llvm.nvvm.tcgen05.ld.<shape>.<num>(ptr addrspace(6) %tmem_addr, i1 %pack)
+
+ declare <n x i32> @llvm.nvvm.tcgen05.ld.16x32bx2.<num>(ptr addrspace(6) %tmem_addr, i64 %offset, i1 %pack)
+
+Overview:
+"""""""""
+
+This group of intrinsics asynchronously loads data from the Tensor Memory at the location specified
+by the 32-bit address operand `tmem_addr` into the destination registers, collectively across all
+threads of the warp.
+
+All the threads in the warp must specify the same value of `tmem_addr`, which must be the base address
+of the collective load operation. Otherwise, the behavior is undefined.
+
+The `shape` and `num` qualifiers together determine the total dimension of the data (`n`)
+loaded from the Tensor Memory. The `shape` qualifier indicates the base dimension of the data,
+and the `num` qualifier indicates the repeat factor on that base dimension, giving the total
+dimension of the data that is accessed.
+
+Allowed values for `num` are `x1`, `x2`, `x4`, `x8`, `x16`, `x32`, `x64`, `x128`.
+
+Allowed values for `shape` in the first intrinsic are `16x64b`, `16x128b`, `16x256b`, `32x32b`.
+
+The allowed value for `shape` in the second intrinsic is `16x32bx2`.
+
+The result of the intrinsic is a vector consisting of one or more 32-bit registers derived from `shape` and
+`num` as shown below.
+
+=========== ========================= ========== ==========
+ num/shape 16x32bx2/16x64b/32x32b 16x128b 16x256b
+=========== ========================= ========== ==========
+ x1 1 2 4
+ x2 2 4 8
+ x4 4 8 16
+ x8 8 16 32
+ x16 16 32 64
+ x32 32 64 128
+ x64 64 128 NA
+ x128 128 NA NA
+=========== ========================= ========== ==========
+
+The last argument `i1 %pack` is a compile-time constant which, when set, indicates that adjacent
+columns are packed into a single 32-bit element during the load.
+
+For more information, refer to the
+`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-instructions-tcgen05-ld>`__.
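+
+For example (an illustrative sketch rather than one of the lit tests in this patch), a `16x64b`
+load with the `x2` multiplier returns two 32-bit values per thread, and the two-column `16x32bx2`
+variant additionally takes an immediate column offset (here a hypothetical offset of 2):
+
+.. code-block:: llvm
+
+  %r1 = call <2 x i32> @llvm.nvvm.tcgen05.ld.16x64b.x2(ptr addrspace(6) %tmem_addr, i1 false)
+  %r2 = call <2 x i32> @llvm.nvvm.tcgen05.ld.16x32bx2.x2(ptr addrspace(6) %tmem_addr, i64 2, i1 false)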
+
+
+'``llvm.nvvm.tcgen05.st.*``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.tcgen05.st.<shape>.<num>(ptr addrspace(6) %tmem_addr, <n x i32> %args, i1 %unpack)
+
+ declare void @llvm.nvvm.tcgen05.st.16x32bx2.<num>(ptr addrspace(6) %tmem_addr, <n x i32> %args, i64 %offset, i1 %unpack)
+
+Overview:
+"""""""""
+
+This group of intrinsics asynchronously stores data from the source vector into the Tensor Memory
+at the location specified by the 32-bit address operand `tmem_addr`, collectively across all threads
+of the warp.
+
+All the threads in the warp must specify the same value of `tmem_addr`, which must be the base address
+of the collective store operation. Otherwise, the behavior is undefined.
+
+The `shape` and `num` qualifiers together determine the total dimension of the data (`n`)
+stored to the Tensor Memory. The `shape` qualifier indicates the base dimension of the data,
+and the `num` qualifier indicates the repeat factor on that base dimension, giving the total
+dimension of the data that is accessed.
+
+Allowed values for `num` are `x1`, `x2`, `x4`, `x8`, `x16`, `x32`, `x64`, `x128`.
+
+Allowed values for `shape` in the first intrinsic are `16x64b`, `16x128b`, `16x256b`, `32x32b`.
+
+The allowed value for `shape` in the second intrinsic is `16x32bx2`.
+
+The `args` argument is a vector consisting of one or more 32-bit registers, derived from `shape`
+and `num` as shown in the table in the `tcgen05.ld` section.
+
+The last argument `i1 %unpack` is a compile-time constant which, when set, enables the unpack mode:
+each 32-bit element in the register is unpacked into two 16-bit elements and stored in adjacent
+columns. Setting `%unpack` to 0 disables the unpack mode.
+
+For more information, refer to the
+`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-instructions-tcgen05-st>`__.
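+
+For example (again an illustrative sketch rather than a test from this patch), a `32x32b` store
+with the `x4` multiplier writes four 32-bit values per thread, with unpack mode disabled:
+
+.. code-block:: llvm
+
+  call void @llvm.nvvm.tcgen05.st.32x32b.x4(ptr addrspace(6) %tmem_addr, <4 x i32> %args, i1 false)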
+
Other Intrinsics
----------------
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 14ecae41ff08f92..62239ca705b9e24 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -564,6 +564,7 @@ def llvm_v8i32_ty : LLVMType<v8i32>; // 8 x i32
def llvm_v16i32_ty : LLVMType<v16i32>; // 16 x i32
def llvm_v32i32_ty : LLVMType<v32i32>; // 32 x i32
def llvm_v64i32_ty : LLVMType<v64i32>; // 64 x i32
+def llvm_v128i32_ty : LLVMType<v128i32>; //128 x i32
def llvm_v256i32_ty : LLVMType<v256i32>; //256 x i32
def llvm_v1i64_ty : LLVMType<v1i64>; // 1 x i64
diff --git a/llvm/include/llvm/IR/IntrinsicsNVVM.td b/llvm/include/llvm/IR/IntrinsicsNVVM.td
index 6af1f2a1667734c..f6bfa575c8e01c4 100644
--- a/llvm/include/llvm/IR/IntrinsicsNVVM.td
+++ b/llvm/include/llvm/IR/IntrinsicsNVVM.td
@@ -656,6 +656,35 @@ class CP_ASYNC_BULK_TENSOR_REDUCE_INTR<int dim, string mode, string op> {
ImmArg<ArgIndex<FlagsStartIdx>>];
}
+class NVVM_TCGEN05_LDST_NAME<string Op, string Shape, int Num> {
+ string intr = "llvm.nvvm.tcgen05." # Op
+ # "." # Shape
+ # "." # "x" # !shl(1, Num);
+
+ string record = !subst(".", "_",
+ !subst("llvm.", "int_", intr));
+}
+
+
+class NVVM_TCGEN05_LDST_ACCESS_SIZE<string Shape, int Num> {
+ int shift = !cond(!eq(Shape, "16x128b"): 1,
+ !eq(Shape, "16x256b"): 2,
+ true : 0);
+
+ int veclen = !shl(1, !add(Num, shift));
+
+ int valid = !le(veclen, 128);
+ LLVMType type = !cond(!eq(veclen, 1): llvm_i32_ty,
+ !eq(veclen, 2): llvm_v2i32_ty,
+ !eq(veclen, 4): llvm_v4i32_ty,
+ !eq(veclen, 8): llvm_v8i32_ty,
+ !eq(veclen, 16): llvm_v16i32_ty,
+ !eq(veclen, 32): llvm_v32i32_ty,
+ !eq(veclen, 64): llvm_v64i32_ty,
+ !eq(veclen, 128): llvm_v128i32_ty,
+ true : llvm_void_ty);
+}
+
let TargetPrefix = "nvvm" in {
def int_nvvm_prmt : ClangBuiltin<"__nvvm_prmt">,
DefaultAttrsIntrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
@@ -5138,4 +5167,40 @@ def int_nvvm_tcgen05_fence_before_thread_sync : Intrinsic<[], [],
def int_nvvm_tcgen05_fence_after_thread_sync : Intrinsic<[], [],
[IntrNoMem, IntrHasSideEffects]>;
+// Tcgen05 ld
+class NVVM_TCGEN05_LD<string Shape, int Num> :
+ Intrinsic<[NVVM_TCGEN05_LDST_ACCESS_SIZE<Shape, Num>.type],
+ !listconcat([llvm_tmem_ptr_ty],
+ !if(!eq(Shape, "16x32bx2"), [llvm_i64_ty], []),
+ [llvm_i1_ty]),
+ !listconcat([IntrConvergent, IntrArgMemOnly, NoCapture<ArgIndex<0>>],
+ !if(!eq(Shape, "16x32bx2"),
+ [ImmArg<ArgIndex<1>>, ImmArg<ArgIndex<2>>],
+ [ImmArg<ArgIndex<1>>])),
+ NVVM_TCGEN05_LDST_NAME<"ld", Shape, Num>.intr>;
+
+// Tcgen05 st
+class NVVM_TCGEN05_ST<string Shape, int Num> :
+ Intrinsic<[],
+ !listconcat([llvm_tmem_ptr_ty],
+ !if(!eq(Shape, "16x32bx2"), [llvm_i64_ty], []),
+ [NVVM_TCGEN05_LDST_ACCESS_SIZE<Shape, Num>.type],
+ [llvm_i1_ty]),
+ !listconcat([IntrConvergent, IntrArgMemOnly, NoCapture<ArgIndex<0>>],
+ !if(!eq(Shape, "16x32bx2"),
+ [ImmArg<ArgIndex<1>>, ImmArg<ArgIndex<3>>],
+ [ImmArg<ArgIndex<2>>])),
+ NVVM_TCGEN05_LDST_NAME<"st", Shape, Num>.intr>;
+
+foreach shape = ["16x64b", "16x128b", "16x256b", "32x32b", "16x32bx2"] in {
+ foreach num = !range(0, 8) in {
+ if NVVM_TCGEN05_LDST_ACCESS_SIZE<shape, num>.valid then {
+ def NVVM_TCGEN05_LDST_NAME<"ld", shape, num>.record:
+ NVVM_TCGEN05_LD<shape, num>;
+ def NVVM_TCGEN05_LDST_NAME<"st", shape, num>.record:
+ NVVM_TCGEN05_ST<shape, num>;
+ }
+ }
+}
+
} // let TargetPrefix = "nvvm"
diff --git a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
index e96c1758676a125..4a51316c2be665d 100644
--- a/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
+++ b/llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp
@@ -203,6 +203,109 @@ void NVPTXDAGToDAGISel::Select(SDNode *N) {
SelectCode(N);
}
+#define TCGEN05_LD_OPCODE(SHAPE, NUM) \
+ (enablePack ? NVPTX::TCGEN05_LD_##SHAPE##_##NUM##_PACK \
+ : NVPTX::TCGEN05_LD_##SHAPE##_##NUM)
+
+static unsigned getTcgen05LdOpcode(unsigned IID, bool enablePack) {
+ switch (IID) {
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x1:
+ return TCGEN05_LD_OPCODE(16x64b, x1);
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x2:
+ return TCGEN05_LD_OPCODE(16x64b, x2);
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x4:
+ return TCGEN05_LD_OPCODE(16x64b, x4);
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x8:
+ return TCGEN05_LD_OPCODE(16x64b, x8);
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x16:
+ return TCGEN05_LD_OPCODE(16x64b, x16);
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x32:
+ return TCGEN05_LD_OPCODE(16x64b, x32);
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x64:
+ return TCGEN05_LD_OPCODE(16x64b, x64);
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x128:
+ return TCGEN05_LD_OPCODE(16x64b, x128);
+ case Intrinsic::nvvm_tcgen05_ld_16x128b_x1:
+ return TCGEN05_LD_OPCODE(16x128b, x1);
+ case Intrinsic::nvvm_tcgen05_ld_16x128b_x2:
+ return TCGEN05_LD_OPCODE(16x128b, x2);
+ case Intrinsic::nvvm_tcgen05_ld_16x128b_x4:
+ return TCGEN05_LD_OPCODE(16x128b, x4);
+ case Intrinsic::nvvm_tcgen05_ld_16x128b_x8:
+ return TCGEN05_LD_OPCODE(16x128b, x8);
+ case Intrinsic::nvvm_tcgen05_ld_16x128b_x16:
+ return TCGEN05_LD_OPCODE(16x128b, x16);
+ case Intrinsic::nvvm_tcgen05_ld_16x128b_x32:
+ return TCGEN05_LD_OPCODE(16x128b, x32);
+ case Intrinsic::nvvm_tcgen05_ld_16x128b_x64:
+ return TCGEN05_LD_OPCODE(16x128b, x64);
+ case Intrinsic::nvvm_tcgen05_ld_16x256b_x1:
+ return TCGEN05_LD_OPCODE(16x256b, x1);
+ case Intrinsic::nvvm_tcgen05_ld_16x256b_x2:
+ return TCGEN05_LD_OPCODE(16x256b, x2);
+ case Intrinsic::nvvm_tcgen05_ld_16x256b_x4:
+ return TCGEN05_LD_OPCODE(16x256b, x4);
+ case Intrinsic::nvvm_tcgen05_ld_16x256b_x8:
+ return TCGEN05_LD_OPCODE(16x256b, x8);
+ case Intrinsic::nvvm_tcgen05_ld_16x256b_x16:
+ return TCGEN05_LD_OPCODE(16x256b, x16);
+ case Intrinsic::nvvm_tcgen05_ld_16x256b_x32:
+ return TCGEN05_LD_OPCODE(16x256b, x32);
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x1:
+ return TCGEN05_LD_OPCODE(16x32bx2, x1);
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x2:
+ return TCGEN05_LD_OPCODE(16x32bx2, x2);
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x4:
+ return TCGEN05_LD_OPCODE(16x32bx2, x4);
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x8:
+ return TCGEN05_LD_OPCODE(16x32bx2, x8);
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x16:
+ return TCGEN05_LD_OPCODE(16x32bx2, x16);
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x32:
+ return TCGEN05_LD_OPCODE(16x32bx2, x32);
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x64:
+ return TCGEN05_LD_OPCODE(16x32bx2, x64);
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x128:
+ return TCGEN05_LD_OPCODE(16x32bx2, x128);
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x1:
+ return TCGEN05_LD_OPCODE(32x32b, x1);
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x2:
+ return TCGEN05_LD_OPCODE(32x32b, x2);
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x4:
+ return TCGEN05_LD_OPCODE(32x32b, x4);
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x8:
+ return TCGEN05_LD_OPCODE(32x32b, x8);
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x16:
+ return TCGEN05_LD_OPCODE(32x32b, x16);
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x32:
+ return TCGEN05_LD_OPCODE(32x32b, x32);
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x64:
+ return TCGEN05_LD_OPCODE(32x32b, x64);
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x128:
+ return TCGEN05_LD_OPCODE(32x32b, x128);
+ }
+ llvm_unreachable("unhandled tcgen05.ld lowering");
+}
+
+void NVPTXDAGToDAGISel::SelectTcgen05Ld(SDNode *N, bool hasOffset) {
+ SDLoc DL(N);
+ unsigned IID = cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();
+
+ if (hasOffset) {
+ bool enablePack = cast<ConstantSDNode>(N->getOperand(4))->getZExtValue();
+ auto OffsetNode = CurDAG->getTargetConstant(
+ cast<ConstantSDNode>(N->getOperand(3))->getZExtValue(), DL, MVT::i32);
+ ReplaceNode(N, CurDAG->getMachineNode(
+ getTcgen05LdOpcode(IID, enablePack), DL, N->getVTList(),
+ {N->getOperand(2), OffsetNode, N->getOperand(0)}));
+ } else {
+ bool enablePack = cast<ConstantSDNode>(N->getOperand(3))->getZExtValue();
+ ReplaceNode(N, CurDAG->getMachineNode(
+ getTcgen05LdOpcode(IID, enablePack), DL, N->getVTList(),
+ {N->getOperand(2), N->getOperand(0)}));
+ }
+}
+
bool NVPTXDAGToDAGISel::tryIntrinsicChain(SDNode *N) {
unsigned IID = N->getConstantOperandVal(1);
switch (IID) {
@@ -212,6 +315,51 @@ bool NVPTXDAGToDAGISel::tryIntrinsicChain(SDNode *N) {
case Intrinsic::nvvm_ldu_global_i:
case Intrinsic::nvvm_ldu_global_p:
return tryLDGLDU(N);
+
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x1:
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x2:
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x4:
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x8:
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x16:
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x32:
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x64:
+ case Intrinsic::nvvm_tcgen05_ld_16x64b_x128:
+  case Intrinsic::nvvm_tcgen05_ld_16x128b_x1:
+  case Intrinsic::nvvm_tcgen05_ld_16x128b_x2:
+  case Intrinsic::nvvm_tcgen05_ld_16x128b_x4:
+  case Intrinsic::nvvm_tcgen05_ld_16x128b_x8:
+  case Intrinsic::nvvm_tcgen05_ld_16x128b_x16:
+  case Intrinsic::nvvm_tcgen05_ld_16x128b_x32:
+  case Intrinsic::nvvm_tcgen05_ld_16x128b_x64:
+  case Intrinsic::nvvm_tcgen05_ld_16x256b_x1:
+  case Intrinsic::nvvm_tcgen05_ld_16x256b_x2:
+  case Intrinsic::nvvm_tcgen05_ld_16x256b_x4:
+  case Intrinsic::nvvm_tcgen05_ld_16x256b_x8:
+  case Intrinsic::nvvm_tcgen05_ld_16x256b_x16:
+  case Intrinsic::nvvm_tcgen05_ld_16x256b_x32:
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x1:
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x2:
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x4:
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x8:
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x16:
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x32:
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x64:
+ case Intrinsic::nvvm_tcgen05_ld_32x32b_x128: {
+ SelectTcgen05Ld(N);
+ return true;
+ }
+
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x1:
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x2:
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x4:
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x8:
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x16:
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x32:
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x64:
+ case Intrinsic::nvvm_tcgen05_ld_16x32bx2_x128: {
+ SelectTcgen05Ld(N, /* hasOffset */ true);
+ return true;
+ }
}
}
@@ -3227,6 +3375,115 @@ void NVPTXDAGToDAGISel::SelectCpAsyncBulkPrefetchL2(SDNode *N) {
ReplaceNode(N, CurDAG->getMachineNode(Opcode, DL, N->getVTList(), Ops));
}
+#define TCGEN05_ST_OPCODE(SHAPE, NUM) \
+ (enableUnpack ? NVPTX::TCGEN05_ST_##SHAPE##_##NUM##_UNPACK \
+ : NVPTX::TCGEN05_ST_##SHAPE##_##NUM)
+
+static unsigned getTcgen05StOpcode(unsigned IID, bool enableUnpack) {
+ switch (IID) {
+ case Intrinsic::nvvm_tcgen05_st_16x64b_x1:
+ return TCGEN05_ST_OPCODE(16x64b, x1);
+ case Intrinsic::nvvm_tcgen05_st_16x64b_x2:
+ return TCGEN05_ST_OPCODE(16x64b, x2);
+ case Intrinsic::nvvm_tcgen05_st_16x64b_x4:
+ return TCGEN05_ST_OPCODE(16x64b, x4);
+ case Intrinsic::nvvm_tcgen05_st_16x64b_x8:
+ return TCGEN05_ST_OPCODE(16x64b, x8);
+ case Intrinsic::nvvm_tcgen05_st_16x64b_x16:
+ return TCGEN05_ST_OPCODE(16x64b, x16);
+ case Intrinsic::nvvm_tcgen05_st_16x64b_x32:
+ return TCGEN05_ST_OPCODE(16x64b, x32);
+ case Intrinsic::nvvm_tcgen05_st_16x64b_x64:
+ return TCGEN05_ST_OPCODE(16x64b, x64);
+ case Intrinsic::nvvm_tcgen05_st_16x64b_x128:
+ return TCGEN05_ST_OPCODE(16x64b, x128);
+ case Intrinsic::nvvm_tcgen05_st_16x128b_x1:
+ return TCGEN05_ST_OPCODE(16x128b, x1);
+ case Intrinsic::nvvm_tcgen05_st_16x128b_x2:
+ return TCGEN05_ST_OPCODE(16x128b, x2);
+ case Intrinsic::nvvm_tcgen05_st_16x128b_x4:
+ return TCGEN05_ST_OPCODE(16x128b, x4);
+ case Intrinsic::nvvm_tcgen05_st_16x128b_x8:
+ return TCGEN05_ST_OPCODE(16x128b, x8);
+ case Intrinsic::nvvm_tcgen05_st_16x128b_x16:
+ return TCGEN05_ST_OPCODE(16x128b, x16);
+ case Intrinsic::nvvm_tcgen05_st_16x128b_x32:
+ return TCGEN05_ST_OPCODE(16x128b, x32);
+ case Intrinsic::nvvm_tcgen05_st_16x128b_x64:
+ return TCGEN05_ST_OPCODE(16x128b, x64);
+ case Intrinsic::nvvm_tcgen05_st_16x256b_x1:
+ return TCGEN05_ST_OPCODE(16x256b, x1);
+ case Intrinsic::nvvm_tcgen05_st_16x256b_x2:
+ return TCGEN05_ST_OPCODE(16x256b, x2);
+ case Intrinsic::nvvm_tcgen05_st_16x256b_x4:
+ return TCGEN05_ST_OPCODE(16x256b, x4);
+ case Intrinsic::nvvm_tcgen05_st_16x256b_x8:
+ return TCGEN05_ST_OPCODE(16x256b, x8);
+ case Intrinsic::nvvm_tcgen05_st_16x256b_x16:
+ return TCGEN05_ST_OPCODE(16x256b, x16);
+ case Intrinsic::nvvm_tcgen05_st_16x256b_x32:
+ return TCGEN05_ST_OPCODE(16x256b, x32);
+ case Intrinsic::nvvm_tcgen05_st_16x32bx2_x1:
+ return TCGEN05_ST_OPCODE(16x32bx2, x1);
+ case Intrinsic::nvvm_tcgen05_st_16x32bx2_x2:
+ return TCGEN05_ST_OPCODE(16x32bx2, x2);
+ case Intrinsic::nvvm_tcgen05_st_16x32bx2_x4:
+ return TCGEN05_ST_OPCODE(16x32bx2, x4);
+ case Intrinsic::nvvm_tcgen05_st_16x32bx2_x8:
+ return TCGEN05_ST_OPCODE(16x32bx2, x8);
+ case Intrinsic::nvvm_tcgen05_st_16x32bx2_x16:
+ return TCGEN05_ST_OPCODE(16x32bx2, x16);
+ case Intrinsic::nvvm_tcgen05_st_16x32bx2_x32:
+ return TCGEN05_ST_OPCODE(16x32bx2, x32);
+ case Intrinsic::nvvm_tcgen05_st_16x32bx2_x64:
+ return TCGEN05_ST_OPCODE(16x32bx2, x64);
+ case Intrinsic::nvvm_tcgen05_st_16x32bx2_x128:
+ return TCGEN05_ST_OPCODE(16x32bx2, x128);
+ case Intrinsic::nvvm_tcgen05_st_32x32b_x1:
+ return TCGEN05_ST_OPCODE(32x32b, x1);
+ case Intrinsic::nvvm_tcgen05_st_32x32b_x2:
+ return TCGEN05_ST_OPCODE(32x32b, x2);
+ case Intrinsic::nvvm_tcgen05_st_32x32b_x4:
+ return TCGEN05_ST_OPCODE(32x32b, x4);
+ case Intrinsic::nvvm_tcgen05_st_32x32b_x8:
+ return TCGEN05_ST_OPCODE(32x32b, x8);
+ case Intrinsic::nvvm_tcgen05_st_32x32b_x16:
+ return TCGEN05_ST_OPCODE(32x32b, x16);
+ case Intrinsic::nvvm_tcgen05_st_32x32b_x32:
+ return TCGEN05_ST_OPCODE(32x32b, x32);
+ case Intrinsic::nvvm_tcgen05_st_32x32b_x64:
+ return TCGEN05_ST_OPCODE(32x32b, x64);
+ case Intrinsic::nvvm_tcgen05_st_32x32b_x128:
+ return TCGEN05_ST_OPCODE(32x32b, x128);
+ }
+ llvm_unreachable("unhandled tcgen05.st lowering");
+}
+
+void NVPTXDAGToDAGISel::SelectTcgen05St(SDNode *N, bool hasOffset) {
+ SDLoc DL(N);
+ unsigned IID = cast<ConstantSDNode>(N->get...
[truncated]
``````````
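The register-count table in the documentation follows directly from the `NVVM_TCGEN05_LDST_ACCESS_SIZE` logic in the TableGen hunk: `16x128b` doubles and `16x256b` quadruples the base register count, and combinations needing more than 128 registers are not exposed. A short Python sketch (illustrative only, `tcgen05_reg_count` is a hypothetical helper, not part of the patch) mirrors that computation:

```python
# Number of 32-bit result registers for a tcgen05.ld/st access, mirroring
# the NVVM_TCGEN05_LDST_ACCESS_SIZE class in IntrinsicsNVVM.td.
def tcgen05_reg_count(shape: str, num: int):
    # 16x128b doubles, 16x256b quadruples the base register count.
    shift = {"16x128b": 1, "16x256b": 2}.get(shape, 0)
    veclen = 1 << (num + shift)  # num is log2 of the xN repeat factor
    # Combinations beyond 128 registers are invalid (the NA table entries).
    return veclen if veclen <= 128 else None

assert tcgen05_reg_count("16x64b", 1) == 2     # x2  -> 2 registers
assert tcgen05_reg_count("16x256b", 3) == 32   # x8  -> 32 registers
assert tcgen05_reg_count("16x256b", 6) is None # x64 of 16x256b is NA
```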
</details>
https://github.com/llvm/llvm-project/pull/126740