[llvm] [LLVM][NVPTX] Add codegen support for tcgen05.{ld, st} instructions (PR #126740)

Fri Feb 14 10:53:00 PST 2025

================
@@ -1175,6 +1175,101 @@ For more information, refer to the PTX ISA
 `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensorcore-5th-generation-instructions-tcgen05-fence>`_.
 
 
+'``llvm.nvvm.tcgen05.ld.*``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare <n x i32> @llvm.nvvm.tcgen05.ld.<shape>.<num>(ptr addrspace(6) %tmem_addr, i1 %pack)
+
+  declare <n x i32> @llvm.nvvm.tcgen05.ld.16x32bx2.<num>(ptr addrspace(6) %tmem_addr, i64 %offset, i1 %pack)
----------------
Artem-B wrote:

A single vector return type is more convenient, but the instruction does have two outputs, so we would need to stitch them together into a vector. I don't think a TableGen DAG pattern can represent a diamond, so that would need custom lowering.
If the vector has a legal single-register representation, and the instruction accepts that register as a destination, then we could use TableGen. I believe we already have patterns for v2f16 passed via an i32 register, so pattern matching should work on vectors in principle.

Returning two values is also not a big deal. Granted, it's somewhat unusual, but for higher-level uses we could provide some sort of wrapper that stitches the two scalar f32 results into a v2f32.
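
To make the wrapper idea concrete, here is a minimal LLVM IR sketch. It assumes the two scalar results arrive as a `{ float, float }` aggregate; the helper name `@stitch_v2f32` is hypothetical and is not part of this PR or of any existing intrinsic.

```llvm
; Hypothetical helper, for illustration only: packs two scalar f32 results
; (modelled here as a { float, float } aggregate) into a <2 x float>.
define <2 x float> @stitch_v2f32({ float, float } %pair) {
  ; Pull the two scalar outputs out of the aggregate.
  %lo = extractvalue { float, float } %pair, 0
  %hi = extractvalue { float, float } %pair, 1
  ; Build the vector one lane at a time, starting from poison.
  %v0 = insertelement <2 x float> poison, float %lo, i32 0
  %v1 = insertelement <2 x float> %v0, float %hi, i32 1
  ret <2 x float> %v1
}
```

A wrapper along these lines would keep the two-result form at the instruction level while still giving higher-level users a single vector value.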


https://github.com/llvm/llvm-project/pull/126740