[llvm] [LLVM][NVPTX] Add codegen support for tcgen05.{ld, st} instructions (PR #126740)

Thu Feb 13 15:00:46 PST 2025

================
@@ -1175,6 +1175,101 @@ For more information, refer to the PTX ISA
 `<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensorcore-5th-generation-instructions-tcgen05-fence>`_.
 
 
+'``llvm.nvvm.tcgen05.ld.*``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare <n x i32> @llvm.nvvm.tcgen05.ld.<shape>.<num>(ptr addrspace(6) %tmem_addr, i1 %pack)
+
+  declare <n x i32> @llvm.nvvm.tcgen05.ld.16x32bx2.<num>(ptr addrspace(6) %tmem_addr, i64 %offset, i1 %pack)
----------------
Artem-B wrote:

I think the problem with your earlier patch is that the intrinsic is declared returning `llvm_v2i32_ty`, but the pattern wants to set two scalars.
```
def int_nvvm_tcgen05_ld_16x64b_x2 {     // SDPatternOperator Intrinsic NVVM_TCGEN05_LD
  list<LLVMType> RetTypes = [llvm_v2i32_ty];
  list<LLVMType> ParamTypes = [llvm_tmem_ptr_ty];
}
...
def TCGEN05_LD_16x64b_x2 {      // InstructionEncoding Instruction NVPTXInst Requires NVVM_TCGEN05_LD_INST
  dag OutOperandList = (outs Int32Regs:$r0, Int32Regs:$r1);
  dag InOperandList = (ins Int32Regs:$taddr);
  list<dag> Pattern = [(set Int32Regs:$r0, Int32Regs:$r1, (int_nvvm_tcgen05_ld_16x64b_x2 Int32Regs:$taddr))];
...
```

In order to work, intrinsic return type should've been `[llvm_i32_ty, llvm_i32_ty]`. Or, alternatively, the pattern would need to change so that it matches the single `llvm_v2i32_ty` value returned by the instruction. I think you want the former.

https://github.com/llvm/llvm-project/pull/126740