[llvm] [NVPTX] Add TMA bulk tensor reduction intrinsics (PR #116854)

Fri Nov 22 11:21:49 PST 2024

================
@@ -663,6 +663,84 @@ the same functionality as described in the ``tile`` mode intrinsics above.
 For more information, refer PTX ISA
 `<https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor>`_.
 
+'``llvm.nvvm.cp.async.bulk.tensor.reduce.tile.[1-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch, i8 %flag_red_op)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.2d(..., i32 %d0, i32 %d1, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+  declare void @llvm.nvvm.cp.async.bulk.tensor.reduce.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.reduce.tile.[1-5]d``' intrinsics
+correspond to the ``cp.reduce.async.bulk.tensor.[1-5]d.*`` set of PTX instructions.
+These instructions initiate an asynchronous reduction operation of tensor data
+in global memory with the tensor data in shared{::cta} memory, using ``tile`` mode.
+The dimension of the tensor data ranges from 1d to 5d with the coordinates
+specified by the ``i32 %d0 ... i32 %d4`` arguments.
+
+* The last two arguments to these intrinsics are flags.
+  These flag arguments must be compile-time constants. The backend
+  looks through these flags and lowers the intrinsics appropriately.
+
+* The Nth argument (denoted by ``i8 flag_red_op``) indicates the
+  kind of reduction operation performed. The argument must be in
+  the range [0, 7], representing the following reduction operations:
+
+  ========== =============
+  Enum Value  Reduction Op
+  ========== =============
+  ``0``       ADD
+  ``1``       MIN
+  ``2``       MAX
+  ``3``       INC
+  ``4``       DEC
+  ``5``       AND
+  ``6``       OR
+  ``7``       XOR
+  ========== =============
----------------
Artem-B wrote:

This particular argument should probably be part of the intrinsic name, as it describes *what* we want to do, not the parameters of the particular operation.



https://github.com/llvm/llvm-project/pull/116854