[llvm] [NVVM][NVPTX] Add support for tcgen05.mma (PR #151949)
Pradeep Kumar via llvm-commits
llvm-commits at lists.llvm.org
Tue Sep 23 06:13:11 PDT 2025
================
@@ -1945,6 +1945,464 @@ The last argument `i1 %unpack` is a compile-time constant which when set, indica
For more information, refer to the
`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-instructions-tcgen05-st>`__.
+tcgen05.mma Intrinsics
+----------------------
+
+Overview
+^^^^^^^^
+
+The `tcgen05.mma` operation of shape `M x N x K` performs matrix multiplication
+and accumulation of the form `D = A * B + D`, where:
+
+ - the `A` matrix has shape `M x K` and resides in either `Tensor Memory` or `Shared Memory`
+ - the `B` matrix has shape `K x N` and resides in the `Shared Memory` of the current CTA and, optionally, of the peer CTA
+ - the `D` matrix has shape `M x N` and resides in `Tensor Memory`
+
+Optionally, an input predicate (`%enable_inp_d`) can be used to disable reading
+the accumulator matrix, in which case the operation performed is `D = A * B`.
+The matrix multiplication and accumulation operations are categorized into various
+kinds based on input types and the throughput of the multiplication operation.
+The following table shows the different kinds of MMA operations that are supported:
+
++------------+--------------------------------------------+
+| .kind | Supported Input Types |
++============+============================================+
+| f16 | F16 and BF16 |
++------------+--------------------------------------------+
+| tf32 | TF32 |
++------------+--------------------------------------------+
+| f8f6f4 | All combinations of F8, F6, and F4 |
++------------+--------------------------------------------+
+| i8 | Signed and Unsigned 8-bit Integers |
++------------+--------------------------------------------+
+| mxf8f6f4 | MX-floating point formats |
++------------+--------------------------------------------+
+| mxf4 | MX-floating point formats (FP4) |
++------------+--------------------------------------------+
+| mxf4nvf4 | MXF4 + custom NVIDIA 4-bit floating point |
+| | (with common scaling factor) |
++------------+--------------------------------------------+
+
+`tcgen05.mma.sp` supports a sparse variant of the `A` matrix, where a matrix of
+shape `M x K` is stored in packed form as `M x (K / 2)` in memory. The
+`%spmetadata` operand specifies the mapping of the `K / 2` non-zero elements to
+the `K` elements before the MMA operation is performed.
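+
+A minimal sketch of the packed-to-dense mapping is shown below. It assumes,
+purely for illustration, that the metadata stores the dense column index of
+each packed element; the actual `%spmetadata` encoding is defined by the PTX ISA:
+
+.. code-block:: c++
+
+  #include <cstdint>
+  #include <vector>
+
+  // Expands one row of the packed A operand (K/2 values) into its dense
+  // K-wide form, placing each non-zero element at the column indicated by the
+  // metadata. The index-per-element scheme here is an assumption used only to
+  // illustrate the mapping.
+  std::vector<float> expand_sparse_row(const std::vector<float> &packed_row,
+                                       const std::vector<uint8_t> &metadata,
+                                       int K) {
+    std::vector<float> dense_row(K, 0.0f);
+    for (size_t i = 0; i < packed_row.size(); ++i)
+      dense_row[metadata[i]] = packed_row[i];
+    return dense_row;
+  }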
+
+`tcgen05.mma.block_scale` performs matrix multiplication with block scaling,
+`D = (A * scale_A) * (B * scale_B) + D`, where the input matrices are scaled as
+they are loaded from memory to form the `A` and `B` matrices before the MMA
+operation is performed. The scale factors for the `A` and `B` matrices need to
+be duplicated to all 32 lane partitions of Tensor Memory. The shapes of the
+`%scale_a` and `%scale_b` matrices depend on the `.scale_vectorsize`, which is
+described `here <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-scale-valid-comb>`__.
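+
+The following sketch illustrates the block-scaled accumulation; the scale-block
+width `BLOCK` along the `K` dimension and the row-major layouts are assumptions
+for illustration only (the real granularity is selected by `.scale_vectorsize`
+and the shapes of `%scale_a` / `%scale_b`):
+
+.. code-block:: c++
+
+  #include <vector>
+
+  // D = (A * scale_A) * (B * scale_B) + D, where each scale factor applies to
+  // a block of BLOCK consecutive elements along K. Illustrative only.
+  void mma_block_scale_reference(std::vector<float> &D,
+                                 const std::vector<float> &A,
+                                 const std::vector<float> &scale_A,
+                                 const std::vector<float> &B,
+                                 const std::vector<float> &scale_B,
+                                 int M, int N, int K, int BLOCK) {
+    for (int m = 0; m < M; ++m)
+      for (int n = 0; n < N; ++n) {
+        float acc = D[m * N + n];
+        for (int k = 0; k < K; ++k) {
+          float a = A[m * K + k] * scale_A[m * (K / BLOCK) + k / BLOCK];
+          float b = B[k * N + n] * scale_B[(k / BLOCK) * N + n];
+          acc += a * b;
+        }
+        D[m * N + n] = acc;
+      }
+  }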
+
+The sparsity metadata (`%spmetadata`) as well as the block-scale inputs for `A / B`
+matrices (`%scale_a` and `%scale_b`) reside in Tensor Memory.
+
+To facilitate opportunistic reuse of `A` / `B` matrix data across a sequence of
+MMA operations, the `A` / `B` matrices can be loaded into a collector buffer.
+The values of the collector-usage flags in the intrinsic
+(`%collector_usage_a_op_flag`, `%collector_usage_b_buffer_flag`, and
+`%collector_usage_b_op_flag`) specify the nature of the reuse.
+
+There are three kinds of matrix descriptors used by the tcgen05 family of instructions:
+
++----------------------------+-----------------------------------------------------------------------------------------------------------+-------------+
+| Descriptor | Description | Size (bits) |
++============================+===========================================================================================================+=============+
+| Shared Memory Descriptor | Describes properties of multiplicand matrix | |
+| | in shared memory, including its location | |
+| | within the CTA's shared memory. | 64 |
+| | `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-shared-memory-descriptor>`__ | |
++----------------------------+-----------------------------------------------+-------------+---------------------------------------------+-------------+
+| Instruction Descriptor | Describes shapes, types, and details of | |
+| | all matrices and the MMA operation. | 32 |
+| | `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-zero-column-mask-descriptor>`__ | |
++----------------------------+-----------------------------------------------+-------------+---------------------------------------------+-------------+
+| Zero-Column Mask Descriptor| Generates a mask specifying which columns of | |
+| | B matrix are zeroed in the MMA operation, | |
+| | regardless of values in shared memory. | 64 |
+| | Total mask size = N bits | |
+| | `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-instruction-descriptor>`__ | |
++----------------------------+-----------------------------------------------+-------------+---------------------------------------------+-------------+
+
+`tcgen05.mma` can be used for general matrix multiplication or for convolution
+operations. In the case of convolutions, the `activations` can be stored in
+either matrix `A` or matrix `B`, while the `weights` are stored in the other
+matrix.
+
+`tcgen05.mma` has an optional collector qualifier to specify whether an `A` or
+`B` matrix is new to the sequence and should be loaded, is unchanged within the
+sequence and should be reused, or is the last use in the sequence and should be
+discarded. The collector qualifier gives the TensorCore permission to reuse a
+previously loaded `A` or `B` matrix; however, the reuse is opportunistic, in
+that the TensorCore may reload a matrix even when it has permission to reuse
+it. Thus, the source memory of an `A` or `B` matrix must not be modified until
+the MMA instruction using those matrices has completed, regardless of the
+collector qualifier permissions.
+
+The `cta_group::1` qualifier specifies that the operation is performed on the
+Tensor Memory of the executing thread's CTA only. The `cta_group::2` qualifier
+specifies that the MMA operation is performed on the Tensor Memory of both the
+executing thread's CTA and its peer CTA.
+
+The vector operand `%disable_output_lane` specifies the lane(s) in the Tensor
+Memory that should not be updated with the resultant matrix `D`. The elements
+of the vector operand form a mask where each bit corresponds to a lane of the
+Tensor Memory, with the least significant bit of the first element of the
+vector (leftmost in syntax) corresponding to lane 0 of the Tensor Memory. If a
+bit in the mask is 1, then the corresponding lane in the Tensor Memory is not
+updated with the resultant matrix `D`.
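+
+A minimal sketch of how the mask could be interpreted, assuming 32-bit vector
+elements purely for illustration:
+
+.. code-block:: c++
+
+  #include <cstdint>
+  #include <vector>
+
+  // Returns true if the given Tensor Memory lane is disabled, i.e. the
+  // resultant matrix D must not be written to that lane. Bit 0 of the first
+  // vector element corresponds to lane 0.
+  bool lane_is_disabled(const std::vector<uint32_t> &disable_output_lane,
+                        unsigned lane) {
+    unsigned elem = lane / 32; // vector element holding this lane's bit
+    unsigned bit = lane % 32;  // bit position within that element
+    return (disable_output_lane[elem] >> bit) & 1u;
+  }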
+
+Intrinsic Design
+^^^^^^^^^^^^^^^^
+
+Given the broad feature set of the `tcgen05.mma` instruction, modeling it
+through intrinsics is highly complex. The following table outlines the large
+number of intrinsics required to fully support the `tcgen05.mma` instruction set.
----------------
schwarzschild-radius wrote:
I agree with you, Artem! Durga and I had a chat about it as well. We are trying to figure out the best way forward, as adding intrinsics consumes a lot of time compared to inline-asm for supporting complex PTX instructions
https://github.com/llvm/llvm-project/pull/151949