[llvm] [NVVM][NVPTX] Add support for tcgen05.mma (PR #151949)
Pradeep Kumar via llvm-commits
llvm-commits at lists.llvm.org
Tue Sep 23 06:13:11 PDT 2025
================
@@ -1945,6 +1945,464 @@ The last argument `i1 %unpack` is a compile-time constant which when set, indica
For more information, refer to the
`PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-instructions-tcgen05-st>`__.
+tcgen05.mma Intrinsics
+----------------------
+
+Overview
+^^^^^^^^
+
+The `tcgen05.mma` operation of shape `M x N x K` performs matrix multiplication
+and accumulation of the form `D = A * B + D`, where:
+
+ - the `A` matrix has shape `M x K` and resides in either `Tensor Memory` or `Shared Memory`
+ - the `B` matrix has shape `K x N` and resides in the `Shared Memory` of the current CTA and, optionally, of the peer CTA
+ - the `D` matrix has shape `M x N` and resides in `Tensor Memory`
+
+Optionally, an input predicate (`%enable_inp_d`) can be used to disable reading
+the accumulator matrix, in which case the operation performed is `D = A * B`.
+The matrix multiplication and accumulation operations are categorized into various
+kinds based on input types and the throughput of the multiplication operation.
+The following table shows the different kinds of MMA operations that are supported:
+
++------------+--------------------------------------------+
+| .kind | Supported Input Types |
++============+============================================+
+| f16 | F16 and BF16 |
++------------+--------------------------------------------+
+| tf32 | TF32 |
++------------+--------------------------------------------+
+| f8f6f4 | All combinations of F8, F6, and F4 |
++------------+--------------------------------------------+
+| i8 | Signed and Unsigned 8-bit Integers |
++------------+--------------------------------------------+
+| mxf8f6f4 | MX-floating point formats |
++------------+--------------------------------------------+
+| mxf4 | MX-floating point formats (FP4) |
++------------+--------------------------------------------+
+| mxf4nvf4 | MXF4 + custom NVIDIA 4-bit floating point |
+| | (with common scaling factor) |
++------------+--------------------------------------------+
+
+`tcgen05.mma.sp` supports a sparse variant of the `A` matrix, where a matrix of
+shape `M x K` is stored in packed form as `M x (K / 2)` in memory. The
+`%spmetadata` operand specifies the mapping of the `K / 2` non-zero elements to
+the `K` elements before the MMA operation is performed.
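+
+A minimal sketch of the packed-to-dense mapping is shown below. It assumes,
+purely for illustration, that the metadata stores the dense column index of
+each packed element; the actual `%spmetadata` encoding is defined by the PTX ISA:
+
+.. code-block:: c++
+
+  #include <cstdint>
+  #include <vector>
+
+  // Expands one row of the packed A operand (K/2 values) into its dense
+  // K-wide form, placing each non-zero element at the column indicated by the
+  // metadata. The index-per-element scheme here is an assumption used only to
+  // illustrate the mapping.
+  std::vector<float> expand_sparse_row(const std::vector<float> &packed_row,
+                                       const std::vector<uint8_t> &metadata,
+                                       int K) {
+    std::vector<float> dense_row(K, 0.0f);
+    for (size_t i = 0; i < packed_row.size(); ++i)
+      dense_row[metadata[i]] = packed_row[i];
+    return dense_row;
+  }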
+
+`tcgen05.mma.block_scale` performs matrix multiplication with block scaling,
+`D = (A * scale_A) * (B * scale_B) + D`, where the input matrices are scaled as
+they are loaded from memory to form the `A` and `B` matrices before the MMA
+operation is performed. The scale factors for the `A` and `B` matrices need to
+be duplicated to all 32 lane partitions of Tensor Memory. The shapes of the
+`%scale_a` and `%scale_b` matrices depend on the `.scale_vectorsize`, which is
+described `here <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-mma-scale-valid-comb>`__.
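+
+The following sketch illustrates the block-scaled accumulation; the scale-block
+width `BLOCK` along the `K` dimension and the row-major layouts are assumptions
+for illustration only (the real granularity is selected by `.scale_vectorsize`
+and the shapes of `%scale_a` / `%scale_b`):
+
+.. code-block:: c++
+
+  #include <vector>
+
+  // D = (A * scale_A) * (B * scale_B) + D, where each scale factor applies to
+  // a block of BLOCK consecutive elements along K. Illustrative only.
+  void mma_block_scale_reference(std::vector<float> &D,
+                                 const std::vector<float> &A,
+                                 const std::vector<float> &scale_A,
+                                 const std::vector<float> &B,
+                                 const std::vector<float> &scale_B,
+                                 int M, int N, int K, int BLOCK) {
+    for (int m = 0; m < M; ++m)
+      for (int n = 0; n < N; ++n) {
+        float acc = D[m * N + n];
+        for (int k = 0; k < K; ++k) {
+          float a = A[m * K + k] * scale_A[m * (K / BLOCK) + k / BLOCK];
+          float b = B[k * N + n] * scale_B[(k / BLOCK) * N + n];
+          acc += a * b;
+        }
+        D[m * N + n] = acc;
+      }
+  }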
+
+The sparsity metadata (`%spmetadata`) as well as the block-scale inputs for `A / B`
+matrices (`%scale_a` and `%scale_b`) reside in Tensor Memory.
+
+To facilitate opportunistic reuse of `A` / `B` matrix data across a sequence of
+MMA operations, the `A` / `B` matrices can be loaded into a collector buffer.
+The values of the collector-usage flags in the intrinsic
+(`%collector_usage_a_op_flag`, `%collector_usage_b_buffer_flag`, and
+`%collector_usage_b_op_flag`) specify the nature of the reuse.
+
+There are three kinds of matrix descriptors used by the tcgen05 family of instructions:
+
++----------------------------+-----------------------------------------------------------------------------------------------------------+-------------+
+| Descriptor | Description | Size (bits) |
++============================+===========================================================================================================+=============+
+| Shared Memory Descriptor | Describes properties of multiplicand matrix | |
+| | in shared memory, including its location | |
+| | within the CTA's shared memory. | 64 |
+| | `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-shared-memory-descriptor>`__ | |
++----------------------------+-----------------------------------------------+-------------+---------------------------------------------+-------------+
+| Instruction Descriptor | Describes shapes, types, and details of | |
+| | all matrices and the MMA operation. | 32 |
+| | `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-zero-column-mask-descriptor>`__ | |
++----------------------------+-----------------------------------------------+-------------+---------------------------------------------+-------------+
+| Zero-Column Mask Descriptor| Generates a mask specifying which columns of | |
+| | B matrix are zeroed in the MMA operation, | |
+| | regardless of values in shared memory. | 64 |
+| | Total mask size = N bits | |
+| | `PTX ISA <https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-instruction-descriptor>`__ | |
++----------------------------+-----------------------------------------------+-------------+---------------------------------------------+-------------+
+
+`tcgen05.mma` can be used for general matrix multiplication or for convolution
+operations. In the case of convolutions, the `activations` can be stored in
+either matrix `A` or matrix `B`, while the `weights` are stored in the other
+matrix.
+
+`tcgen05.mma` has an optional collector qualifier to specify whether an `A` or
+`B` matrix is new to the sequence and should be loaded, is unchanged within the
+sequence and should be reused, or is the last use in the sequence and should be
+discarded. The collector qualifier gives the TensorCore permission to reuse a
+previously loaded `A` or `B` matrix; however, the reuse is opportunistic, in
+that the TensorCore may reload a matrix even when it has permission to reuse
+it. Thus, the source memory of an `A` or `B` matrix must not be modified until
+the MMA instruction using those matrices has completed, regardless of the
+collector qualifier permissions.
+
+The `cta_group::1` qualifier specifies that the operation is performed on the
+Tensor Memory of the executing thread's CTA only. The `cta_group::2` qualifier
+specifies that the MMA operation is performed on the Tensor Memory of both the
+executing thread's CTA and its peer CTA.
+
+The vector operand `%disable_output_lane` specifies the lane(s) in the Tensor
+Memory that should not be updated with the resultant matrix `D`. The elements
+of the vector operand form a mask where each bit corresponds to a lane of the
+Tensor Memory, with the least significant bit of the first element of the
+vector (leftmost in syntax) corresponding to lane 0 of the Tensor Memory. If a
+bit in the mask is 1, then the corresponding lane in the Tensor Memory is not
+updated with the resultant matrix `D`.
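+
+A minimal sketch of how the mask could be interpreted, assuming 32-bit vector
+elements purely for illustration:
+
+.. code-block:: c++
+
+  #include <cstdint>
+  #include <vector>
+
+  // Returns true if the given Tensor Memory lane is disabled, i.e. the
+  // resultant matrix D must not be written to that lane. Bit 0 of the first
+  // vector element corresponds to lane 0.
+  bool lane_is_disabled(const std::vector<uint32_t> &disable_output_lane,
+                        unsigned lane) {
+    unsigned elem = lane / 32; // vector element holding this lane's bit
+    unsigned bit = lane % 32;  // bit position within that element
+    return (disable_output_lane[elem] >> bit) & 1u;
+  }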
+
+Intrinsic Design
+^^^^^^^^^^^^^^^^
+
+Given the broad feature set of the `tcgen05.mma` instruction, modeling it
+through intrinsics is highly complex. The following table outlines the large
+number of intrinsics required to fully support the `tcgen05.mma` instruction set.
----------------
schwarzschild-radius wrote:
I agree with you, Artem! Durga and I had a chat about it as well. We are trying to figure out the best way forward, as adding intrinsics consumes a lot of time compared to inline-asm for supporting complex PTX instructions
https://github.com/llvm/llvm-project/pull/151949