[Mlir-commits] [mlir] [NFC][mlir][AMDGPU] Partition dialect .td into multiple files (PR #178562)

Wed Jan 28 17:47:28 PST 2026

llvmbot wrote:




@llvm/pr-subscribers-mlir

Author: Krzysztof Drewniak (krzysz00)

<details>
<summary>Changes</summary>

Follow the style of other dialects by having a distinct .td file for each category of thing (type, attribute, operation, enum) generated for the AMDGPU dialect.

No functional change is intended: the definitions were copy-pasted into the new files.
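
As a rough illustration of the layout described above, the sketch below shows how an umbrella `.td` file can shrink to an include guard plus a single include once each category lives in its own file. Only the general pattern (a path-style guard macro and an umbrella file that includes the ops file) comes from the diff in this message. Every `Example*` name, the three hypothetical file names, and the choice to have the ops file include the base file are assumptions made for illustration; they are not the contents of the new AMDGPU files, most of which fall outside the truncated patch.

```tablegen
// ---- ExampleBase.td (hypothetical): dialect definition shared by all files ----
#ifndef EXAMPLE_BASE_TD
#define EXAMPLE_BASE_TD

include "mlir/IR/OpBase.td"

def Example_Dialect : Dialect {
  let name = "example";
  let cppNamespace = "::mlir::example";
}

#endif // EXAMPLE_BASE_TD

// ---- ExampleOps.td (hypothetical): operation definitions only ----
#ifndef EXAMPLE_OPS_TD
#define EXAMPLE_OPS_TD

include "ExampleBase.td"
include "mlir/Interfaces/SideEffectInterfaces.td"

// Base class so every op is registered with the dialect from ExampleBase.td.
class Example_Op<string mnemonic, list<Trait> traits = []> :
  Op<Example_Dialect, mnemonic, traits> {}

// Placeholder op with no operands or results, just to make the file non-empty.
def Example_NopOp : Example_Op<"nop", [Pure]> {
  let summary = "Placeholder operation for this sketch";
  let assemblyFormat = "attr-dict";
}

#endif // EXAMPLE_OPS_TD

// ---- Example.td (hypothetical umbrella): kept so existing includes still work ----
#ifndef EXAMPLE_TD
#define EXAMPLE_TD

include "ExampleOps.td"

#endif // EXAMPLE_TD
```

In a layout like this, attribute, enum, and type files would each include the base file in the same way, so any one of them can be processed by `mlir-tblgen` without pulling in the full list of operations.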

---

Patch is 166.45 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/178562.diff


9 Files Affected:

- (modified) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td (+5-1794) 
- (added) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUAttrs.td (+50) 
- (added) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUBase.td (+118) 
- (modified) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUDialect.h (+1) 
- (added) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUEnums.td (+83) 
- (added) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td (+1544) 
- (added) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUTypes.td (+72) 
- (modified) mlir/include/mlir/Dialect/AMDGPU/IR/CMakeLists.txt (+2-2) 
- (modified) mlir/lib/Dialect/AMDGPU/IR/AMDGPUDialect.cpp (+1-1) 


``````````diff

diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
index 1f0e5cf7e7f56..94f8d37609230 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
@@ -1,4 +1,4 @@
-//===-- AMDGPU.td - AMDGPU dialect definitions *- tablegen -*------===//
+//===-- AMDGPU.td - AMDGPU dialect *- tablegen -*------------------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
@@ -6,1798 +6,9 @@
 //
 //===----------------------------------------------------------------------===//
 
-#ifndef AMDGPU
-#define AMDGPU
+#ifndef MLIR_DIALECT_AMDGPU_IR_AMDGPU_TD
+#define MLIR_DIALECT_AMDGPU_IR_AMDGPU_TD
 
-include "mlir/Interfaces/InferTypeOpInterface.td"
-include "mlir/Interfaces/SideEffectInterfaces.td"
-include "mlir/Interfaces/ViewLikeInterface.td"
-include "mlir/IR/EnumAttr.td"
-include "mlir/IR/Properties.td"
-include "mlir/IR/OpBase.td"
+include "mlir/Dialect/AMDGPU/IR/AMDGPUOps.td"
 
-def AMDGPU_Dialect : Dialect {
-  let name = "amdgpu";
-  let cppNamespace = "::mlir::amdgpu";
-  let description = [{
-    The `AMDGPU` dialect provides wrappers around AMD-specific functionality
-    and LLVM intrinsics. These wrappers should be used in conjunction with
-    more generic dialects, such as `gpu` and `vector`, when generating LLVM IR
-    that will eventually be executed on AMD hardware.
-
-    # What goes here?
-    In many cases, AMD GPU functionality can be accessed either though generic
-    operations (such as those in the `gpu`, `vector`, or `math`) or through
-    the `rocdl` dialect's intrinsic wrappers. However, there are instances where
-    AMD-specific functionally benefits from a wrapper around the underlying
-    LLVM intrinsics.
-
-    In general terms, operations or types should be added to this dialect when they
-    wrap some AMD-specific functionality in a way that makes it work better with the
-    MLIR ecosystem and its types or when those buitins would be needlessly
-    complex to work with (such as if they features magic constants at the LLVM level).
-
-    An additional set of operations that belong in this dialect are those that
-    have chipset-specific differences that can be abstracted over in a useful way.
-
-    To give some concrete examples:
-
-    - `amdgpu.mfma` and `amdgpu.wmma` exist in order to make a large set of
-      intrinsics more compatible with the MLIR type system (such as by allowing
-      8-bit float vectors to be passed as `vector<N x f8E4M3FN>` or
-      `vector<N x f8E4M2>` instead of as packed 32-bit integers whose element type
-      is controlled by separate operator-level constants. These operations also
-      allow the same `amdgpu.mfma` operation to be used regardless of the target
-      chip.
-    - `amdgpu.swizzle_bitmode` provides a wrapper around the `ds.swizzle` intrinsic,
-      allowing a wider range of types (such as `vector<2xf16>`) to be used natively
-      and eliminating the need to pack the and, or, and xor components using opaque
-      shifts.
-    - Operations like `amdgpu.gather_to_lds` provide `memref`-ized wrappers around
-      intrinsics that take a pointer, and are nontrivial enough to justify inclusion
-      in this dialect.
-
-
-    Note that simple intrinsics like `rocdl.sin` or `rocdl.s.barrier` should not
-    receive wrapper operations, as nothing is gained from the duplicate operation.
-    As a rule of thumb, if an operation's rewrite in AMDGPUToROCDL would be only
-    a `replaceOpWithNewOp` call, no AMDGPU dialect operation is needed.
-
-    # Design guidelines
-
-    Operations should leverage MLIR's "standard" types where possible. MLIR has
-    a more extensible type system than LLVM (especially in the area of small floats)
-    and those types should be used to create more ergonomic wrappers. In particular,
-    intrinsics that take pointers should have wrappers in this dialect that take
-    `memref` arguments and indices.
-
-    Operations should use properties or attributes in cases where the underlying
-    intrinsic uses `immarg`s (except in cases where that attribute can be represented
-    in the type system).
-
-    If it is possible to generalize the types of an operation, it should be done.
-    For example, the underlying operations for permutations and swizzles always
-    take 32-bit operands. Their AMDGPU wrappers can take any type, and will apply
-    padding and expansion to multiple instructions as needed. This makes these
-    operations easier to target because it hides the bitcasts and extracts
-    until the final lowering.
-
-    When the underlying operation uses magic constants, those should be presented
-    in a more programmer-friendly fashion, such as through enums or though
-    using separate arguments that are later combined. (For example, see the
-    design of the `amdgpu.dpp` and `amdgpu.fat_raw_buffer_cast` operations.)
-
-    If sufficiently similar functionality on multiple hardware generations can be
-    encapsulated into a single operation, it should be done. The lowering to
-    intrinsics should either throw an error when an unsupported capability is
-    used or ignore it. Which of these is two failure modes is more appropriate
-    depends on the nature of the feature, but errors are a safe default choice.
-
-    # Documentation guidelines
-
-    AMDGPU dialect operations should document how any abstractions they introduce
-    translate to LLVM intrinsics or hardware operations.
-
-    While documenting the semantics of the underlying operations is not required,
-    is preferred to provide an overview of the operation's functionality,
-    especially in cases where the documentation is widely distributed. Someone
-    looking at an AMDGPU dialect operation should be able to generally understand
-    what it does and have found the keywords they'll need for more detail.
-
-    Operation documentation should include usage examples.
-
-    Note that this dialect uses LLVM's gfx numbers to refer to individual
-    architectures/chipsets and not product names or codenames.
-  }];
-
-
-  let dependentDialects = [
-    "ROCDL::ROCDLDialect",
-    "arith::ArithDialect",
-    "gpu::GPUDialect"
-  ];
-  let useDefaultAttributePrinterParser = 1;
-  let useDefaultTypePrinterParser = 1;
-}
-
-def AnyIntegerOrFloat : AnyTypeOf<[AnySignlessInteger, AnyFloat], "Integer or Float">;
-
-def AnyIntegerOrFloatOr1DVector :
-  AnyTypeOf<[AnyIntegerOrFloat, FixedVectorOfRankAndType<[1], [AnyIntegerOrFloat]>]>;
-
-//===----------------------------------------------------------------------===//
-// AMDGPU general attribute definitions
-//===----------------------------------------------------------------------===//
-
-def AMDGPU_AddressSpace : I32EnumAttr<"AddressSpace",
-    "AMDGPU-specific address spaces",
-    [
-      I32EnumAttrCase<"FatRawBuffer",        0, "fat_raw_buffer">,
-      I32EnumAttrCase<"BufferRsrc",          1, "buffer_rsrc">,
-      I32EnumAttrCase<"FatStructuredBuffer", 2, "fat_structured_buffer">,
-    ]> {
-  let genSpecializedAttr = 0;
-  let cppNamespace = "::mlir::amdgpu";
-}
-
-def AMDGPU_AddressSpaceAttr : EnumAttr<AMDGPU_Dialect, AMDGPU_AddressSpace,
-    "address_space"> {
-  let description = [{
-    AMDGPU-specific memory spaces that may not have exact analogues on other
-    GPU targets or backends.
-
-    - `fat_raw_buffer` is the memory space used when a memref is stored as
-    as a "buffer fat pointer" - that is, a buffer resource (that is set up to
-    use raw byte-level indexing) along with its offset. The AMDGPU backend
-    implements `ptr addrspace(7)` to represent these fat pointers so that
-    buffer resources (which allow advanced features like bounds checking or
-    cache swizzling) can be used like ordinary LLVM pointers or memrefs.
-    See also the `fat_raw_buffer_cast` operation
-    - `buffer_rsrc` is the memory space for `ptr addrspace(8)`, representing a
-    buffer resource. It should not be used for memrefs, since it does not support
-    indexing
-    - `fat_structured_buffer` represents `ptr addrspace(9)`, a buffer resource
-    that carries both an index and offset field, which are used for complex
-    structured indexing that is primarily seen in graphics applications. This
-    is also incompatible with the simple indexing model supported by memref.
-  }];
-  let assemblyFormat = "`<` $value `>`";
-}
-
-//===----------------------------------------------------------------------===//
-// AMDGPU Type definitions
-//===----------------------------------------------------------------------===//
-
-class AMDGPU_Type<string name, string typeMnemonic, list<Trait> traits = []>
-    : TypeDef<AMDGPU_Dialect, name, traits> {
-  let mnemonic = typeMnemonic;
-}
-
-def AMDGPU_TDMBaseType : AMDGPU_Type<"TDMBase", "tdm_base"> {
-  let summary = "Pair of base addresses that move data between LDS and global storage.";
-  let description = [{
-    This type is opaque and it is used to represent a struct of two addresses.
-    One address is in LDS while the other is in global memory.
-
-    The value defined by this operation is only intended to be used by
-    amdgpu.tdm_make_descriptor.
-  }];
-  let parameters = (ins "Type":$elementType);
-  let builders = [
-    TypeBuilderWithInferredContext<(ins "Type":$elementType), [{
-      return $_get(elementType.getContext(), elementType);
-    }]>
-  ];
-  let assemblyFormat = "`<` $elementType `>`";
-}
-
-def AMDGPU_TDMGatherBaseType : AMDGPU_Type<"TDMGatherBase", "tdm_gather_base"> {
-  let summary = "Pair of base addresses that move data between LDS and global storage.";
-  let description = [{
-    This type is opaque and it is used to represent a struct of two addresses.
-    One address is in LDS while the other is in global memory.
-
-    This operation is similar to amdgpu.tdm_make_base but intended to be
-    used in gather mode.
-
-    The value defined by this operation is only intended to be used by
-    amdgpu.tdm_make_gather_descriptor.
-  }];
-  let parameters = (ins "Type":$elementType, "Type":$indexType);
-  let builders = [
-    TypeBuilderWithInferredContext<(ins "Type":$elementType, "Type": $indexType), [{
-      return $_get(elementType.getContext(), elementType, indexType);
-    }]>
-  ];
-  let assemblyFormat = "`<` $elementType `,` $indexType`>`";
-  let genVerifyDecl = 1;
-}
-
-def AMDGPU_TDMDescriptorType : AMDGPU_Type<"TDMDescriptor", "tdm_descriptor"> {
-  let summary = "Descriptors used in tensor store/load operations.";
-  let description = [{
-    This type is opaque and corresponds to the two or four descriptor groups
-    used in tensor_load_to_lds or tensor_store_from_lds.
-  }];
-}
-
-class AMDGPU_ConcreteVector<Type elem, int length> :
-  FixedVectorOfLengthAndType<[length], [elem]>,
-  BuildableType<
-    "::mlir::VectorType::get({" # length # "} ,"
-      # elem.builderCall # ")">;
-
-//===----------------------------------------------------------------------===//
-// AMDGPU Op definitions
-//===----------------------------------------------------------------------===//
-
-class AMDGPU_Op<string mnemonic, list<Trait> traits = []> :
-  Op<AMDGPU_Dialect, mnemonic, traits> {}
-
-def AMDGPU_ExtPackedFp8Op :
-    AMDGPU_Op<"ext_packed_fp8", [Pure]>,
-    Arguments<(ins AnyTypeOf<[F8E5M2FNUZ, F8E4M3FNUZ, F8E5M2, F8E4M3FN,
-        VectorOfLengthAndType<[1, 2, 3, 4], [F8E5M2FNUZ, F8E4M3FNUZ, F8E5M2, F8E4M3FN]>]>:$source,
-      ConfinedAttr<I32Attr, [IntNonNegative, IntMaxValue<3>]>:$index)>,
-    Results<(outs AnyTypeOf<[F32, FixedVectorOfLengthAndType<[2], [F32]>]>:$res)> {
-  let summary = "Extend a fp8 value to a float or a vector of packed fp8 values to two floats";
-
-  let description = [{
-    Extend one or two 8-bit floats in `source[index]` to a 32-bit float or
-    two floats and return them.
-
-    This rather unusual signature arises from the fact that AMD GPUs cannot
-    easily work with sub 32-bit quantities, so the compiler intrinsics for
-    extending 8-bit floats (which are, currently, the only way to work with
-    this operation) take packed vectors of 4 such floats.
-
-    If the passed-in vector has fewer than four elements, or the input is scalar,
-    the remaining values in the <4 x i8> will be filled with
-    undefined values as needed.
-  }];
-  let assemblyFormat = [{
-    attr-dict $source `[` $index `]` `:` type($source) `to` type($res)
-  }];
-}
-
-def AMDGPU_ScaledExtPackedMatrixOp
-    : AMDGPU_Op<"scaled_ext_packed_matrix", [Pure, AllShapesMatch<["source", "res"]>]>,
-      Arguments<(
-          ins AnyTypeOf<[FixedVectorOfShapeAndType<[8], F4E2M1FN>,
-                         FixedVectorOfShapeAndType<[8], F8E4M3FN>,
-                         FixedVectorOfShapeAndType<[8], F8E5M2>,
-                         FixedVectorOfShapeAndType<[16], F6E2M3FN>,
-                         FixedVectorOfShapeAndType<[16], F6E3M2FN>]>:$source,
-          FixedVectorOfShapeAndType<[4], F8E8M0FNU>:$scale,
-          ConfinedAttr<I32Attr, [IntIsOneOf<[16, 32]>]>:$blockSize,
-          ConfinedAttr<I32Attr, [IntIsOneOf<[0, 16]>]>:$firstScaleLane,
-          ConfinedAttr<I32Attr, [IntMinValue<0>, IntMaxValue<3>]>:$firstScaleByte)>,
-      Results<(
-          outs AnyTypeOf<[FixedVectorOfShapeAndType<[8], F32>,
-                          FixedVectorOfShapeAndType<[8], F16>,
-                          FixedVectorOfShapeAndType<[8], BF16>,
-                          FixedVectorOfShapeAndType<[16], F32>,
-                          FixedVectorOfShapeAndType<[16], F16>,
-                          FixedVectorOfShapeAndType<[16], BF16>]>:$res)> {
-
-  let summary = "Extend a wave-wide matrix of packed floating point values";
-
-  let description = [{
-    Extend matrix of microfloats (8 or 16 elements per lane) using a set of scales
-    that may be stored on other lanes.
-
-    The scales applied to the input microfloats are stored in bytes which
-    come from the `scales` input provided in a *half* of the wave identified
-    by `firstScaleLane`. The bytes used is selected by `firstScaleByte` and depends
-    on the type of `source`. The 16 vectors in consecutive lanes starting from
-    `firstScaleLane` (which we'll call the scale vectors) will be used by both
-    halves of the wave (with lane L reading from L % 16'th scale vector).
-
-    When `source` is either F4E2M1FN, F6E2M3FN, or F6E3M2FN each half of the
-    wave will use a different byte. The first one being `firstScaleByte` and
-    the second one being `firstScaleByte` + 1. When the block size is 32,
-    `firstScaleByte` can be either 0 or 2, selecting halves of the scale vectors.
-    Lanes 0-15 will read from `firstScaleByte` and lanes 16-31 will read
-    from `firstScaleByte` + 1.
-
-
-    For example:
-    ```mlir
-    // Input: 8-element vector of F8E4M3FN, converting to F32
-    // Lanes 0-15 read from byte 0, lanes 16-31 read from byte 1
-    %result = amdgpu.scaled_ext_packed_matrix %source scale(%scales)
-      blockSize(32) firstScaleLane(0) firstScaleByte(0)
-      : vector<8xf8E4M3FN>, vector<4xf8E8M0FNU> -> vector<8xf32>
-
-    // Input: 16-element vector of F6E2M3FN, converting to F16
-    // Lanes 0-15 read from byte 2, lanes 16-31 read from byte 3
-    %result = amdgpu.scaled_ext_packed_matrix %source scale(%scales)
-      blockSize(32) firstScaleLane(16) firstScaleByte(2)
-      : vector<16xf6E2M3FN>, vector<4xf8E8M0FNU> -> vector<16xf16>
-    ```
-
-    When `source` is either F4E2M1FN, F6E2M3FN, or F6E3M2FN and
-    the block size is 16, `firstScaleByte` can be 0 or 1.
-    Lanes 0-15 read from the `firstScaleByte`th element of the scale vectors,
-    while lanes 16-31 read from `firstScaleByte` + 2.
-    For example:
-    ```mlir
-    // Input: 8-element vector of F8E5M2, converting to BF16
-    // Lanes 0-15 read from byte 0, lanes 16-31 read from byte 2 (0+2)
-    %result = amdgpu.scaled_ext_packed_matrix %source scale(%scales)
-      blockSize(16) firstScaleLane(0) firstScaleByte(0)
-      : vector<8xf8E5M2>, vector<4xf8E8M0FNU> -> vector<8xbf16>
-
-    // Input: 16-element vector of F6E3M2FN, converting to F32
-    // Lanes 0-15 read from byte 1, lanes 16-31 read from byte 3 (1+2)
-    %result = amdgpu.scaled_ext_packed_matrix %source scale(%scales)
-      blockSize(16) firstScaleLane(16) firstScaleByte(1)
-      : vector<16xf6E3M2FN>, vector<4xf8E8M0FNU> -> vector<16xf32>
-    ```
-
-    Note: the layout for the scales generally mirrors how the WMMA
-    instructions use for matrix scales. These selection operands allows
-    one to choose portions of the matrix to convert.
-
-    When `source` is either F8E4M3FN or F8E5M2 and `blockSize` is 32,
-    then the same byte will be used by both halves of the wave.
-    In this case, `firstScaleByte` can be any value from 0 to 3.
-
-    When `source` is either F8E4M3FN or F8E5M2 and `blockSize` is 16,
-    following combinations are allowed:
-    * `firstScaleLane(0), firstScaleByte(0)`
-    * `firstScaleLane(16), firstScaleByte(2)`
-    all other combinations are reserved.
-
-    Available on gfx1250+.
-  }];
-
-  let assemblyFormat = [{
-    attr-dict $source
-    `scale` `(` $scale `)`
-    `blockSize` `(` $blockSize `)`
-    `firstScaleLane` `(` $firstScaleLane`)`
-    `firstScaleByte` `(` $firstScaleByte `)`
-    `:` type($source) `,` type($scale) `->` type($res)
-  }];
-
-  let hasVerifier = 1;
-
-}
-
-def AMDGPU_ScaledExtPackedOp
-    : AMDGPU_Op<"scaled_ext_packed", [Pure]>,
-      Arguments<(
-          ins AnyTypeOf<[VectorOfLengthAndType<[1, 2, 3, 4], [F8E5M2, F8E4M3FN]>,
-                         VectorOfLengthAndType<[1, 2, 3, 4, 5, 6, 7, 8],
-                                               [F4E2M1FN]>]>:$source,
-          F32:$scale,
-          ConfinedAttr<I32Attr, [IntNonNegative, IntMaxValue<7>]>:$index)>,
-      Results<(
-          outs AnyTypeOf<[FixedVectorOfLengthAndType<[2], [F32]>,
-                          FixedVectorOfLengthAndType<[2], [F16]>,
-                          FixedVectorOfLengthAndType<[2], [BF16]>]>:$res)> {
-  let summary = "Extend a vector of packed floating point values";
-
-  let description = [{
-    Extend and scale two packed floats in `source[index]` to two floats and
-    return them.
-
-    This rather unusual signature arises from the fact that AMD GPUs cannot
-    easily work with sub 32-bit quantities, so the compiler intrinsics for
-    extending 8-bit floats (which are, currently, the only way to work with
-    this operation) take packed vectors of 2 such floats.
-
-    If the passed-in vector has fewer than two elements, or the input is scalar,
-    the remaining values in the <2 x i8> will be filled with
-    undefined values as needed.
-  }];
-  let assemblyFormat = [{
-    attr-dict $source `[` $index `]` `,` $scale `:` type($source) `to` type($res)
-  }];
-}
-
-def AMDGPU_PackedTrunc2xFp8Op :
-    AMDGPU_Op<"packed_trunc_2xfp8", [Pure, AttrSizedOperandSegments]>,
-    Arguments<(ins F32:$sourceA,
-      Optional<F32>:$sourceB,
-      ConfinedAttr<I32Attr, [IntNonNegative, IntMaxValue<1>]>:$wordIndex,
-      Optional<FixedVectorOfLengthAndType<[4], [F8E4M3FNUZ, F8E5M2FNUZ, F8E4M3FN, F8E5M2]>>:$existing)>,
-    Results<(outs FixedVectorOfLengthAndType<[4], [F8E4M3FNUZ, F8E5M2FNUZ, F8E4M3FN, F8E5M2]>:$res)> {
-  let summary = "Round two floats into a packed vector of 8-bit floats";
-  let description = [{
-    Round the inputs `sourceA` and `sourceB` (which is undefined if not
-    specified) into the low or high word (bottom two or top two) elements
-    of the returned vector, keeping the other two elements of `existing`
-    unchanged if present (or undefined if it was not passed in).
-
-    The reason for this odd signature is that AMD GPUs cannot easily work with
-    sub-registers, and so the conversion intrinsics (which are currently the
-    only way to work with 8-bit float types) take packed vectors of 4 8-bit
-    values.
-  }];
-  let assemblyFormat = [{
-    attr-dict $sourceA `,` ($sourceB^):(`undef`)?
-    `into` ($existing^):(`undef`)? `[` `word` $wordIndex `]`
-    `:` type($sourceA) `to` type($res) (`into` type($existing)^)?
-  }];
-  let hasVerifier = 1;
-}
-
-def AMDGPU_PackedScaledTruncOp
-    : AMDGPU_Op<"packed_scaled_trunc", [Pure]>,
-      Arguments<(ins VectorOfLengthAndType<[1, 2], [F32, F16, BF16]>:$source,
-          F32:$scale,
-  ...
[truncated]

``````````

</details>


https://github.com/llvm/llvm-project/pull/178562