[Mlir-commits] [mlir] [MLIR][NVGPU] Adding `nvgpu.warpgroup.mma` Op for Hopper GPUs (PR #65440)
llvmlistbot at llvm.org
Wed Sep 13 07:06:27 PDT 2023
llvmbot wrote:
<!--LLVM PR SUMMARY COMMENT-->
@llvm/pr-subscribers-mlir-nvgpu
<details>
<summary>Changes</summary>
This work introduces a new operation called `warpgroup.mma` to the NVGPU dialect of MLIR. The purpose of this operation is to facilitate warpgroup-level matrix multiply and accumulate (WGMMA) operations on Hopper GPUs with sm_90a architecture.
Previously, the `nvvm.wgmma.mma_async` operation was introduced to support warpgroup-level matrix operations in the NVVM dialect. Multiple instances of `nvvm.wgmma.mma_async` are typically needed to achieve the desired shape. The new `nvgpu.warpgroup.mma` operation abstracts this complexity and provides a higher-level interface for performing warpgroup-level matrix operations.
The `nvgpu.warpgroup.mma` op does the following:
1) Corresponds to multiple `wgmma` instructions.
2) Iterates over the input matrix descriptors to achieve the desired computation shape.
3) Groups and runs the `wgmma` instructions asynchronously, and eventually waits for them. This is done with `wgmma.fence.aligned`, `wgmma.commit.group.sync.aligned`, and `wgmma.wait.group.sync.aligned`.
4) Returns the fragmented result matrices.
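To make the "multiple `wgmma` instructions" point concrete, here is a minimal sketch of how many instructions one op expands to. It assumes the tiling used by the lowering in this patch (each instruction covers 64 rows of M and the full N extent, with a K extent that depends on the input element type). `ElemKind`, `wgmmaShapeK`, and `countWgmmaInstructions` are hypothetical names for illustration only; the patch itself works on `mlir::Type` and currently passes f16 as the input element type.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tag for the input element type; the patch inspects mlir::Type.
enum class ElemKind { TF32, F16, BF16, F8, B1 };

// K extent of one wgmma instruction per input element kind, following the
// values listed in getWgmmaShape in the patch.
constexpr int wgmmaShapeK(ElemKind kind) {
  switch (kind) {
  case ElemKind::TF32:
    return 8;
  case ElemKind::F16:
  case ElemKind::BF16:
    return 16;
  case ElemKind::F8:
    return 32;
  case ElemKind::B1:
    return 256;
  }
  return 0; // not reached for valid enum values
}

// Number of wgmma instructions needed when A is sizeM x sizeK.
// Only M and K are tiled; N is covered by a single instruction.
inline int countWgmmaInstructions(int64_t sizeM, int64_t sizeK,
                                  ElemKind kind) {
  constexpr int64_t shapeM = 64;
  const int64_t shapeK = wgmmaShapeK(kind);
  assert(sizeM % shapeM == 0 && "M must be a multiple of 64");
  assert(sizeK % shapeK == 0 && "K must be a multiple of the wgmma K extent");
  return static_cast<int>((sizeM / shapeM) * (sizeK / shapeK));
}
```

Under these assumptions, a 128x64 f16 matrix A needs 2 tiles along M and 4 along K, so `countWgmmaInstructions(128, 64, ElemKind::F16)` yields 8.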
Here's an example usage of the `nvgpu.warpgroup.mma` operation:
```
%wgmmaResult, %wgmmaResult2 = nvgpu.warpgroup.mma %descA, %descB, %acc, group = 1 {transposeB}:
!nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>,
vector<128x128xf32>
-> !nvgpu.warpgroup.result<tensor = !llvm.struct<...>>,
!nvgpu.warpgroup.result<tensor = !llvm.struct<...>>
```
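For the shapes in this example (A is 128x64 f16, B is 64x128 f16), the descriptor stepping can be sketched as plain arithmetic. The formulas below are copied from the `iterateDescA` and `iterateDescB` lambdas in the lowering of this patch, which cover only row-major A and column-major B; the function names and the `kExclude4LSB` spelling are hypothetical, and the sketch uses 64-bit integers where the patch uses `int`.

```cpp
#include <cstdint>

// Matrix descriptors drop the 4 least-significant bits, so byte offsets are
// shifted right by 4 before being added to the descriptor.
constexpr int kExclude4LSB = 4;

// Value added to descriptor A for tile (iterM, iterK), as written in the
// patch. tileShapeA is dimension 1 of the memref behind descriptor A.
constexpr int64_t descAIncrement(int64_t sizeK, int64_t tileShapeA,
                                 int64_t wgmmaK, int64_t elemBytes,
                                 int64_t iterM, int64_t iterK) {
  return ((wgmmaK * iterK + sizeK * tileShapeA * iterM) * elemBytes) >>
         kExclude4LSB;
}

// Value added to descriptor B for step iterK, as written in the patch.
// dim0B is dimension 0 of the memref behind descriptor B.
constexpr int64_t descBIncrement(int64_t dim0B, int64_t wgmmaK,
                                 int64_t elemBytes, int64_t iterK) {
  return (dim0B * wgmmaK * iterK * elemBytes) >> kExclude4LSB;
}
```

With f16 inputs (2 bytes per element, K extent 16), one step along K advances descriptor A by `descAIncrement(64, 64, 16, 2, 0, 1)`, which is 2, and descriptor B by `descBIncrement(64, 16, 2, 1)`, which is 128. A zero increment means the lowering reuses the original descriptor without emitting an add.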
--
Patch is 36.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/65440.diff
7 Files Affected:
- (modified) mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td (+56)
- (modified) mlir/include/mlir/Dialect/NVGPU/IR/NVGPUDialect.h (+2)
- (modified) mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp (+163-3)
- (modified) mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp (+131-4)
- (modified) mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp (+15)
- (modified) mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir (+61-1)
- (modified) mlir/test/Dialect/NVGPU/invalid.mlir (+44)
<pre>
diff --git a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
index a3245bf9196eed1..90381648dac6acc 100644
--- a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
+++ b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
@@ -192,6 +192,19 @@ def NVGPU_WarpgroupMatrixDescriptor : NVGPU_Type<"WarpgroupMatrixDescriptor", "w
let assemblyFormat = "`<` struct(params) `>`";
}
+def NVGPU_WarpgroupAccumulator : NVGPU_Type<"WarpgroupAccumulator", "warpgroup.accumulator", []> {
+ let parameters = (ins "VectorType":$fragmented);
+ let assemblyFormat = "`<` struct(params) `>`";
+ let description = [{
+ This type represents the result matrix obtained from `nvgpu.warpgroup.mma`.
+ The `$fragmented` type signifies the distributed or fragmented result
+ vector that is collectively owned by all the threads in the warp-group
+ that executed `nvgpu.warpgroup.mma`.
+ [See the details of register fragment layout for accumulator matrix D]
+ (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d)
+ }];
+}
+
//===----------------------------------------------------------------------===//
// NVGPU Op Definitions
//===----------------------------------------------------------------------===//
@@ -664,5 +677,48 @@ def NVGPU_GenerateGmmaDescriptorOp : NVGPU_Op<"wgmma.generate.descriptor", []> {
let hasVerifier = 1;
}
+def NVGPU_WarpgroupMmaOp : NVGPU_Op<"warpgroup.mma"> {
+ let description = [{
+ The `nvgpu.warpgroup.mma` op performs the warpgroup-level (4 warps)
+ matrix-multiply-and-accumulate (mma) operation that results in
+ `nvvm.wgmma.mma_async`.
+
+ The operands are `descriptorA` and `descriptorB` that are wgmma matrix
+ descriptors that shows the properties of the matrix in shared memory. The
+ results are thread-level ownership to the warpgroup-level mma operation
+ shape. The shape is deduced from the descriptor types and output vector.
+
+ The Op corresponds multiple `nvvm.wgmma.mma_async` operations to complete the
+ given shape. As the instruction `nvvm.wgmma.async` is an asynchronous,
+ this Op groups the `nvvm.wgmma.async` and surrounds them between
+ `wgmma.fence.aligned` and `wgmma.commit.group.sync.aligned`,
+ `wgmma.wait.group.sync.aligned` Ops.
+
+ Example:
+ ```mlir
+ %r1,%r2 = nvgpu.warpgroup.mma %wgmmaDescA, %wgmmaDescB, %acc1, %acc2:
+ !nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>,
+ !nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>,
+ !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
+ !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
+ ->
+ !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
+ !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
+ ```
+ }];
+
+ let arguments = (ins NVGPU_WarpgroupMatrixDescriptor:$descriptorA,
+ NVGPU_WarpgroupMatrixDescriptor:$descriptorB,
+ DefaultValuedOptionalAttr<I32Attr, "1">:$waitGroup,
+ OptionalAttr<UnitAttr>:$transposeA,
+ OptionalAttr<UnitAttr>:$transposeB,
+ Variadic<NVGPU_WarpgroupAccumulator>:$matrixC);
+ let results = (outs Variadic<NVGPU_WarpgroupAccumulator>:$matrixD);
+ let assemblyFormat = [{
+ $descriptorA`,` $descriptorB`,` $matrixC attr-dict
+ `:` type($descriptorA) `,` type($descriptorB) `,` type($matrixC) `->` type($matrixD)
+ }];
+ let hasVerifier = 1;
+}
#endif // NVGPU
diff --git a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUDialect.h b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUDialect.h
index 192afcb2dba7913..96af26842dafea2 100644
--- a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUDialect.h
+++ b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPUDialect.h
@@ -21,6 +21,8 @@
#include "mlir/Dialect/NVGPU/IR/NVGPUEnums.h.inc"
+constexpr int kWarpSize = 32;
+
#define GET_ATTRDEF_CLASSES
#include "mlir/Dialect/NVGPU/IR/NVGPUAttrDefs.h.inc"
diff --git a/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp b/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
index b045089244ff1a7..046727e4ea9ab83 100644
--- a/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
+++ b/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
@@ -17,10 +17,12 @@
#include "mlir/Dialect/LLVMIR/NVVMDialect.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Dialect/NVGPU/IR/NVGPUDialect.h"
+#include "mlir/Dialect/SCF/Transforms/Patterns.h"
#include "mlir/IR/PatternMatch.h"
#include "mlir/IR/TypeUtilities.h"
#include "mlir/Pass/Pass.h"
#include "llvm/Support/Debug.h"
+#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/raw_ostream.h"
#define DEBUG_TYPE "nvgpu-to-nvvm"
@@ -34,6 +36,10 @@ namespace mlir {
using namespace mlir;
+/// Number of bits that needs to excluded when building matrix descriptor for
+/// wgmma operations.
+constexpr int exclude4LSB = 4;
+
/// GPU has 32 bit registers, this function truncates values when larger width
/// is not needed.
static Value truncToI32(ConversionPatternRewriter &rewriter, Location loc,
@@ -419,6 +425,15 @@ struct ConvertNVGPUToNVVMPass
converter.addConversion([&](nvgpu::DeviceAsyncTokenType type) -> Type {
return converter.convertType(IntegerType::get(type.getContext(), 32));
});
+ converter.addConversion([&](nvgpu::WarpgroupAccumulatorType type) -> Type {
+ VectorType vtype = type.getFragmented();
+ SmallVector<Type> structBody;
+ for (unsigned i = 0; i < vtype.getDimSize(0); i++)
+ structBody.push_back(vtype.getElementType());
+ auto convertedType =
+ LLVM::LLVMStructType::getLiteral(type.getContext(), structBody);
+ return converter.convertType(convertedType);
+ });
converter.addConversion([&](nvgpu::MBarrierTokenType type) -> Type {
return converter.convertType(IntegerType::get(type.getContext(), 64));
});
@@ -438,6 +453,8 @@ struct ConvertNVGPUToNVVMPass
target.addLegalDialect<::mlir::LLVM::LLVMDialect>();
target.addLegalDialect<::mlir::memref::MemRefDialect>();
target.addLegalDialect<::mlir::NVVM::NVVMDialect>();
+ mlir::scf::populateSCFStructuralTypeConversionsAndLegality(
+ converter, patterns, target);
if (failed(applyPartialConversion(getOperation(), target,
std::move(patterns))))
signalPassFailure();
@@ -984,10 +1001,9 @@ struct NVGPUGenerateGmmaDescriptorLowering
shiftLeft(val, startBit));
};
- int ex4LSB = 4;
int64_t sizeN = op.getTensorMap().getType().getTensor().getDimSize(0);
- uint64_t strideDimVal = (layout << 3) >> ex4LSB;
- uint64_t leadDimVal = (sizeN * layout) >> ex4LSB;
+ uint64_t strideDimVal = (layout << 3) >> exclude4LSB;
+ uint64_t leadDimVal = (sizeN * layout) >> exclude4LSB;
uint64_t offsetVal = 0;
Value strideDim = makeConst(strideDimVal);
@@ -1141,6 +1157,149 @@ struct NVGPUTmaCreateDescriptorOpLowering
}
};
+struct NVGPUWarpgroupMmaOpLowering
+ : public ConvertOpToLLVMPattern<nvgpu::WarpgroupMmaOp> {
+ using ConvertOpToLLVMPattern<nvgpu::WarpgroupMmaOp>::ConvertOpToLLVMPattern;
+
+ LogicalResult getWgmmaShape(int64_t sizeM, int64_t sizeN, Type inputElemType,
+ int &wgmmaShapeM, int &wgmmaShapeN,
+ int &wgmmaShapeK) const {
+ wgmmaShapeM = 64;
+ wgmmaShapeN = sizeN;
+ if (inputElemType.isTF32()) {
+ wgmmaShapeK = 8;
+ } else if (inputElemType.isF16() || inputElemType.isBF16()) {
+ wgmmaShapeK = 16;
+ } else if (inputElemType.isFloat8E4M3FN() || inputElemType.isFloat8E5M2() ||
+ inputElemType.isInteger(16)) {
+ wgmmaShapeK = 32;
+ } else if (inputElemType.isInteger(1)) {
+ wgmmaShapeK = 256;
+ } else {
+ llvm_unreachable("msg: not supported K shape");
+ }
+ LLVM_DEBUG(DBGS() << "Generating wgmma.mma.async shape[m = " << wgmmaShapeM
+ << ", n = " << wgmmaShapeN << ", k = " << wgmmaShapeK
+ << "]\n");
+ return success();
+ }
+
+ Value generateNVVMWgmmaOp(MLIRContext *ctx,
+ ConversionPatternRewriter &rewriter, Location loc,
+ int m, int n, int k, Type resultStructType,
+ Value inout, Value descriptorA,
+ Value descriptorB) const {
+ TypeRange resultTypes = {resultStructType};
+ auto shape = NVVM::MMAShapeAttr::get(ctx, m, n, k);
+ auto scaleOut = NVVM::WGMMAScaleOutAttr::get(ctx, NVVM::WGMMAScaleOut::one);
+ auto scaleIn = NVVM::WGMMAScaleInAttr::get(ctx, NVVM::WGMMAScaleIn::one);
+ auto layoutA = NVVM::MMALayoutAttr::get(ctx, NVVM::MMALayout::row);
+ auto layoutB = NVVM::MMALayoutAttr::get(ctx, NVVM::MMALayout::col);
+ // todo input type
+ auto itype = NVVM::WGMMATypesAttr::get(ctx, NVVM::WGMMATypes::f16);
+ auto overflow =
+ NVVM::MMAIntOverflowAttr::get(ctx, NVVM::MMAIntOverflow::wrapped);
+ Value res = rewriter.create<NVVM::WgmmaMmaAsyncOp>(
+ loc, resultTypes, inout, descriptorA, descriptorB, shape, itype, itype,
+ scaleOut, scaleIn, scaleIn, layoutA, layoutB, overflow);
+ return res;
+ }
+
+ LogicalResult
+ matchAndRewrite(nvgpu::WarpgroupMmaOp op, OpAdaptor adaptor,
+ ConversionPatternRewriter &rewriter) const override {
+ int64_t sizeM = op.getDescriptorA().getType().getTensor().getDimSize(0);
+ int64_t sizeN = op.getDescriptorB().getType().getTensor().getDimSize(1);
+ int64_t sizeK = op.getDescriptorA().getType().getTensor().getDimSize(1);
+
+ LLVM_DEBUG(DBGS() << "===--- GEMM D[" << sizeM << "][" << sizeN << "] += A["
+ << sizeM << "][" << sizeK << "] * B[" << sizeK << "]["
+ << sizeN << "] ---===\n");
+
+ int wgmmaShapeM, wgmmaShapeN, wgmmaShapeK;
+ if (failed(getWgmmaShape(sizeM, sizeN, rewriter.getF16Type(), wgmmaShapeM,
+ wgmmaShapeN, wgmmaShapeK))) {
+ return failure();
+ }
+
+ Value descriptorA = adaptor.getDescriptorA();
+ Value descriptorB = adaptor.getDescriptorB();
+
+ // Generate wgmma group
+
+ auto loc = op->getLoc();
+ MemRefType typeTensorA = op.getDescriptorA().getType().getTensor();
+ MemRefType typeTensorB = op.getDescriptorB().getType().getTensor();
+
+ auto makeAdd = [&](Value lhs, Value rhs) -> Value {
+ return rewriter.create<LLVM::AddOp>(loc, lhs.getType(), lhs, rhs);
+ };
+
+ auto iterateDescA = [&](Value desc, int iterM, int iterN,
+ int iterK) -> Value {
+ // todo : Handle column major
+ int byte = typeTensorA.getElementTypeBitWidth() / 8;
+ int tileShapeA = typeTensorA.getDimSize(1);
+ int incrementVal =
+ ((wgmmaShapeK * iterK) + (sizeK * tileShapeA * iterM)) * byte;
+ incrementVal = incrementVal >> exclude4LSB;
+ LLVM_DEBUG(DBGS() << "\t\t[m: " << iterM << " n: " << iterN << " k: "
+ << iterK << "] [wgmma descriptors] Descriptor A + "
+ << incrementVal << " | \t ");
+ if (!incrementVal)
+ return desc;
+ return makeAdd(desc, makeI64Const(rewriter, op, incrementVal));
+ };
+
+ auto iterateDescB = [&](Value desc, int iterM, int iterN,
+ int iterK) -> Value {
+ // todo : Handle row major
+ int byte = typeTensorB.getElementTypeBitWidth() / 8;
+ int incrementVal = typeTensorB.getDimSize(0) * wgmmaShapeK * iterK * byte;
+ incrementVal = incrementVal >> exclude4LSB;
+ LLVM_DEBUG(DBGSE() << "Descriptor B + " << incrementVal << "\n");
+ if (!incrementVal)
+ return desc;
+ return makeAdd(desc, makeI64Const(rewriter, op, incrementVal));
+ };
+
+ rewriter.create<NVVM::WgmmaFenceAlignedOp>(loc);
+
+ SmallVector<Value> wgmmaResults;
+ for (int iterM = 0; iterM < (sizeM / wgmmaShapeM); iterM++) {
+ Value matrixC = adaptor.getMatrixC()[iterM];
+ Value matrixD = op.getMatrixD()[iterM];
+ Type structType = getTypeConverter()->convertType(matrixD.getType());
+ LLVM_DEBUG(DBGS() << " D[" << (iterM * wgmmaShapeM) << ":"
+ << (iterM * wgmmaShapeM) + wgmmaShapeM << "][" << 0
+ << ":" << wgmmaShapeN << "] += \n");
+ for (int iterK = 0; iterK < (sizeK / wgmmaShapeK); iterK++) {
+ Value descA = iterateDescA(descriptorA, iterM, 0, iterK);
+ Value descB = iterateDescB(descriptorB, iterM, 0, iterK);
+ LLVM_DEBUG(DBGS() << "\t wgmma."
+ << "m" << wgmmaShapeM << "n" << wgmmaShapeN << "k"
+ << wgmmaShapeK << "(A[" << (iterM * wgmmaShapeM)
+ << ":" << (iterM * wgmmaShapeM) + wgmmaShapeM << "]["
+ << (iterK * wgmmaShapeK) << ":"
+ << (iterK * wgmmaShapeK + wgmmaShapeK) << "] * "
+ << " B[" << (iterK * wgmmaShapeK) << ":"
+ << (iterK * wgmmaShapeK + wgmmaShapeK) << "][" << 0
+ << ":" << wgmmaShapeN << "])\n");
+ matrixC = generateNVVMWgmmaOp(op->getContext(), rewriter, loc,
+ wgmmaShapeM, wgmmaShapeN, wgmmaShapeK,
+ structType, matrixC, descA, descB);
+ }
+ wgmmaResults.push_back(matrixC);
+ }
+ rewriter.create<NVVM::WgmmaGroupSyncAlignedOp>(loc);
+ rewriter.create<NVVM::WgmmaWaitGroupSyncOp>(loc, op.getWaitGroup());
+
+ ValueRange myres(wgmmaResults);
+ rewriter.replaceOp(op, myres);
+ return success();
+ }
+};
+
} // namespace
void mlir::populateNVGPUToNVVMConversionPatterns(LLVMTypeConverter &converter,
@@ -1156,6 +1315,7 @@ void mlir::populateNVGPUToNVVMConversionPatterns(LLVMTypeConverter &converter,
NVGPUTmaCreateDescriptorOpLowering, // nvgpu.tma.create.descriptor
NVGPUMBarrierArriveExpectTxLowering, // nvgpu.mbarrier.arrive.expect_tx
NVGPUGenerateGmmaDescriptorLowering, // nvgpu.wgmma.generate.descriptor
+ NVGPUWarpgroupMmaOpLowering, // nvgpu.warpgroup.mma
MmaSyncOptoNVVM, MmaLdMatrixOpToNVVM, NVGPUAsyncCopyLowering,
NVGPUAsyncCreateGroupLowering, NVGPUAsyncWaitLowering,
NVGPUMmaSparseSyncLowering>(converter);
diff --git a/mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp b/mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp
index d832a983a132d61..d96ed69982870b4 100644
--- a/mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp
+++ b/mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp
@@ -22,6 +22,7 @@
#include "mlir/IR/PatternMatch.h"
#include "mlir/IR/TypeUtilities.h"
#include "mlir/IR/Verifier.h"
+#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/StringExtras.h"
#include "llvm/ADT/TypeSwitch.h"
@@ -151,7 +152,6 @@ static LogicalResult verifyMmaSyncOp(Operation *op,
// - For F32 (TF32), F16, S8, and S4 data
// types the fundamental tensor core operation is of shape 8-by-8-by-128b.
// - F64 is an exception and is of shape 8-by-8-by-256b.
- constexpr int kThreads = 32; // 32 threads per warp
int64_t shapeM = 8;
int64_t shapeN = 8;
int64_t shapeK; // set based on data type (128b for all data types except F64)
@@ -206,17 +206,17 @@ static LogicalResult verifyMmaSyncOp(Operation *op,
// verify warp-wide size for vector a
int64_t sparseFactor = sparse ? 2 : 1;
- if (aShape[0] * aShape[1] * kThreads != m * k / sparseFactor)
+ if (aShape[0] * aShape[1] * kWarpSize != m * k / sparseFactor)
return op->emitOpError()
<< "expected " << m * k << " warp-wide matrix A elements";
// verify warp-wide size for vector b
- if (bShape[0] * bShape[1] * kThreads != k * n)
+ if (bShape[0] * bShape[1] * kWarpSize != k * n)
return op->emitOpError()
<< "expected " << k * n << " warp-wide matrix B elements";
// verify warp-wide size for vector c
- if (cShape[0] * cShape[1] * kThreads != m * n)
+ if (cShape[0] * cShape[1] * kWarpSize != m * n)
return op->emitOpError()
<< "expected " << m * n << " warp-wide matrix C elements";
@@ -402,6 +402,133 @@ LogicalResult GenerateGmmaDescriptorOp::verify() {
return success();
}
+//===----------------------------------------------------------------------===//
+// WarpgroupMmaOp
+//===----------------------------------------------------------------------===//
+
+LogicalResult isAllowedWGMMADataType(Type typeD, Type typeA, Type typeB) {
+ // F32 += F16 + F16
+ // F16 += F16 + F16
+ if (typeA.isF16() && typeB.isF16() && (typeD.isF32() || typeD.isF16()))
+ return success();
+ // F32 += TF32 + TF32
+ if (typeA.isTF32() && typeD.isF32() && typeB.isTF32())
+ return success();
+ // s32 += i8 + i8
+ if (typeA.isInteger(16) && typeB.isInteger(16) && typeD.isInteger(32))
+ return success();
+ // s32 += i1 + i1
+ if (typeA.isInteger(1) && typeB.isInteger(1) && typeD.isInteger(32))
+ return success();
+ // F32 += BF16 + BF16
+ // F16 += BF16 + BF16
+ if (typeA.isBF16() && typeB.isBF16() && (typeD.isF32() || typeD.isF16()))
+ return success();
+ // F16 += f8 + f8
+ // F32 += f8 + f8
+ if ((typeA.isFloat8E5M2() || typeA.isFloat8E4M3FN()) &&
+ (typeB.isFloat8E5M2() || typeB.isFloat8E4M3FN()) &&
+ (typeD.isF32() || typeD.isF16()))
+ return success();
+
+ return failure();
+}
+
+LogicalResult isAllowedSizeN(int sizeN, Type typeA) {
+ SmallVector<int> allowedN = {8, 16, 24, 32, 40, 48, 56, 64,
+ 72, 80, 88, 96, 104, 112, 120, 128,
+ 136, 144, 152, 160, 168, 176, 184, 192,
+ 200, 208, 216, 224, 232, 240, 248, 256};
+ SmallVector<int> allowedNshort = {8, 16, 24, 32, 48, 64,
+ 80, 96, 112, 128, 144, 160,
+ 176, 192, 208, 224, 240, 256};
+ if (typeA.isBF16() || typeA.isF16() || typeA.isTF32() ||
+ typeA.isFloat8E4M3FN() || typeA.isFloat8E5M2())
+ if (llvm::any_of(allo...
<truncated>
</pre>
</details>
https://github.com/llvm/llvm-project/pull/65440