[Mlir-commits] [mlir] [mlir][nvgpu] Introduce Multicast Capability to `nvgpu.tma.async.load` (PR #76935)
Guray Ozen
llvmlistbot at llvm.org
Thu Jan 4 02:21:40 PST 2024
https://github.com/grypp created https://github.com/llvm/llvm-project/pull/76935
This PR adds multicast support to the `nvgpu.tma.async.load` Op. The lower-level `nvvm.cp.async.bulk.tensor.shared.cluster.global` NVVM Op already supports multicast; this PR introduces an optional `multicastMask` operand on the NVGPU Op and lowers it to the NVVM operation.
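The new optional `multicast_mask` operand appears in the assembly after the destination and before the optional predicate. A minimal usage sketch, reusing the type aliases and value names from the test added below (the i16 mask selects which CTAs in the cluster receive the loaded tile):

  nvgpu.tma.async.load %tensorMap2d[%crd0, %crd1], %mbarrier[%c0] to %buffer2d
      multicast_mask = %multicastMask
      : !tensorMap2d, !mbarrier -> memref<32x32xf32,3>

Omitting `multicast_mask` keeps the existing single-CTA behavior unchanged.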
From e8bb07467b958d85eac35a41af8de7a58ede0e20 Mon Sep 17 00:00:00 2001
From: Guray Ozen <guray.ozen at gmail.com>
Date: Thu, 4 Jan 2024 11:20:46 +0100
Subject: [PATCH] [mlir][nvgpu] Introduce Multicast Capability to
`nvgpu.tma.async.load`
This PR adds multicast support to the `nvgpu.tma.async.load` Op. The lower-level `nvvm.cp.async.bulk.tensor.shared.cluster.global` NVVM Op already supports multicast; this PR introduces an optional `multicastMask` operand on the NVGPU Op and lowers it to the NVVM operation.
---
mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td | 16 +++++++------
.../Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp | 3 ++-
.../NVGPU/TransformOps/NVGPUTransformOps.cpp | 2 +-
.../Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir | 23 +++++++++++++++++++
4 files changed, 35 insertions(+), 9 deletions(-)
diff --git a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
index 440f7d0380eb17..7e139663d74b47 100644
--- a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
+++ b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
@@ -642,16 +642,18 @@ def NVGPU_TmaAsyncLoadOp : NVGPU_Op<"tma.async.load", [AttrSizedOperandSegments]
The Op uses `$barrier` mbarrier based completion mechanism.
}];
- let arguments = (ins Arg<AnyMemRef, "", [MemWriteAt<0, FullEffect>]>:$dst,
- NVGPU_MBarrierGroup:$barriers,
- NVGPU_TensorMapDescriptor:$tensorMapDescriptor,
- Variadic<Index>:$coordinates,
- Index:$mbarId,
- Optional<I1>:$predicate);
+ let arguments = (ins Arg<AnyMemRef, "", [MemWriteAt<0, FullEffect>]>:$dst,
+ NVGPU_MBarrierGroup:$barriers,
+ NVGPU_TensorMapDescriptor:$tensorMapDescriptor,
+ Variadic<Index>:$coordinates,
+ Index:$mbarId,
+ Optional<I16>:$multicastMask,
+ Optional<I1>:$predicate);
let assemblyFormat = [{
$tensorMapDescriptor `[` $coordinates `]` `,` $barriers `[` $mbarId `]`
`to` $dst
- (`,` `predicate` `=` $predicate^)?
+ (`multicast_mask` `=` $multicastMask^ )?
+ (`,` `predicate` `=` $predicate^)?
attr-dict `:` type($tensorMapDescriptor) `,` type($barriers)
`->` type($dst)
}];
diff --git a/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp b/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
index 9cd3a5ce65ce5f..db84e5cf62a5e9 100644
--- a/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
+++ b/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
@@ -990,7 +990,8 @@ struct NVGPUTmaAsyncLoadOpLowering
}
rewriter.replaceOpWithNewOp<NVVM::CpAsyncBulkTensorGlobalToSharedClusterOp>(
op, dest, adaptor.getTensorMapDescriptor(), coords, barrier,
- ValueRange{}, Value{}, Value{}, adaptor.getPredicate());
+ ValueRange{}, adaptor.getMulticastMask(), Value{},
+ adaptor.getPredicate());
return success();
}
};
diff --git a/mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp b/mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
index b68bed3aa53cf9..aebdd0a4ee4147 100644
--- a/mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
+++ b/mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
@@ -980,7 +980,7 @@ OpFoldResult HopperBuilder::buildTmaAsyncLoad(
Value zero = rewriter.create<arith::ConstantIndexOp>(loc, 0);
Operation *loadOp = rewriter.create<nvgpu::TmaAsyncLoadOp>(
loc, sharedMemref, barrier, globalDesc, ValueRange{zero, zero}, zero,
- Value());
+ Value(), Value());
loadOps.push_back(loadOp);
auto mixedSizes = memref::getMixedSizes(rewriter, loc, sharedMemref);
SmallVector<AffineExpr> symbols(mixedSizes.size());
diff --git a/mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir b/mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
index e11449e6f7c457..b8a0f75d1cc8b9 100644
--- a/mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
+++ b/mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
@@ -704,6 +704,29 @@ func.func @async_tma_load_pred(%tensorMap1d: !tensorMap1d, %tensorMap2d: !tensor
func.return
}
+func.func @async_tma_load_multicast(
+ %tensorMap1d: !tensorMap1d, %tensorMap2d: !tensorMap2d,
+ %tensorMap3d: !tensorMap3d, %tensorMap4d: !tensorMap4d,
+ %tensorMap5d: !tensorMap5d, %buffer1d: memref<128xf32,3>,
+ %buffer2d: memref<32x32xf32,3>, %buffer3d: memref<2x32x32xf32,3>,
+ %buffer4d: memref<2x2x32x32xf32,3>, %buffer5d: memref<2x2x2x32x32xf32,3>,
+ %mbarrier: !mbarrier,
+ %multicastMask: i16) {
+ %c0 = arith.constant 0 : index
+ %crd0 = arith.constant 0 : index
+ %crd1 = arith.constant 0 : index
+ // CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}]
+ nvgpu.tma.async.load %tensorMap1d[%crd0], %mbarrier[%c0] to %buffer1d multicast_mask = %multicastMask : !tensorMap1d, !mbarrier -> memref<128xf32,3>
+ // CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}, %{{.*}}]
+ nvgpu.tma.async.load %tensorMap2d[%crd0, %crd1], %mbarrier[%c0] to %buffer2d multicast_mask = %multicastMask : !tensorMap2d, !mbarrier -> memref<32x32xf32,3>
+ // CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}, %{{.*}}, %{{.*}}]
+ nvgpu.tma.async.load %tensorMap3d[%crd0, %crd1, %crd0], %mbarrier[%c0] to %buffer3d multicast_mask = %multicastMask : !tensorMap3d, !mbarrier -> memref<2x32x32xf32,3>
+ // CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}]
+ nvgpu.tma.async.load %tensorMap4d[%crd0, %crd1, %crd1, %crd0], %mbarrier[%c0] to %buffer4d multicast_mask = %multicastMask : !tensorMap4d, !mbarrier -> memref<2x2x32x32xf32,3>
+ // CHECK: nvvm.cp.async.bulk.tensor.shared.cluster.global %{{.*}}, %{{.*}}, %{{.*}} box[%{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}, %{{.*}}]
+ nvgpu.tma.async.load %tensorMap5d[%crd0, %crd1, %crd1, %crd0, %crd0], %mbarrier[%c0] to %buffer5d multicast_mask = %multicastMask : !tensorMap5d, !mbarrier -> memref<2x2x2x32x32xf32,3>
+ func.return
+}
func.func @create_tensor_map(%devicePtr2d : memref<64x128xf32>, %devicePtr1d : memref<128xf32>) {
%crd0 = arith.constant 64 : index