[Mlir-commits] [mlir] [MLIR][GPU] Broadcast via 2 scalar registers, not ds_swizzle (PR #165879)
James Newling
llvmlistbot at llvm.org
Fri Oct 31 09:24:42 PDT 2025
https://github.com/newling created https://github.com/llvm/llvm-project/pull/165879
This PR changes how we broadcast from 2 lanes to 64 lanes (lane 31 -> [0, 32) and lane 63 -> [32, 64)), as suggested in https://github.com/llvm/llvm-project/pull/165764. The advantage of this approach is that it should, in theory, be faster because, as I understand it, it avoids the LDS crossbar that ds_swizzle goes through. One possible disadvantage is that it uses 2 scalar registers instead of 0 (is that a valid concern, or is this never an issue on AMDGPU?).
This code should ultimately migrate to a dialect/directory specific to AMDGPU.
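To illustrate the intended lane-level behaviour, here is a minimal host-side sketch in plain C++ (not MLIR, and not part of the patch). The names `Subgroup`, `readlane`, `swizzleBitMode` and `broadcastHalves` are hypothetical helpers introduced only for this illustration. The `swizzleBitMode` model reflects my reading of the ds_swizzle bit-mask mode (the masks are applied to the lane index within each group of 32 lanes) and should be treated as an assumption rather than a statement of the ISA.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side model of one 64-lane subgroup; one float per lane.
constexpr std::size_t kSubgroupSize = 64;
using Subgroup = std::array<float, kSubgroupSize>;

// Models a readlane: returns the value held by `lane`, which is then
// uniform across the subgroup (i.e. it can live in a scalar register).
float readlane(const Subgroup &values, std::size_t lane) {
  assert(lane < kSubgroupSize && "lane index out of range");
  return values[lane];
}

// Assumed model of the old path (swizzle in bit-mask mode): each lane reads
// from lane ((i & andMask) | orMask) ^ xorMask within its own group of 32.
Subgroup swizzleBitMode(const Subgroup &values, unsigned andMask,
                        unsigned orMask, unsigned xorMask) {
  Subgroup result{};
  for (std::size_t i = 0; i < kSubgroupSize; ++i) {
    unsigned inGroup = static_cast<unsigned>(i) & 0x1f;
    unsigned src = (((inGroup & andMask) | orMask) ^ xorMask) & 0x1f;
    src |= static_cast<unsigned>(i) & 0x20; // stay in the same half
    result[i] = values[src];
  }
  return result;
}

// Models the new path: two readlanes followed by a per-lane select.
Subgroup broadcastHalves(const Subgroup &values) {
  const float lane31 = readlane(values, 31);
  const float lane63 = readlane(values, 63);
  Subgroup result{};
  for (std::size_t laneId = 0; laneId < kSubgroupSize; ++laneId) {
    const bool lowerHalf = laneId <= 31; // same predicate as cmpi ule 31
    result[laneId] = lowerHalf ? lane31 : lane63;
  }
  return result;
}
```

Under these assumptions, `broadcastHalves(v)` and `swizzleBitMode(v, 0, 31, 0)` give the same result for any input `v`: lanes [0, 32) end up with lane 31's value and lanes [32, 64) with lane 63's value.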
From d0a5b33365bbedd263cdbad21479aec5338bd728 Mon Sep 17 00:00:00 2001
From: James Newling <james.newling at gmail.com>
Date: Fri, 31 Oct 2025 09:19:19 -0700
Subject: [PATCH] Broadcast via 2 scalar registers, not ds_swizzle
Signed-off-by: James Newling <james.newling at gmail.com>
---
.../GPU/Transforms/SubgroupReduceLowering.cpp | 32 +++++++++++++------
.../Dialect/GPU/subgroup-reduce-lowering.mlir | 8 +++--
2 files changed, 29 insertions(+), 11 deletions(-)
diff --git a/mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp b/mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp
index ec1571a56fe4a..52ec9a09b2fcd 100644
--- a/mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp
+++ b/mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp
@@ -432,22 +432,36 @@ createSubgroupDPPReduction(PatternRewriter &rewriter, gpu::SubgroupReduceOp op,
// If subgroup size is 64 and cluster size is 64, we don't need lanes [0,
// 16) and [32, 48) to have the correct cluster-32 reduction values at
// this point, because only lane 63's value will ultimately be read in
- // this full-cluster case.
+ // this clusterSize=subgroupSize case.
//
// If subgroup size is 64 and cluster size is 32, we need to ensure that
// lanes [0, 16) and [32, 48) have the correct final cluster-32 reduction
// values (subgroup_reduce guarantees that all lanes within each cluster
// contain the final reduction value). We do this by broadcasting lane
// 31's value to lanes [0, 16) and lanes 63's value to lanes [32, 48).
- //
- // See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations
- // for an illustration of how this within-cluster broadcast works with a
- // swizzle.
if (ci.subgroupSize == 64 && ci.clusterSize == 32) {
- res =
- amdgpu::SwizzleBitModeOp::create(rewriter, loc, res, /*and_mask=*/0,
- /*or_mask=*/31,
- /*xor_mask=*/0);
+
+ Value c31 =
+ arith::ConstantOp::create(rewriter, loc, rewriter.getI32Type(),
+ rewriter.getI32IntegerAttr(31));
+ Value lane31 =
+ ROCDL::ReadlaneOp::create(rewriter, loc, res.getType(), res, c31);
+
+ Value c63 =
+ arith::ConstantOp::create(rewriter, loc, rewriter.getI32Type(),
+ rewriter.getI32IntegerAttr(63));
+ Value lane63 =
+ ROCDL::ReadlaneOp::create(rewriter, loc, res.getType(), res, c63);
+
+ Value laneId =
+ gpu::LaneIdOp::create(rewriter, loc, rewriter.getIndexAttr(64));
+
+ // If laneId < 32, select lane31, else select lane63:
+ Value lowerHalf = arith::CmpIOp::create(
+ rewriter, loc, arith::CmpIPredicate::ule, laneId,
+ arith::ConstantIndexOp::create(rewriter, loc, 31));
+
+ res = arith::SelectOp::create(rewriter, loc, lowerHalf, lane31, lane63);
}
} else if (chipset.majorVersion <= 12) {
// Use a permute lane to cross rows (row 1 <-> row 0, row 3 <-> row 2).
diff --git a/mlir/test/Dialect/GPU/subgroup-reduce-lowering.mlir b/mlir/test/Dialect/GPU/subgroup-reduce-lowering.mlir
index 1adc4181e05d3..412a1187c6603 100644
--- a/mlir/test/Dialect/GPU/subgroup-reduce-lowering.mlir
+++ b/mlir/test/Dialect/GPU/subgroup-reduce-lowering.mlir
@@ -412,8 +412,12 @@ gpu.module @kernels {
// lanes [16, 32) and [48, 64), respectively.
// CHECK-GFX9: %[[BCAST15:.+]] = amdgpu.dpp %[[A3]] %[[A3]] row_bcast_15(unit) {row_mask = 10 : i32} : f32
// CHECK-GFX9: %[[SUM:.+]] = arith.addf %[[A3]], %[[BCAST15]] : f32
- // CHECK-GFX9: %[[SWIZ:.+]] = amdgpu.swizzle_bitmode %[[SUM]] 0 31 0 : f32
- // CHECK-GFX9: "test.consume"(%[[SWIZ]]) : (f32) -> ()
+ // CHECK-GFX9: %[[RLANE31:.+]] = rocdl.readlane %[[SUM]]{{.*}}c31
+ // CHECK-GFX9: %[[RLANE63:.+]] = rocdl.readlane %[[SUM]]{{.*}}c63
+ // CHECK-GFX9: %[[LANEID:.+]] = gpu.lane_id
+ // CHECK-GFX9: %[[CMP:.+]] = arith.cmpi ule, %[[LANEID]]{{.*}}c31
+ // CHECK-GFX9: %[[SEL:.+]] = arith.select %[[CMP]], %[[RLANE31]], %[[RLANE63]] : f32
+ // CHECK-GFX9: "test.consume"(%[[SEL]]) : (f32) -> ()
//
// On gfx1030, the final step is to permute the lanes and perform final reduction:
// CHECK-GFX10: rocdl.permlanex16