[Mlir-commits] [mlir] [MLIR][GPU] Broadcast via 2 scalar registers, not ds_swizzle (PR #165879)

Fri Oct 31 09:24:42 PDT 2025

https://github.com/newling created https://github.com/llvm/llvm-project/pull/165879

This PR changes the way we broadcast from 2 lanes to 64 lanes (lane 31 -> [0, 32) and lane 63 -> [32, 64)), as suggested in https://github.com/llvm/llvm-project/pull/165764. The advantage of this approach is that it should in theory be faster because, as I understand it, it avoids the LDS crossbar that ds_swizzle goes through. One possible disadvantage is that it uses 2 scalar registers instead of 0 (is that a valid concern, or is this never an issue on AMDGPU?).
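
For illustration only, the intended per-lane result can be modelled on the host as below. This is a sketch, not code from the patch: `Wave`, `readLane`, and `broadcastClusterResults` are hypothetical names, and the wave is modelled as a plain array of 64 floats. It only shows which value each lane should end up with after the two readlanes and the select.

```cpp
#include <array>

// Hypothetical host-side model of a 64-lane subgroup (one float per lane).
constexpr int kSubgroupSize = 64;
using Wave = std::array<float, kSubgroupSize>;

// Models a readlane: the value held by `lane` becomes a single uniform
// (scalar) value visible to every lane.
float readLane(const Wave &wave, int lane) { return wave[lane]; }

// Lanes [0, 32) receive lane 31's value and lanes [32, 64) receive
// lane 63's value, mirroring the cmpi(ule, laneId, 31) + select in the patch.
Wave broadcastClusterResults(const Wave &res) {
  const float lane31 = readLane(res, 31);
  const float lane63 = readLane(res, 63);
  Wave out{};
  for (int laneId = 0; laneId < kSubgroupSize; ++laneId)
    out[laneId] = (laneId <= 31) ? lane31 : lane63;
  return out;
}
```

In this model the two values read from lanes 31 and 63 correspond to the 2 scalar registers mentioned above.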

This code should ultimately migrate to a dialect/directory specific to AMDGPU. 


From d0a5b33365bbedd263cdbad21479aec5338bd728 Mon Sep 17 00:00:00 2001
From: James Newling <james.newling at gmail.com>
Date: Fri, 31 Oct 2025 09:19:19 -0700
Subject: [PATCH] Broadcast via 2 scalar registers, not ds_swizzle

Signed-off-by: James Newling <james.newling at gmail.com>
---
 .../GPU/Transforms/SubgroupReduceLowering.cpp | 32 +++++++++++++------
 .../Dialect/GPU/subgroup-reduce-lowering.mlir |  8 +++--
 2 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp b/mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp
index ec1571a56fe4a..52ec9a09b2fcd 100644
--- a/mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp
+++ b/mlir/lib/Dialect/GPU/Transforms/SubgroupReduceLowering.cpp
@@ -432,22 +432,36 @@ createSubgroupDPPReduction(PatternRewriter &rewriter, gpu::SubgroupReduceOp op,
       // If subgroup size is 64 and cluster size is 64, we don't need lanes [0,
       // 16) and [32, 48) to have the correct cluster-32 reduction values at
       // this point, because only lane 63's value will ultimately be read in
-      // this full-cluster case.
+      // this clusterSize=subgroupSize case.
       //
       // If subgroup size is 64 and cluster size is 32, we need to ensure that
       // lanes [0, 16) and [32, 48) have the correct final cluster-32 reduction
       // values (subgroup_reduce guarantees that all lanes within each cluster
       // contain the final reduction value). We do this by broadcasting lane
       // 31's value to lanes [0, 16) and lanes 63's value to lanes [32, 48).
-      //
-      // See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations
-      // for an illustration of how this within-cluster broadcast works with a
-      // swizzle.
       if (ci.subgroupSize == 64 && ci.clusterSize == 32) {
-        res =
-            amdgpu::SwizzleBitModeOp::create(rewriter, loc, res, /*and_mask=*/0,
-                                             /*or_mask=*/31,
-                                             /*xor_mask=*/0);
+
+        Value c31 =
+            arith::ConstantOp::create(rewriter, loc, rewriter.getI32Type(),
+                                      rewriter.getI32IntegerAttr(31));
+        Value lane31 =
+            ROCDL::ReadlaneOp::create(rewriter, loc, res.getType(), res, c31);
+
+        Value c63 =
+            arith::ConstantOp::create(rewriter, loc, rewriter.getI32Type(),
+                                      rewriter.getI32IntegerAttr(63));
+        Value lane63 =
+            ROCDL::ReadlaneOp::create(rewriter, loc, res.getType(), res, c63);
+
+        Value laneId =
+            gpu::LaneIdOp::create(rewriter, loc, rewriter.getIndexAttr(64));
+
+        // If laneId < 32, select lane31, else select lane63:
+        Value lowerHalf = arith::CmpIOp::create(
+            rewriter, loc, arith::CmpIPredicate::ule, laneId,
+            arith::ConstantIndexOp::create(rewriter, loc, 31));
+
+        res = arith::SelectOp::create(rewriter, loc, lowerHalf, lane31, lane63);
       }
     } else if (chipset.majorVersion <= 12) {
       // Use a permute lane to cross rows (row 1 <-> row 0, row 3 <-> row 2).
diff --git a/mlir/test/Dialect/GPU/subgroup-reduce-lowering.mlir b/mlir/test/Dialect/GPU/subgroup-reduce-lowering.mlir
index 1adc4181e05d3..412a1187c6603 100644
--- a/mlir/test/Dialect/GPU/subgroup-reduce-lowering.mlir
+++ b/mlir/test/Dialect/GPU/subgroup-reduce-lowering.mlir
@@ -412,8 +412,12 @@ gpu.module @kernels {
   //         lanes [16, 32) and [48, 64), respectively.
   // CHECK-GFX9: %[[BCAST15:.+]] = amdgpu.dpp %[[A3]] %[[A3]]  row_bcast_15(unit) {row_mask = 10 : i32} : f32
   // CHECK-GFX9: %[[SUM:.+]] = arith.addf %[[A3]], %[[BCAST15]] : f32
-  // CHECK-GFX9: %[[SWIZ:.+]] = amdgpu.swizzle_bitmode %[[SUM]] 0 31 0 : f32
-  // CHECK-GFX9: "test.consume"(%[[SWIZ]]) : (f32) -> ()
+  // CHECK-GFX9: %[[RLANE31:.+]] = rocdl.readlane %[[SUM]]{{.*}}c31
+  // CHECK-GFX9: %[[RLANE63:.+]] = rocdl.readlane %[[SUM]]{{.*}}c63
+  // CHECK-GFX9: %[[LANEID:.+]] = gpu.lane_id
+  // CHECK-GFX9: %[[CMP:.+]] = arith.cmpi ule, %[[LANEID]]{{.*}}c31
+  // CHECK-GFX9: %[[SEL:.+]] = arith.select %[[CMP]], %[[RLANE31]], %[[RLANE63]] : f32
+  // CHECK-GFX9: "test.consume"(%[[SEL]]) : (f32) -> ()
   //
   //   On gfx1030, the final step is to permute the lanes and perform final reduction:
   // CHECK-GFX10: rocdl.permlanex16