[Mlir-commits] [mlir] [MLIR][GPU] Broadcast via 2 scalar registers, not ds_swizzle (PR #165879)

Mon Nov 3 08:31:47 PST 2025

================
@@ -432,22 +432,36 @@ createSubgroupDPPReduction(PatternRewriter &rewriter, gpu::SubgroupReduceOp op,
       // If subgroup size is 64 and cluster size is 64, we don't need lanes [0,
       // 16) and [32, 48) to have the correct cluster-32 reduction values at
       // this point, because only lane 63's value will ultimately be read in
-      // this full-cluster case.
+      // this clusterSize=subgroupSize case.
       //
       // If subgroup size is 64 and cluster size is 32, we need to ensure that
       // lanes [0, 16) and [32, 48) have the correct final cluster-32 reduction
       // values (subgroup_reduce guarantees that all lanes within each cluster
       // contain the final reduction value). We do this by broadcasting lane
       // 31's value to lanes [0, 16) and lanes 63's value to lanes [32, 48).
-      //
-      // See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations
-      // for an illustration of how this within-cluster broadcast works with a
-      // swizzle.
       if (ci.subgroupSize == 64 && ci.clusterSize == 32) {
-        res =
-            amdgpu::SwizzleBitModeOp::create(rewriter, loc, res, /*and_mask=*/0,
-                                             /*or_mask=*/31,
-                                             /*xor_mask=*/0);
+
+        Value c31 =
----------------
krzysz00 wrote:

Ok, so, as a further note: if we're on `gfx950`, we have `v_permlane16_swap`, which does exactly what we want. So we should add a special case for gfx950 (and consider one for gfx1250).
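
For reference, here is a host-side sketch of the broadcast that the removed `SwizzleBitModeOp` performed. It assumes the usual bit-mode swizzle rule, where each lane reads from lane `((lane & and_mask) | or_mask) ^ xor_mask` within its own group of 32 lanes. The names `swizzleSourceLane`, `swizzleBitMode`, `kSubgroupSize` and `kSwizzleGroup` are made up for illustration; this only models lane indices on the CPU and is not the lowering in this PR.

```cpp
#include <array>
#include <cstdint>

// Hypothetical constants for this sketch: a 64-lane subgroup, with the
// swizzle operating independently on each group of 32 lanes.
constexpr int kSubgroupSize = 64;
constexpr int kSwizzleGroup = 32;

// Lane that `lane` reads from under the assumed bit-mode swizzle rule.
constexpr int swizzleSourceLane(int lane, int andMask, int orMask,
                                int xorMask) {
  int groupBase = lane / kSwizzleGroup * kSwizzleGroup;
  int laneInGroup = lane % kSwizzleGroup;
  int src = ((laneInGroup & andMask) | orMask) ^ xorMask;
  return groupBase + src;
}

// Apply the swizzle to one value per lane.
std::array<int32_t, kSubgroupSize>
swizzleBitMode(const std::array<int32_t, kSubgroupSize> &in, int andMask,
               int orMask, int xorMask) {
  std::array<int32_t, kSubgroupSize> out{};
  for (int lane = 0; lane < kSubgroupSize; ++lane)
    out[lane] = in[swizzleSourceLane(lane, andMask, orMask, xorMask)];
  return out;
}
```

With `and_mask=0, or_mask=31, xor_mask=0`, every lane in [0, 32) reads lane 31 and every lane in [32, 64) reads lane 63. That is a superset of what the comment in the diff asks for (lanes [0, 16) and [32, 48)).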

https://github.com/llvm/llvm-project/pull/165879