[Mlir-commits] [mlir] [MLIR][GPU] Broadcast via 2 scalar registers, not ds_swizzle (PR #165879)
Krzysztof Drewniak
llvmlistbot at llvm.org
Mon Nov 3 08:31:47 PST 2025
================
@@ -432,22 +432,36 @@ createSubgroupDPPReduction(PatternRewriter &rewriter, gpu::SubgroupReduceOp op,
// If subgroup size is 64 and cluster size is 64, we don't need lanes [0,
// 16) and [32, 48) to have the correct cluster-32 reduction values at
// this point, because only lane 63's value will ultimately be read in
- // this full-cluster case.
+ // this clusterSize=subgroupSize case.
//
// If subgroup size is 64 and cluster size is 32, we need to ensure that
// lanes [0, 16) and [32, 48) have the correct final cluster-32 reduction
// values (subgroup_reduce guarantees that all lanes within each cluster
// contain the final reduction value). We do this by broadcasting lane
// 31's value to lanes [0, 16) and lane 63's value to lanes [32, 48).
- //
- // See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations
- // for an illustration of how this within-cluster broadcast works with a
- // swizzle.
if (ci.subgroupSize == 64 && ci.clusterSize == 32) {
- res =
- amdgpu::SwizzleBitModeOp::create(rewriter, loc, res, /*and_mask=*/0,
- /*or_mask=*/31,
- /*xor_mask=*/0);
+
+ Value c31 =
----------------
krzysz00 wrote:
Ok, so, as a further note: if we're on `gfx950`, we have `v_permlane16_swap`, which does exactly what we want. So we should add a special case for gfx950 (and consider it for gfx1250).
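
For anyone following along, here is a rough host-side model of what the removed swizzle was doing, assuming the usual bit-mask-mode formula in which each lane reads from lane `((lane & and_mask) | or_mask) ^ xor_mask` within its group of 32. `swizzleBitMode` is a hypothetical helper written for illustration, not an MLIR or LLVM API, and the fixed 64-lane array is just a stand-in for a wave64 subgroup.

```cpp
#include <array>
#include <cassert>

// Hypothetical model (not an MLIR/LLVM API): simulates a bit-mask-mode
// swizzle over a 64-lane subgroup. Assumes the source lane is computed
// per group of 32 lanes as ((lane & and) | or) ^ xor.
std::array<int, 64> swizzleBitMode(const std::array<int, 64> &in,
                                   unsigned andMask, unsigned orMask,
                                   unsigned xorMask) {
  // Each mask is a 5-bit field in this model.
  assert(andMask < 32 && orMask < 32 && xorMask < 32);
  std::array<int, 64> out{};
  for (unsigned lane = 0; lane < 64; ++lane) {
    unsigned base = lane & ~31u; // first lane of this group of 32
    unsigned local = lane & 31u; // position within the group
    unsigned src = ((local & andMask) | orMask) ^ xorMask;
    out[lane] = in[base + src];
  }
  return out;
}
```

With `and_mask=0`, `or_mask=31`, `xor_mask=0` (the values in the removed lines), every lane in this model reads the last lane of its own group of 32, so lanes [0, 16) pick up lane 31's value and lanes [32, 48) pick up lane 63's value, which is the broadcast the code comment describes.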
https://github.com/llvm/llvm-project/pull/165879