[Mlir-commits] [mlir] [MLIR][XeGPU] Use tree-reduction optimization in multi_reduction unrolling (PR #198307)
Jianhui Li
llvmlistbot at llvm.org
Mon May 18 18:00:39 PDT 2026
================
@@ -804,13 +806,152 @@ struct UnrollConvertLayoutOp : public UnrollPattern<xegpu::ConvertLayoutOp> {
}
};
+/// Unrolls vector.multi_reduction by performing tree reduction with
+/// elementwise arith operations first, then a single multi_reduction
+/// per non-reduced tile position. This avoids generating long chains of
+/// multi_reduction ops (as the upstream pattern does) and is more efficient.
+///
+/// Example:
+/// vector.multi_reduction <32,64> to <32> (tile_shape=32, 32)
+/// -- Upstream pattern generates:
+/// %tmp1 = vector.multi_reduction %tile0, %zero_acc <32,32> to <32>
+/// %res = vector.multi_reduction %tmp1, %tile1 <32,32> to <32>
+/// -- This pattern generates:
+/// %tmp1 = arith.reduction %tile0, %tile1 <32,32> -> <32x,2> // elementwise
+/// %res = vector.multi_reduction %tmp1, %zero_acc <32,32> to <32>
+///
+/// The pattern supports any-D vectors but only handles the case where there
+/// is a single reduction dimension that is the innermost dim.
----------------
Jianhui-Li wrote:
The tree reduction structure is unnecessary here. Both sequential reduction (tile0 +
tile1 → tmp; tmp + tile2 → ...) and tree reduction ((tile0 + tile1), (tile2 + tile3) →
...) require exactly N-1 elementwise operations. There is no saving in total step count,
and since these ops execute sequentially on the hardware, there is no latency benefit
either.
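The op-count claim above can be checked with a small standalone sketch. This is only an illustration in plain C++: `Tile`, `addTiles`, `combineSequential`, and `combineTree` are hypothetical names, not MLIR or XeGPU APIs, and `std::vector<float>` stands in for a vector tile. Each call to `addTiles` models one elementwise arith op.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for one vector tile; not an MLIR type.
using Tile = std::vector<float>;

// Models one elementwise add of two equally sized tiles and counts it.
static Tile addTiles(const Tile &a, const Tile &b, std::size_t &ops) {
  assert(a.size() == b.size());
  Tile out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    out[i] = a[i] + b[i];
  ++ops;
  return out;
}

// Sequential chain: ((t0 + t1) + t2) + ...
Tile combineSequential(const std::vector<Tile> &tiles, std::size_t &ops) {
  assert(!tiles.empty());
  Tile acc = tiles[0];
  for (std::size_t i = 1; i < tiles.size(); ++i)
    acc = addTiles(acc, tiles[i], ops);
  return acc;
}

// Pairwise tree: (t0 + t1), (t2 + t3), ... repeated until one tile remains.
Tile combineTree(std::vector<Tile> tiles, std::size_t &ops) {
  assert(!tiles.empty());
  while (tiles.size() > 1) {
    std::vector<Tile> next;
    for (std::size_t i = 0; i + 1 < tiles.size(); i += 2)
      next.push_back(addTiles(tiles[i], tiles[i + 1], ops));
    if (tiles.size() % 2 == 1)
      next.push_back(tiles.back()); // odd tile is carried to the next round
    tiles = std::move(next);
  }
  return tiles[0];
}
```

For any N input tiles, both functions increment the counter N-1 times, since every add removes exactly one tile from the working set.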
The real issue with the upstream unrolling is the order of operations: it performs the
in-tile reduction first (collapsing the reduction dimension within each tile via
multi_reduction), then chains the cross-tile accumulation. This is expensive on GPU because
in-tile reduction involves cross-lane shuffles, and doing it N times multiplies that
cost.
The fix should be simpler:
1. Cross-tile reduction first: collect all tiles along the reduction dimension and
combine them with elementwise arith ops (cheap, no shuffles). This collapses N tiles into
a single tile per position.
2. In-tile reduction last: perform one multi_reduction per position to eliminate the
reduction dimension. This is the only step that requires cross-lane shuffles, and it
happens exactly once per output tile.
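The two steps above can be sketched as a small standalone C++ model that compares both orderings. It is a sketch under stated assumptions, not the PR's implementation: `Tile2D`, `Counters`, `reduceInner`, `addElementwise`, `reduceInTileFirst`, and `reduceCrossTileFirst` are hypothetical names, the reduction kind is assumed to be add, and `reduceInner` merely stands in for a `vector.multi_reduction` over the innermost dim.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a 2-D tile (row-major); not an MLIR type.
struct Tile2D {
  std::size_t rows;
  std::size_t cols;
  std::vector<float> data;
};

// Counts the two kinds of ops the comment distinguishes.
struct Counters {
  std::size_t elementwiseOps = 0;
  std::size_t inTileReductions = 0;
};

// Stand-in for one multi_reduction <add> over the innermost dim with an
// accumulator operand. On a GPU this is the step that needs shuffles.
std::vector<float> reduceInner(const Tile2D &t, const std::vector<float> &acc,
                               Counters &c) {
  assert(t.data.size() == t.rows * t.cols);
  assert(acc.size() == t.rows);
  std::vector<float> out = acc;
  for (std::size_t r = 0; r < t.rows; ++r)
    for (std::size_t col = 0; col < t.cols; ++col)
      out[r] += t.data[r * t.cols + col];
  ++c.inTileReductions;
  return out;
}

// Stand-in for one elementwise add on two tiles of the same shape.
Tile2D addElementwise(const Tile2D &a, const Tile2D &b, Counters &c) {
  assert(a.rows == b.rows && a.cols == b.cols);
  assert(a.data.size() == b.data.size());
  Tile2D out = a;
  for (std::size_t i = 0; i < out.data.size(); ++i)
    out.data[i] += b.data[i];
  ++c.elementwiseOps;
  return out;
}

// Upstream-style order: one in-tile reduction per tile, chained via acc.
std::vector<float> reduceInTileFirst(const std::vector<Tile2D> &tiles,
                                     Counters &c) {
  assert(!tiles.empty());
  std::vector<float> acc(tiles[0].rows, 0.0f);
  for (const Tile2D &t : tiles)
    acc = reduceInner(t, acc, c);
  return acc;
}

// Suggested order: combine tiles elementwise, then reduce in-tile once.
std::vector<float> reduceCrossTileFirst(const std::vector<Tile2D> &tiles,
                                        Counters &c) {
  assert(!tiles.empty());
  Tile2D combined = tiles[0];
  for (std::size_t i = 1; i < tiles.size(); ++i)
    combined = addElementwise(combined, tiles[i], c);
  std::vector<float> zero(combined.rows, 0.0f);
  return reduceInner(combined, zero, c);
}
```

With N tiles along the reduction dimension, the first ordering performs N in-tile reductions per output tile, while the second performs N-1 elementwise ops and a single in-tile reduction.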
For multiple reduction dimensions, the same principle applies: reduce each reduction
dimension down to tile size using elementwise ops first, then perform the final in-tile
reduction to remove those dimensions entirely. In this way, we can also remove the
limitation of this PR (only handling the innermost dim reduction).
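A minimal sketch of the multi-dimension case, assuming a 2-D input where both dimensions are reduced with add and the shape divides evenly by the tile shape. `MultiDimResult` and `reduceBothDims` are hypothetical names used only for illustration; a flat row-major `std::vector<float>` stands in for the source vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result record: the reduced value plus op counts.
struct MultiDimResult {
  float value;
  std::size_t elementwiseOps;
  std::size_t inTileReductions;
};

// Reduces a rows x cols row-major matrix over both dims. Tiles along both
// reduction dims are first folded elementwise into one tile-sized
// accumulator; a single in-tile reduction then removes both dims.
MultiDimResult reduceBothDims(const std::vector<float> &m, std::size_t rows,
                              std::size_t cols, std::size_t tileRows,
                              std::size_t tileCols) {
  assert(rows > 0 && cols > 0 && tileRows > 0 && tileCols > 0);
  assert(m.size() == rows * cols);
  assert(rows % tileRows == 0 && cols % tileCols == 0);

  std::vector<float> acc(tileRows * tileCols, 0.0f);
  std::size_t elementwiseOps = 0;
  bool first = true;
  for (std::size_t tr = 0; tr < rows; tr += tileRows) {
    for (std::size_t tc = 0; tc < cols; tc += tileCols) {
      for (std::size_t r = 0; r < tileRows; ++r)
        for (std::size_t c = 0; c < tileCols; ++c)
          acc[r * tileCols + c] += m[(tr + r) * cols + (tc + c)];
      // The first tile only initializes the accumulator, so it is not
      // counted as an elementwise op.
      if (!first)
        ++elementwiseOps;
      first = false;
    }
  }

  // Single final in-tile reduction over the tile-sized accumulator.
  float total = 0.0f;
  for (float v : acc)
    total += v;
  return MultiDimResult{total, elementwiseOps, 1};
}
```

In this model a 4x4 input with 2x2 tiles has four tiles, so it takes three elementwise ops and one in-tile reduction, regardless of which dimension the tiles are laid out along.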
https://github.com/llvm/llvm-project/pull/198307