[Mlir-commits] [mlir] [MLIR][XeGPU] Use tree-reduction optimization in multi_reduction unrolling (PR #198307)

Mon May 18 18:00:39 PDT 2026

================
@@ -804,13 +806,152 @@ struct UnrollConvertLayoutOp : public UnrollPattern<xegpu::ConvertLayoutOp> {
   }
 };
 
+/// Unrolls vector.multi_reduction by performing tree reduction with
+/// elementwise arith operations first, then a single multi_reduction
+/// per non-reduced tile position. This avoids generating long chains of
+/// multi_reduction ops (as the upstream pattern does) and is more efficient.
+///
+/// Example:
+/// vector.multi_reduction <32,64> to <32> (tile_shape=32, 32)
+/// -- Upstream pattern generates:
+/// %tmp1 = vector.multi_reduction %tile0, %zero_acc <32,32> to <32>
+/// %res = vector.multi_reduction %tmp1, %tile1 <32,32> to <32>
+/// -- This pattern generates:
+/// %tmp1 = arith.reduction %tile0, %tile1 <32,32> -> <32x,2> // elementwise
+/// %res = vector.multi_reduction %tmp1, %zero_acc <32,32> to <32>
+///
+/// The patterns supports any-D vectors but only handles the case where there
+/// is a single reduction dimension that is the innermost dim.
----------------
Jianhui-Li wrote:

  The tree reduction structure is unnecessary here. Both sequential reduction
  (tile0 + tile1 → tmp; tmp + tile2 → ...) and tree reduction
  ((tile0 + tile1), (tile2 + tile3) → ...) require exactly N-1 elementwise
  operations. There is no saving in total step count, and since these ops execute
  sequentially on the hardware, there is no latency benefit either.
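
  The op-count claim can be checked with a small standalone sketch. It is only
  an illustration, not code from the PR: each tile is modelled as a single
  float, and the names Combined, combineSequential and combineTree are
  hypothetical.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Combined value plus the number of elementwise operations used.
// Assumption: each "tile" is modelled as one float to keep the sketch short.
struct Combined {
  float value;
  std::size_t ops;
};

// Sequential chain: ((t0 + t1) + t2) + ...
Combined combineSequential(const std::vector<float> &tiles) {
  Combined result{tiles.empty() ? 0.0f : tiles.front(), 0};
  for (std::size_t i = 1; i < tiles.size(); ++i) {
    result.value += tiles[i];
    ++result.ops;
  }
  return result;
}

// Pairwise tree: (t0 + t1), (t2 + t3), ... repeated until one tile remains.
Combined combineTree(std::vector<float> tiles) {
  if (tiles.empty())
    return {0.0f, 0};
  std::size_t ops = 0;
  while (tiles.size() > 1) {
    std::vector<float> next;
    for (std::size_t i = 0; i + 1 < tiles.size(); i += 2) {
      next.push_back(tiles[i] + tiles[i + 1]);
      ++ops;
    }
    if (tiles.size() % 2 == 1)
      next.push_back(tiles.back()); // an odd tile carries over unchanged
    tiles = std::move(next);
  }
  return {tiles.front(), ops};
}
```

  For any N >= 1 both functions report N-1 operations; only the grouping of
  the additions differs.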

  The real issue with the upstream unrolling is the order of operations: it performs the
  in-tile reduction first (collapsing the reduction dimension within each tile via
  multi_reduction), then chains the cross-tile accumulation. This is expensive on GPU because
  in-tile reduction involves cross-lane shuffles, and doing it N times multiplies that
  cost.

  The fix should be simpler:

  1. Cross-tile reduction first: collect all tiles along the reduction dimension and
  combine them with elementwise arith ops (cheap, no shuffles). This collapses N tiles
  into a single tile per position.
  2. In-tile reduction last: perform one multi_reduction per position to eliminate the
  reduction dimension. This is the only step that requires cross-lane shuffles, and it
  happens exactly once per output tile.
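
  The two steps above can be sketched on the host as follows. This is a plain
  C++ model, not MLIR or XeGPU code: Tile, Result, reduceInnermost, inTileFirst
  and crossTileFirst are hypothetical names, reduceInnermost stands in for
  vector.multi_reduction over the innermost dim, and the counter records how
  many such in-tile reductions each ordering needs for one output tile.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tile = std::vector<std::vector<float>>; // [row][col]; col is the reduced dim

struct Result {
  std::vector<float> rowSums;
  std::size_t inTileReductions; // stand-in for the shuffle-heavy step
};

// Models one in-tile reduction along the innermost dim with an accumulator.
std::vector<float> reduceInnermost(const Tile &tile, std::vector<float> acc) {
  assert(acc.size() == tile.size());
  for (std::size_t r = 0; r < tile.size(); ++r)
    for (float v : tile[r])
      acc[r] += v;
  return acc;
}

// Upstream-style order: reduce inside every tile, chaining the accumulator.
Result inTileFirst(const std::vector<Tile> &tiles, std::size_t rows) {
  Result res{std::vector<float>(rows, 0.0f), 0};
  for (const Tile &tile : tiles) {
    res.rowSums = reduceInnermost(tile, res.rowSums);
    ++res.inTileReductions;
  }
  return res;
}

// Suggested order: elementwise combine across tiles, then one in-tile reduction.
Result crossTileFirst(const std::vector<Tile> &tiles, std::size_t rows) {
  Result res{std::vector<float>(rows, 0.0f), 0};
  if (tiles.empty())
    return res;
  Tile combined = tiles.front();
  for (std::size_t i = 1; i < tiles.size(); ++i) {
    assert(tiles[i].size() == combined.size());
    for (std::size_t r = 0; r < combined.size(); ++r) {
      assert(tiles[i][r].size() == combined[r].size());
      for (std::size_t c = 0; c < combined[r].size(); ++c)
        combined[r][c] += tiles[i][r][c]; // elementwise, no cross-lane traffic
    }
  }
  res.rowSums = reduceInnermost(combined, res.rowSums);
  res.inTileReductions = 1;
  return res;
}
```

  With N tiles along the reduction dimension, inTileFirst performs N in-tile
  reductions and crossTileFirst performs one, while both produce the same sums
  (up to floating-point reassociation in the general case).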

  For multiple reduction dimensions, the same principle applies: reduce each reduction
  dimension down to tile size using elementwise ops first, then perform the final in-tile
  reduction to remove those dimensions entirely. This also removes the limitation of
  this PR, which only handles a single reduction dimension that is the innermost dim.
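
  As a rough illustration of how this generalizes, the sketch below counts the
  operations per output tile for an arbitrary set of reduction dimensions. It
  is an assumption-laden model rather than part of the PR: UnrollPlan and
  planUnroll are hypothetical names, and every dimension is assumed to be
  evenly divisible by its tile size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct UnrollPlan {
  std::size_t outputTiles;                   // tiles along non-reduced dims
  std::size_t elementwiseOpsPerOutputTile;   // cross-tile combines
  std::size_t inTileReductionsPerOutputTile; // final multi_reduction count
};

// Counts the work for "cross-tile elementwise first, in-tile reduction last".
// isReduced[i] marks whether dimension i is a reduction dimension.
UnrollPlan planUnroll(const std::vector<std::size_t> &shape,
                      const std::vector<std::size_t> &tileShape,
                      const std::vector<bool> &isReduced) {
  assert(shape.size() == tileShape.size());
  assert(shape.size() == isReduced.size());
  std::size_t outputTiles = 1;
  std::size_t tilesPerOutput = 1;
  for (std::size_t i = 0; i < shape.size(); ++i) {
    assert(shape[i] > 0 && tileShape[i] > 0);
    assert(shape[i] % tileShape[i] == 0); // assumption: even tiling
    std::size_t numTiles = shape[i] / tileShape[i];
    if (isReduced[i])
      tilesPerOutput *= numTiles; // these tiles are combined elementwise
    else
      outputTiles *= numTiles;
  }
  // Combining k tiles takes k-1 elementwise ops; one in-tile reduction then
  // removes all reduction dimensions at once.
  return {outputTiles, tilesPerOutput - 1, 1};
}
```

  For the example in the doc comment (<32,64> with tile shape <32,32>,
  reducing dim 1) this gives one output tile, one elementwise op and one
  in-tile reduction.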

https://github.com/llvm/llvm-project/pull/198307