[Mlir-commits] [mlir] [mlir][PartialReductionTilingInterface] Add support for `ReductionTilingStrategy::PartialReductionOuterParallel` in `tileUsingSCF`. (PR #143988)

llvmlistbot at llvm.org
Sun Jun 22 21:19:00 PDT 2025


================
@@ -348,28 +359,79 @@ getPartialResultAffineMaps(LinalgOp linalgOp,
   return partialReductionMaps;
 }
 
-/// Return the slice of the `initValue` to use as input to the partial reduction
-/// op generated.
-static Operation *getInitSliceForOuterReduction(
-    OpBuilder &b, Location loc, Value initValue, ArrayRef<OpFoldResult> offsets,
+struct InitSliceInfo {
+  SmallVector<int64_t> resultShape;
+  SmallVector<OpFoldResult> offsets;
+  SmallVector<OpFoldResult> sizes;
+  SmallVector<OpFoldResult> strides;
+};
+
+/// Return the result type, offsets, sizes and strides of the slice of the
+/// `initValue` to use as input to the partial reduction op generated with
+/// outer reduction strategy.
+static InitSliceInfo getInitSliceInfoForOuterReduction(
+    MLIRContext *context, ArrayRef<OpFoldResult> offsets,
     ArrayRef<OpFoldResult> sizes, const SetVector<unsigned> &reductionDims,
     AffineMap partialReductionMap) {
   int64_t initRank = partialReductionMap.getNumResults();
   SmallVector<OpFoldResult> initOffsets, initSizes;
-  SmallVector<OpFoldResult> initStrides(initRank, b.getIndexAttr(1));
+  Attribute zero = IntegerAttr::get(IndexType::get(context), 0);
+  Attribute one = IntegerAttr::get(IndexType::get(context), 1);
+  SmallVector<OpFoldResult> initStrides(initRank, one);
   for (AffineExpr dimExpr : partialReductionMap.getResults()) {
     unsigned dim = cast<AffineDimExpr>(dimExpr).getPosition();
     if (reductionDims.contains(dim)) {
-      initOffsets.push_back(b.getIndexAttr(0));
+      initOffsets.push_back(zero);
     } else {
       initOffsets.push_back(offsets[dim]);
     }
     initSizes.push_back(sizes[dim]);
   }
-  // TODO: Use SubsetExtractOpInterface here once available.
-  auto extractSlice = b.create<tensor::ExtractSliceOp>(
-      loc, initValue, initOffsets, initSizes, initStrides);
-  return extractSlice;
+  SmallVector<int64_t> resultShape;
+  std::tie(resultShape, std::ignore) = decomposeMixedValues(initSizes);
+  return {resultShape, initOffsets, initSizes, initStrides};
+}
+
+/// Return the result type, offsets, sizes and strides of the slice of the
+/// `initValue` to use as input to the partial reduction op generated with
+/// outer parallel strategy.
+static InitSliceInfo getInitSliceInfoForOuterParallel(
+    MLIRContext *context, ValueRange ivs, ArrayRef<OpFoldResult> offsets,
+    ArrayRef<OpFoldResult> sizes, const SetVector<unsigned> &reductionDims,
+    AffineMap partialReductionMap) {
+  int64_t initRank = partialReductionMap.getNumResults();
+  SmallVector<OpFoldResult> initOffsets, initSizes;
+  Attribute one = IntegerAttr::get(IndexType::get(context), 1);
+  SmallVector<OpFoldResult> initStrides(initRank, one);
+  SmallVector<OpFoldResult> resultShape;
+  for (AffineExpr dimExpr : partialReductionMap.getResults()) {
+    unsigned dim = cast<AffineDimExpr>(dimExpr).getPosition();
+    if (std::optional<int> dimPos = getPositionIn(reductionDims, dim)) {
+      initOffsets.push_back(ivs[dimPos.value()]);
----------------
MaheshRavishankar wrote:

It is unfortunate that `ivs` is needed. The reason is https://github.com/llvm/llvm-project/blob/b7d0c9b9d8e2b5c5d6677e368e3cdaf438df294e/mlir/test/Dialect/Linalg/transform-tile-reduction.mlir#L96.

Basically, say you start with:

```
 %red = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                                          affine_map<(d0, d1) -> (d0)>],
   iterator_types = ["parallel", "reduction"]}
   ins(%arg0 : tensor<?x?xf32>)
   outs(%out : tensor<?xf32>) {
    ^bb0(%arg7: f32, %arg9: f32):
      %1 = arith.mulf %arg7, %arg7 : f32
      %2 = arith.addf %1, %arg9 : f32
      linalg.yield %2 : f32
    } -> tensor<?xf32>
  return %red : tensor<?xf32>
```

If you then tile the reduction dimension by 5 with the outer-parallel strategy (an `scf.forall` stepping over the reduction dimension), you get something like:
```
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%cst = arith.constant 0.0 : f32
%d0 = tensor.dim %arg0, %c0 : tensor<?x?xf32>
%d1 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
%numTiles = affine.apply affine_map<()[s0] -> (s0 ceildiv 5)>()[%d1]
%intermediate = tensor.empty(%d0, %numTiles) : tensor<?x?xf32>
%fill = linalg.fill ins(%cst : f32) outs(%intermediate : tensor<?x?xf32>) -> tensor<?x?xf32>
%0 = scf.forall (%iv) = (0) to (%d1) step (5) shared_outs(%init = %fill) -> (tensor<?x?xf32>) {
  %tileSize = affine.min affine_map<(d0)[s0] -> (s0 - d0, 5)>(%iv)[%d1]
  %initOffset = affine.apply affine_map<(d0) -> (d0 ceildiv 5)>(%iv)
  %arg1 = tensor.extract_slice %arg0[0, %iv] [%d0, %tileSize] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
  %init_slice = tensor.extract_slice %init[%c0, %initOffset] [%d0, 1] [1, 1] : tensor<?x?xf32> to tensor<?xf32>
  ....
}
```

Now `offsets` is `%iv` above. To index into the `%intermediate` tensor (which is passed into the loop as `%init` and sliced to get `%init_slice`), you need `%initOffset`, which is the induction variable divided by the original tile size.
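
(For concreteness, a minimal C++ sketch of how that offset can be computed with the affine helpers; `getInitOffsetFromIv` is a hypothetical name, not code from this PR.)

```
// Hypothetical helper, not from this PR: materialize
//   initOffset = iv ceildiv tileSize
// makeComposedFoldedAffineApply folds to an attribute when `iv` is a
// constant and creates an affine.apply otherwise.
static OpFoldResult getInitOffsetFromIv(OpBuilder &b, Location loc,
                                        OpFoldResult iv, int64_t tileSize) {
  AffineExpr d0 = b.getAffineDimExpr(0);
  return affine::makeComposedFoldedAffineApply(b, loc, d0.ceilDiv(tileSize),
                                               {iv});
}
```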

Further, `scf.forall` allows specifying the number of threads instead of the tile size (side note: I think we should drop that mode; it is confusing and adds way too much complexity without much benefit, AFAICS). If you specify 5 threads you get:

```
%0 = scf.forall (%iv) in (5) shared_outs(%init = %intermediate) -> (tensor<?x5xf32>) {
  %tileSize = affine.apply affine_map<()[s0] -> (s0 ceildiv 5)>()[%d1]
  %offset = affine.apply affine_map<(d0)[s0] -> (d0 * s0)>(%iv)[%tileSize]
  %tileSizeBounded = affine.min affine_map<(d0)[s0, s1] -> (s1 - d0, s0)>(%offset)[%tileSize, %d1]
  %tileSizeNonZero = affine.max affine_map<(d0) -> (0, d0)>(%tileSizeBounded)
  %arg1 = tensor.extract_slice %arg0[0, %offset] [%d0, %tileSizeNonZero] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
  %init_slice = tensor.extract_slice %init[0, %iv] [%d0, 1] [1, 1] : tensor<?x5xf32> to tensor<?xf32>
  ....
}
```

Here you need to index by `%iv`. 
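
(To summarize, a condensed C++ sketch, with assumed names mirroring the two helpers in the diff above, of how each strategy picks the offset and size along a tiled reduction dimension of the init tensor; not the PR's exact code.)

```
// Condensed sketch with assumed names; not the exact code in the PR.
static std::pair<OpFoldResult, OpFoldResult>
getReductionDimOffsetAndSize(OpBuilder &b, ReductionTilingStrategy strategy,
                             OpFoldResult iterationId,
                             OpFoldResult numPartialResults) {
  if (strategy == ReductionTilingStrategy::PartialReductionOuterReduction) {
    // The slice starts at 0 and spans all partial results; a combining op
    // reduces them after the loop.
    return {b.getIndexAttr(0), numPartialResults};
  }
  // PartialReductionOuterParallel: each thread owns a single slot of the
  // intermediate tensor, indexed by its iteration ID.
  return {iterationId, b.getIndexAttr(1)};
}
```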

But your comment made me realize two things. (1) I had a bug in my implementation for the case where the distribution was done through tile sizes and not the number of threads; fixed that now and added tests. (2) I also changed the interface to not use the induction variables directly, but for `PartialReductionOuterParallel` you still need to know what the "iteration ID" is, so I plumbed that through now.

This is a bit involved, but I am happy to talk about this offline.
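
(For anyone following along, roughly how one opts into this through `tileUsingSCF`; the option spellings below are my best guess from this PR series and may differ.)

```
// Rough sketch, assumed option spellings: tile the reduction dimension by 5
// with the outer-parallel strategy, producing an scf.forall computing
// partial results plus a merge op.
scf::SCFTilingOptions options;
options.setLoopType(scf::SCFTilingOptions::LoopType::ForallOp);
options.setReductionTilingStrategy(
    ReductionTilingStrategy::PartialReductionOuterParallel);
options.setTileSizes({rewriter.getIndexAttr(0), rewriter.getIndexAttr(5)});
FailureOr<scf::SCFTilingResult> tilingResult =
    scf::tileUsingSCF(rewriter, tilingInterfaceOp, options);
```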



https://github.com/llvm/llvm-project/pull/143988

