[Mlir-commits] [mlir] [MLIR][Linalg] Fix insert_slice fusion with rank reduction (PR #130961)

Thomas Preud'homme llvmlistbot at llvm.org
Wed May 7 04:50:22 PDT 2025


RoboTux wrote:

> > Thanks for the updates, Thomas!
> > I've left a few smallish comments, but before addressing those, it would be good to make sure that we do indeed need `tensor.collapse_shape` (as opposed to _a rank reducing_ `tensor.extract_slice`). I am sure that you thought it through and there's a good reason to insert `collapse_shape`, but it's not obvious to me just yet 😅
> 
> That's a very good question and I'll need to get back to you before I can answer fully. To be clear, the fusion does not insert a new extract_slice; that is done by another transform. I don't even think it was there in the earlier example, and I don't know why there's a new linalg either, so I will look into understanding that. Normally there is an extract_slice that might happen to collapse some unit dimension, which then causes a type mismatch. This patch ensures we do the proper collapse_shape after the extract_slice is fused, to reproduce that unit dimension collapsing and make the types match. By the way, the `if` is there because the pass requires the producing linalg and the consuming linalg (the one that consumes the extract_slice) to be in different blocks; the bug I was facing involved a loop, but I used an `if` here to keep the code simpler.

This is answered in my other comment: https://github.com/llvm/llvm-project/pull/130961#discussion_r2077407969 but I'll rephrase it here in case that comment wasn't clear enough.

First of all, the fusion is worth doing in the case of a rank-reducing extract_slice even if an expand_shape is involved, because the producer is moved inside the loop and operates directly on rank-reduced buffers. The only case where it wouldn't be worth it is if the extract_slice were only used to do the rank reduction, but that would be strange IR in my opinion; a collapse_shape should have been used instead in that case.
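For illustration, here is a minimal sketch (hypothetical values `%t`, `%a`, `%b`, not taken from the patch) of an extract_slice whose only purpose is to drop a unit dimension, next to the collapse_shape that would express the same thing more directly:

```
// extract_slice covering the whole tensor, used only to drop the leading unit dimension
%a = tensor.extract_slice %t[0, 0] [1, 4] [1, 1] : tensor<1x4xf32> to tensor<4xf32>
// the same rank reduction, expressed directly as a collapse_shape
%b = tensor.collapse_shape %t [[0, 1]] : tensor<1x4xf32> into tensor<4xf32>
```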

Second, the collapse_shape is used because folding the rank reduction into the extract_slice operations on the producer's operands would add a lot of complexity. This is because the producing linalg could be something like:

```
#map0 = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0, d1) -> (d0, d0 + d1)>
%res = linalg.generic
    {indexing_maps = [#map0, #map1, #map0], iterator_types = ["parallel", "parallel"]}
    ins(%0, %1 : tensor<1x4xf32>, tensor<1x4xf32>) outs(%2 : tensor<1x4xf32>) {
^bb0(%in : f32, %in1 : f32, %out : f32):
    %result = arith.addf %in, %in1 : f32
    linalg.yield %result : f32
} -> tensor<1x4xf32>
// followed by loop with consumer
```

If dimension #1 were used in a loop of step 2 and dimension #0 were rank-reduced, one would at the very least need to remove the dropped dimension from the indexing_maps, as follows:

```
#map0 = affine_map<(d1) -> (d1)>
#map1 = affine_map<(d1) -> (0 + d1)>

// the following is inside the loop where the consumer is
%2 = tensor.empty() : tensor<2xf32>
%3 = tensor.extract_slice %0[0, %iv] [1, 2] [1, 1] : tensor<1x4xf32> to tensor<2xf32>
%5 = tensor.extract_slice %1[0, %iv] [1, 2] [1, 1] : tensor<1x4xf32> to tensor<2xf32>
%6 = linalg.generic
    {indexing_maps = [#map0, #map1, #map0], iterator_types = ["parallel"]}
    ins(%3, %5 : tensor<2xf32>, tensor<2xf32>) outs(%2 : tensor<2xf32>) {
^bb0(%in : f32, %in1 : f32, %out : f32):
    %result = arith.addf %in, %in1 : f32
    linalg.yield %result : f32
} -> tensor<2xf32>
```

However, the pass does not operate on GenericOp but on LinalgOp, which does not allow modifying indexing maps since it can be an interface to a named linalg op with implicit indexing maps. So one would need to distinguish between GenericOp and other LinalgOps. I also feel that folding the rank reduction earlier is a further optimization, better suited to a dedicated pass if there isn't one already.
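To show what the patch produces instead, here is a rough sketch of the IR after fusion (made-up SSA names, #mapA/#mapB are just the original 2-D maps from the first snippet, and this is not the literal output of the pass): the operand slices keep the unit dimension, the producer keeps its original indexing maps, and a collapse_shape reproduces the rank reduction so the type matches what the consumer expects:

```
#mapA = affine_map<(d0, d1) -> (d0, d1)>
#mapB = affine_map<(d0, d1) -> (d0, d0 + d1)>

// inside the loop, after fusion: operand slices are not rank-reduced
%s0 = tensor.extract_slice %0[0, %iv] [1, 2] [1, 1] : tensor<1x4xf32> to tensor<1x2xf32>
%s1 = tensor.extract_slice %1[0, %iv] [1, 2] [1, 1] : tensor<1x4xf32> to tensor<1x2xf32>
%init = tensor.empty() : tensor<1x2xf32>
// the producer keeps its original 2-D indexing maps unchanged
%fused = linalg.generic
    {indexing_maps = [#mapA, #mapB, #mapA], iterator_types = ["parallel", "parallel"]}
    ins(%s0, %s1 : tensor<1x2xf32>, tensor<1x2xf32>) outs(%init : tensor<1x2xf32>) {
^bb0(%in : f32, %in1 : f32, %out : f32):
    %r = arith.addf %in, %in1 : f32
    linalg.yield %r : f32
} -> tensor<1x2xf32>
// collapse_shape reproduces the rank reduction of the original rank-reducing extract_slice
%collapsed = tensor.collapse_shape %fused [[0, 1]] : tensor<1x2xf32> into tensor<2xf32>
```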

Does that make sense to you?

https://github.com/llvm/llvm-project/pull/130961

