[Mlir-commits] [mlir] [mlir][scf] Extend fuse producer to multi-level candidates case (PR #97803)

Wed Sep 18 00:16:19 PDT 2024

================
@@ -949,6 +949,145 @@ mlir::scf::tileAndFuseProducerOfSlice(
                                            tileAndFuseResult->tiledOps};
 }
 
+/// Get the real producer from candidate ExtractSliceOp
+///
+/// ```
+/// %0 = producer
+/// %1 = scf.for(%arg1 = %0)
+///   %2 = extract %arg1
+///   %3 = scf.for(%arg2 = %2)
+///      %4 = extract %args2
+///      ...
+/// ```
+///
+/// @param candidateSliceOp: %4 = extract %args2
+/// @param backwardSlice: in-out parameter populated by backward extractSliceOps
+/// @return OpResult Producer : %0 = producer
+static FailureOr<OpResult> getRealProducerFromExtractSliceOp(
----------------
Yun-Fly wrote:

Hi, @MaheshRavishankar. Since PR #108318 has been broken down into three parts:
1. support single nested `scf.for` (merged, great thanks for your review!)
2. multi-level candidates
3. multiple users

I think the second part is similar to this patch, no matter producer or consumer fusion. In general, this patch focus on how to deal with consecutive candidates **interleaved by a `scf.for`**.

Lets look back what the major argument remains so far to speed up our next stage of review:
1. why need to look through the total chain of candidate slice?
2. why not merge two consecutive candidates?

The answer is listed in above several threads. The root cause behind are two different solutions to solve multi-level candidates cases:
1. merge consecutive candidates in advance.
2. iteratively fuse producer/consumer with one of candidates step by step(from outer to inner).

IMO, the most concern about the first solution is that it may introduce too many additional transform regarding `scf.for`. E.g..

```
%0 = producerOp
scf.for(%1=%0)
   %2 = extract_slice %1
   scf.for(%3=%2)
      %4 = extract_slice %3
      ... 
      yield %x
```

As I explained above, this scenario is a little different from what [`MergeConsecutiveExtractSlice`](https://github.com/iree-org/llvm-project/blob/c9f6518f742c88bda309d5331e0a5d4664387f94/mlir/lib/Dialect/Tensor/Transforms/MergeConsecutiveInsertExtractSlicePatterns.cpp#L23) expects due to `scf.for`. 

If we just merge consecutive extract slice(BTW, here we also need look through to penetrate `scf.for` ), the possible resultant IR looks like below:
```
%0 = producerOp
scf.for(%1=%0)
   %2 = extract_slice %1
   scf.for(%3=%2)
      %4 = extract_slice %1 [newOffsets] [newSizes]
      ... 
      yield %x
```
Then the problem is that how we modify new `inits` of `scf.for` with latest `dpsInit` of fused `producerOp`? ([`getUntiledProducerFromSliceSource`](https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp#L1058) will fail to find `destinationIterArg`)

On the other hand, if we modify `%3=%2` at the same time, the resultant IR would become:
```
%0 = producerOp
scf.for(%1=%0)
   %2 = extract_slice %1
   scf.for(%3=%1)
      %4 = extract_slice %3 [newOffsets] [newSizes]
      ... 
      yield %x_new
```
First of all, this transform has already introduced more code change and there is an assumption that `%3` has only single one user. Moreover, considering the semantic of `scf.for`, it seems that we also need to modify yield value `%x`(and even its producer, like `insert_slice`) to make the result of `scf.for` matched with its `inits` . 

In summary, I am afraid that this kind of complexity is much more than the second solution based on simple iterative process of reusing existed function.


https://github.com/llvm/llvm-project/pull/97803