[Mlir-commits] [mlir] [mlir][scf] Extend fuse producer to multi-level candidates case (PR #97803)
llvmlistbot at llvm.org
Thu Sep 19 00:28:59 PDT 2024
================
@@ -949,6 +949,145 @@ mlir::scf::tileAndFuseProducerOfSlice(
tileAndFuseResult->tiledOps};
}
+/// Get the real producer from candidate ExtractSliceOp
+///
+/// ```
+/// %0 = producer
+/// %1 = scf.for(%arg1 = %0)
+/// %2 = extract %arg1
+/// %3 = scf.for(%arg2 = %2)
+/// %4 = extract %arg2
+/// ...
+/// ```
+///
+/// @param candidateSliceOp: %4 = extract %arg2
+/// @param backwardSlice: in-out parameter populated by backward extractSliceOps
+/// @return OpResult Producer : %0 = producer
+static FailureOr<OpResult> getRealProducerFromExtractSliceOp(
----------------
Yun-Fly wrote:
> I dont know if this has been rebased on top of what was submitted.
Not rebased yet, because I think it is more important to reach agreement on the solution before pushing...
> you need to somewhere keep track of the sequence of extract slices that you need to walk to get the producer because the actual offset and size you need is obtained by "combining" the offsets and sizes of all the slices to get the "real offset" and size.
Yes, judged by the final result, the real offset is indeed the combination of all the slice offsets and sizes. However, there is something else we need to address, like what I explained above regarding the loop transform: every time two `extract_slice` ops (interleaved by a `scf.for`) are merged, the enclosing `scf.for` (its yielded value, etc.) needs to be transformed as well. Isn't the amount of state that has to be carried similar to what I do in the iterative fashion?
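To make the "combining" concrete, here is a minimal sketch (abbreviated pseudo-IR; the value names, offsets and tile sizes are illustrative, not taken from this PR): chaining the two `extract_slice` ops is equivalent to one slice of the original producer whose offsets and sizes are composed level by level.
```
%2 = tensor.extract_slice %arg1[%i, 0] [32, 64] [1, 1]   // outer-loop slice of %0
%4 = tensor.extract_slice %arg2[0, %j] [32, 16] [1, 1]   // inner-loop slice of %2
// equivalent "real" slice of the original producer %0:
//   offsets = (%i + 0, 0 + %j), sizes = (32, 16)
```
But each merge step still has to update the enclosing `scf.for` (iter_args, yielded value), which is exactly the state the iterative version carries.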
> Also I am not convinced that fusing the first extract slice + producer and then doing the second extract slice + producer is not feasible. That should be always possible.
It is decided by the concrete semantics of the tileable op, such as the `reduce`/`pack`/`unpack` ops. Let's say we fuse the producer `unpack` into the following loops:
```
// unpack tensor from ABab to AB
%1 = tensor.unpack ... inner_tiles = [32, 32] ... : tensor<2x2x32x32xbf16> -> tensor<64x64xbf16>
scf.for(0, 2) { // tile A dimension
  extract_slice1 // tileSize<32, 64>
  scf.for(0, 4) { // tile B dimension
    extract_slice2 // tileSize<32, 16>
    ...
  }
}
```
As you can see, the `tileSize` coming from `extract_slice2` is `<32, 16>`, but `tensor.unpack` prefers the perfect-tiling case, i.e. the tile size should be exactly divisible by `inner_tiles`. So it may not be feasible in this case.
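As a rough sketch of the difference (pseudo-IR with made-up names, only for illustration): fusing at `extract_slice1` keeps whole `32x32` inner blocks, while fusing at `extract_slice2` would cut an inner block in half.
```
// fuse at extract_slice1 (tile <32, 64> of the AB result): the tile covers whole
// 32x32 inner blocks of the ABab layout, so the tiled unpack is still a perfect tiling
%u1 = tensor.unpack %packed_slice ... inner_tiles = [32, 32] ... -> tensor<32x64xbf16>

// fuse at extract_slice2 (tile <32, 16>): 16 is not divisible by inner_tiles[1] = 32,
// so the tiled unpack would have to produce a partial 32x16 piece of a 32x32 inner
// block (imperfect tiling)
```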
BTW, fusing the consumer `reduce` may be more intuitive:
```
%0 = scf.for(0, 2) { // tile 128 dimension
  scf.for(0, 4) { // tile 256 dimension
    ...
    insert_slice2 // tileSize<64, 64>
  }
  insert_slice1 // tileSize<64, 256>
}
%2 = linalg.reduce { arith.addf } ins(%0 : tensor<128x256xf32>) outs(%1: tensor<128xf32>) dimensions = [1]
```
We could not further fuse `reduce` into the inner-most insert_slice; otherwise it would lead to a partial reduce (that's another topic).
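A rough sketch of why (pseudo-IR, illustrative names only): at `insert_slice1` the whole reduced dimension (256) is available in the tile, so the fused reduce is complete; at `insert_slice2` only 64 of the 256 elements are present, so the fused op could only compute a partial sum that would still need to be accumulated across the inner loop.
```
// fuse at insert_slice1 (tile <64, 256>): the reduced dimension is complete
%r = linalg.reduce { arith.addf } ins(%tile : tensor<64x256xf32>)
                                  outs(%acc : tensor<64xf32>) dimensions = [1]

// fuse at insert_slice2 (tile <64, 64>): only a quarter of the reduced dimension is
// in the tile, so this would be a partial reduce that must be accumulated across
// the inner loop iterations
```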
> So if you start with this
> you should always be able to do
> and then you do
Yes, it is exactly what I do now. IIRC, that is also what you suggested before for nested consumer fusion...
https://github.com/llvm/llvm-project/pull/97803
More information about the Mlir-commits mailing list