[Mlir-commits] [mlir] [mlir][scf] Extend consumer fuse to nested loop structure (PR #94190)

Wed Jun 5 17:36:11 PDT 2024

Yun-Fly wrote:

Hi @nicolasvasilache @MaheshRavishankar, I'll try to reply to both of you in one thread.

> this should be done by multiple application of existing transformations

Could you elaborate, with an example, on how to apply multiple existing transformations?

> First tile the consumer
> ....
> then you fuse %0 within the scf.for nest that is created during tiling of consumer to get

1. The difference is the fusion direction: consumer-to-producer or producer-to-consumer. IMO, these are two different but equally feasible approaches to the fusion transform. In general, both should be functionally enabled, with an option for users to choose case by case. I guess what you mean here is `tileConsumerAndFuseProducersUsingSCF`, which uses `tileAndFuseProducerOfSlice`. This patch targets the counterpart path, `tileAndFuseConsumerOfSlice`, the same path as the previously merged [PR](https://github.com/llvm/llvm-project/pull/88712), which does not currently support nested loop structures.
2. From a tiling perspective, the major difference between consumer-to-producer and producer-to-consumer is which op takes priority in deciding how the iteration domain is partitioned into tiles. For instance, if we tile the consumer first and then fuse the producer, as you illustrated:
   a. the tile sizes of the producer are derived from the tiled consumer by tiling propagation based on `AffineMap`.
   b. the producer is forced to fit the iteration domain already generated by the consumer, which may introduce redundant loop iterations.
3. Based on `2`, a typical use case where producer-to-consumer may be more suitable than consumer-to-producer is `matmul+post-op` fusion. As you know, `matmul` is compute-intensive, and many developers have a strong demand for hand-written, user-defined templates with nested and complex loops that handle multi-level tile sizes for peak performance, on both GPU and CPU. If we start fusion by tiling the post-op (like relu), the `matmul` computation has to conform to a loop structure dictated by an elementwise operation.
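
To make the two directions concrete, here is a rough sketch in plain Python rather than MLIR. All function names and tile sizes are hypothetical and only model the resulting loop structure; none of this is the actual SCF tiling API. In `consumer_first` the loop nest comes from tiling relu's `(M, N)` domain and the matmul tile is derived from each relu tile. In `producer_first` the loop nest is a hand-chosen matmul template (two-level tiling on `M` here, purely as an assumption) and relu is fused into its innermost loop.

```python
def matmul_tile(A, B, i0, i1, j0, j1):
    """Compute rows [i0, i1) and columns [j0, j1) of A @ B."""
    K = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(j0, j1)]
            for i in range(i0, i1)]


def relu_tile(tile):
    """Elementwise post-op applied to one tile."""
    return [[max(x, 0) for x in row] for row in tile]


def write_tile(out, tile, i0, j0):
    """Insert a tile into the full result (stands in for an insert_slice)."""
    for di, row in enumerate(tile):
        out[i0 + di][j0:j0 + len(row)] = row


def consumer_first(A, B, tm, tn):
    """Tile relu first; the matmul tile is dictated by each relu tile."""
    M, N = len(A), len(B[0])
    out = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tm):
        for j0 in range(0, N, tn):
            i1, j1 = min(i0 + tm, M), min(j0 + tn, N)
            tile = matmul_tile(A, B, i0, i1, j0, j1)  # producer fused into consumer loops
            write_tile(out, relu_tile(tile), i0, j0)
    return out


def producer_first(A, B, tm_outer, tm_inner, tn):
    """Keep a nested matmul loop template; fuse relu into the innermost loop."""
    M, N = len(A), len(B[0])
    out = [[0] * N for _ in range(M)]
    for io in range(0, M, tm_outer):                            # outer M tiles
        for ii in range(io, min(io + tm_outer, M), tm_inner):   # inner M tiles
            for j0 in range(0, N, tn):
                i1 = min(ii + tm_inner, io + tm_outer, M)
                j1 = min(j0 + tn, N)
                tile = matmul_tile(A, B, ii, i1, j0, j1)
                write_tile(out, relu_tile(tile), ii, j0)        # consumer fused here
    return out


A = [[1, -2], [3, 4], [-5, 6]]
B = [[1, 0, -1], [2, 1, 0]]
# Both directions compute the same values; only the loop structure differs.
assert consumer_first(A, B, 2, 2) == producer_first(A, B, 2, 1, 2)
```

The results are identical; the point is only which op owns the loop nest. In the second form the matmul template stays as written and the post-op adapts to it, which is the case this patch is meant to enable.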

Again, this patch is an extension of the already merged [PR](https://github.com/llvm/llvm-project/pull/88712), which also involves producer-to-consumer fusion.

CC: @ZhennanQin.

https://github.com/llvm/llvm-project/pull/94190