[Mlir-commits] [mlir] [MLIR][SCF] Add an API to fuse consumer to a producer within scf loop (PR #88712)

Wed May 29 08:37:52 PDT 2024

================
@@ -160,6 +215,21 @@ struct LinalgOpTilingInterface
     return success();
   }
 
+  FailureOr<TilingResult> getTiledImplementationFromOperandTile(
+      Operation *op, OpBuilder &b, unsigned operandNumber,
+      ArrayRef<OpFoldResult> offsets, ArrayRef<OpFoldResult> sizes) const {
+    SmallVector<OpFoldResult> mappedOffsets, mappedSizes;
+    auto tilingInterfaceOp = cast<TilingInterface>(op);
+    if (failed(tilingInterfaceOp.getIterationDomainTileFromOperandTile(
+            b, operandNumber, offsets, sizes, mappedOffsets, mappedSizes))) {
+      return emitError(
+          op->getLoc(),
+          "unable to obtain the iter domain position of the operation.");
+    }
+    return tilingInterfaceOp.getTiledImplementation(b, mappedOffsets,
+                                                    mappedSizes);
----------------
ftynse wrote:

> I'm not sure about the current implementation, but I could see this being the case for tensor.pad. Tiling based on the operand must produce the entire padded tensor near the boundaries (and thus can always tile to another tensor.pad), but tiling based on the full iteration space of the result could instead produce scf.if { } else to avoid dynamically zero-sized tensors. In other words, tiling based on the operand could be a restricted subset of tiling for the whole operation and thus want a different implementation.

I don't completely follow this. I suppose I don't understand what you mean by "tiling based on the operand must produce the entire padded tensor" here. What this seems to imply is that a tileable operation does not necessarily tile to a smaller copy of itself, so a `tensor.pad` could, e.g., tile to a `linalg.fill` for the result slice that entirely consists of padded values. 
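
To make the "slice that consists entirely of padded values" case concrete, here is a minimal sketch in plain C++. It is not MLIR code and not the actual `tensor.pad` tiling implementation. It assumes a hypothetical 1-D model in which the padded result is laid out as `[low padding | source | high padding]` and the requested tile lies within the result bounds. The names `PadTile` and `tilePadResult` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical 1-D model of a padded result: [low padding | source | high padding].
// Describes how one result tile splits into padding and copied source elements.
struct PadTile {
  int64_t lowPad;    // padded elements at the start of the tile
  int64_t srcOffset; // offset into the source of the first copied element
  int64_t srcSize;   // number of elements copied from the source
  int64_t highPad;   // padded elements at the end of the tile

  // True when the tile reads nothing from the source, i.e. it could be
  // materialized as a plain fill of the padding value.
  bool isPurePadding() const { return srcSize == 0; }
};

// Splits the result tile [offset, offset + size) for a source of length
// `srcLen` preceded by `low` padded elements. Assumes the tile lies within
// the padded result.
PadTile tilePadResult(int64_t low, int64_t srcLen, int64_t offset,
                      int64_t size) {
  assert(low >= 0 && srcLen >= 0 && offset >= 0 && size >= 0);
  const int64_t tileEnd = offset + size;
  const int64_t srcEnd = low + srcLen; // end of the source in result coordinates

  PadTile tile;
  // Part of the tile that falls before the source.
  tile.lowPad = std::min<int64_t>(size, std::max<int64_t>(0, low - offset));
  // Part of the tile that falls after the source.
  tile.highPad = std::min<int64_t>(size, std::max<int64_t>(0, tileEnd - srcEnd));
  // Whatever remains is copied from the source.
  tile.srcSize = size - tile.lowPad - tile.highPad;
  tile.srcOffset =
      std::min<int64_t>(srcLen, std::max<int64_t>(0, offset - low));
  return tile;
}
```

In this model a tile that lies wholly inside the low or high padding has `srcSize == 0`, which corresponds to the case where the tiled operation need not be a smaller `tensor.pad` at all.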

I would argue that it is still possible to achieve this with two separate methods as described above, both of which would emit `scf.if` with conditions based on the position in the iteration space, but that could result in suboptimal IR if the canonicalizer / DCE is not invoked. It still seems strange to me that we should push the burden onto the interface implementation instead of, well, improving the canonicalizer, but at least it makes some sense conceptually.

Thanks for the example, by the way. If somebody had provided it in response to my question about this method being a trivial composition of two other methods (https://github.com/llvm/llvm-project/pull/88712/files#r1604607877), it would have saved a lot of time.

https://github.com/llvm/llvm-project/pull/88712