[Mlir-commits] [mlir] [mlir][Linalg] Deprecate `linalg::tileToForallOp` and `linalg::tileToForallOpUsingTileSizes` (PR #91878)

Fri Jun 14 00:08:52 PDT 2024

MaheshRavishankar wrote:

> After deeper investigation, I see, the tests that change are not the ones with "num_threads", all changed tests are only related to the classical "tile_sizes" case...
> 
> Before this change, when tiling with tile sizes, we used to call:
> 
> ```
> return tileToForallOpImpl(b, op, numThreads,
>                              /*nominalTileSizes=*/tileSizes, mapping,
>                              /*omitTileOffsetBoundsCheck=*/true);
> ```
> 
> I.e. we explicitly omitted the affine.max because we statically know that tiling with tile sizes does not require it. This is the same reasoning for tiling with scf.for: we do not emit affine.max on the lower bound.
> 
> This change introduces affine.max in "regular tiling" with tile sizes and I believe this is redundant and unnecessary.
> 
> @MaheshRavishankar coming back to you point about:
> 
> ```
> Re 3 : I dont think there is any extra divisions. Most of the lit tests stay the same. The only ones that changed are the dynamic shape tests which I think were originally wrong (You cant really avoid generating the max operation).
> ```
> 
> Can you explain why you think they were originally wrong? My position is that you can avoid generating the max operation when tiling scf.forall with tile sizes for the same reason that you don't generate affine.max when tiling scf.for ..
> 
> Edit: this is the minimal test that now introduces an affine.max that was not here before:
> 
> ```
> func.func @matmul_tile_size_dynamic(%A: tensor<?x?xf32>, %B: tensor<?x?xf32>, %C: tensor<?x?xf32>) -> tensor<?x?xf32> {
>   %0 = linalg.matmul ins(%A, %B : tensor<?x?xf32>, tensor<?x?xf32>)
>                     outs(%C : tensor<?x?xf32>) -> (tensor<?x?xf32>)
>   return %0 : tensor<?x?xf32>
> }
> 
> module attributes {transform.with_named_sequence} {
>   transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
>     %0 = transform.structured.match ops{["linalg.matmul"]} in %arg1 : (!transform.any_op) -> !transform.any_op
>     %sz = transform.param.constant 10 : i64 -> !transform.param<i64>
>     %1:2 = transform.structured.tile_using_forall %0 tile_sizes [%sz, 20]
>            : (!transform.any_op, !transform.param<i64>) -> (!transform.any_op, !transform.any_op)
>     transform.yield
>   }
> }
> ```
> 
> I tried to replace `transform.structured.tile_using_forall` with `transform.structured.tile_using_for` on your branch to confirm we have a different behavior (i.e. that we do not produce the `affine.max` in the `scf.for` case) but this results in a crash in TileUsingForOp (I don't know if introduced by this PR, did not have time to investigate more):
> 
> ```
> #11 0x00007f60582315d8 mlir::transform::TileUsingForOp::apply(mlir::transform::TransformRewriter&, mlir::transform::TransformResults&, mlir::transform::TransformState&) /home/nico/llvm-project/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp:2801:7
> #12 0x00007f6058189ca1 mlir::transform::detail::TransformOpInterfaceInterfaceTraits::Model<mlir::transform::TileUsingForOp>::apply(mlir::transform::detail::TransformOpInterfaceInterfaceTraits::Concept const*, mlir::Operation*, mlir::transform::TransformRewriter&, mlir::transform::TransformResults&, mlir::transform::TransformState&) /home/nico/llvm-project/build-Debug/tools/mlir/include/mlir/Dialect/Transform/Interfaces/TransformInterfaces.h.inc:477:56
> #13 0x00007f604995a0fe mlir::transform::TransformOpInterface::apply(mlir::transform::TransformRewriter&, mlir::transform::TransformResults&, mlir::transform::TransformState&) /home/nico/llvm-project/build-Debug/tools/mlir/include/mlir/Dialect/Transform/Interfaces/TransformInterfaces.cpp.inc:61:14
> ```

@nicolasvasilache  I updated the change, now all the lit tests are consistent with move to using `scf::tileUsingSCF`. With regard to the small test above, it works for me with this output
```
#map = affine_map<()[s0] -> (s0 ceildiv 10)>
#map1 = affine_map<()[s0] -> (s0 ceildiv 20)>
#map2 = affine_map<(d0) -> (d0 * 10)>
#map3 = affine_map<(d0) -> (d0 * 20)>
#map4 = affine_map<(d0)[s0] -> (d0 * -10 + s0, 10)>
#map5 = affine_map<(d0)[s0] -> (d0 * -20 + s0, 20)>
module {
  func.func @matmul_tile_size_dynamic(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>, %arg2: tensor<?x?xf32>) -> tensor<?x?xf32> {
    %c1 = arith.constant 1 : index
    %c0 = arith.constant 0 : index
    %dim = tensor.dim %arg0, %c0 : tensor<?x?xf32>
    %dim_0 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
    %dim_1 = tensor.dim %arg1, %c1 : tensor<?x?xf32>
    %0 = affine.apply #map()[%dim]
    %1 = affine.apply #map1()[%dim_1]
    %2 = scf.forall (%arg3, %arg4) in (%0, %1) shared_outs(%arg5 = %arg2) -> (tensor<?x?xf32>) {
      %3 = affine.apply #map2(%arg3)
      %4 = affine.apply #map3(%arg4)
      %5 = affine.min #map4(%arg3)[%dim]
      %6 = affine.min #map5(%arg4)[%dim_1]
      %extracted_slice = tensor.extract_slice %arg0[%3, 0] [%5, %dim_0] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
      %extracted_slice_2 = tensor.extract_slice %arg1[0, %4] [%dim_0, %6] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
      %extracted_slice_3 = tensor.extract_slice %arg5[%3, %4] [%5, %6] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
      %7 = linalg.matmul ins(%extracted_slice, %extracted_slice_2 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%extracted_slice_3 : tensor<?x?xf32>) -> tensor<?x?xf32>
      scf.forall.in_parallel {
        tensor.parallel_insert_slice %7 into %arg5[%3, %4] [%5, %6] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
      }
    }
    return %2 : tensor<?x?xf32>
  }
  module attributes {transform.with_named_sequence} {
    transform.named_sequence @__transform_main(%arg0: !transform.any_op {transform.readonly}) {
      %0 = transform.structured.match ops{["linalg.matmul"]} in %arg0 : (!transform.any_op) -> !transform.any_op
      %1 = transform.param.constant 10 : i64 -> !transform.param<i64>
      %tiled_op, %forall_op = transform.structured.tile_using_forall %0 tile_sizes [%1, 20] : (!transform.any_op, !transform.param<i64>) -> (!transform.any_op, !transform.any_op)
      transform.yield
    }
  }
}
```

And no `affine.max` as you pointed out.

https://github.com/llvm/llvm-project/pull/91878