[Mlir-commits] [mlir] Refactor LoopFuseSiblingOp and support parallel fusion (PR #94391)

Quinn Dawkins llvmlistbot at llvm.org
Tue Jul 2 20:44:01 PDT 2024


qedawkins wrote:

This patch appears to be introducing some dominance errors, e.g.
```mlir
#map = affine_map<(d0) -> (d0 * 32)>
#map1 = affine_map<(d0, d1) -> (d0, d1)>
module {
  func.func @loop_sibling_fusion(%arg0: tensor<128xf32>, %arg1: tensor<128x128xf16>, %arg2: tensor<128x64xf32>, %arg3: tensor<128x128xf32>) {
    %0 = scf.forall (%arg4) in (4) shared_outs(%arg5 = %arg0) -> (tensor<128xf32>) {
      %3 = affine.apply #map(%arg4)
      %extracted_slice = tensor.extract_slice %arg3[%3, 0] [32, 1] [1, 1] : tensor<128x128xf32> to tensor<32xf32>
      scf.forall.in_parallel {
        tensor.parallel_insert_slice %extracted_slice into %arg5[%3] [32] [1] : tensor<32xf32> into tensor<128xf32>
      }   
    } {mapping = [#gpu.warp<linear_dim_0>]}
    %1 = tensor.empty() : tensor<128x128xf16>
    %2 = scf.forall (%arg4) in (4) shared_outs(%arg5 = %arg1) -> (tensor<128x128xf16>) {
      %3 = affine.apply #map(%arg4)
      %extracted_slice = tensor.extract_slice %arg3[%3, 0] [32, 128] [1, 1] : tensor<128x128xf32> to tensor<32x128xf32>
      %extracted_slice_0 = tensor.extract_slice %1[%3, 0] [32, 128] [1, 1] : tensor<128x128xf16> to tensor<32x128xf16>
      %4 = linalg.generic {indexing_maps = [#map1, #map1], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice : tensor<32x128xf32>) outs(%extracted_slice_0 : tensor<32x128xf16>) {
      ^bb0(%in: f32, %out: f16):
        %5 = arith.truncf %in : f32 to f16 
        linalg.yield %5 : f16 
      } -> tensor<32x128xf16>
      scf.forall.in_parallel {
        tensor.parallel_insert_slice %4 into %arg5[%3, 0] [32, 128] [1, 1] : tensor<32x128xf16> into tensor<128x128xf16>
      }   
    } {mapping = [#gpu.warp<linear_dim_0>]}
    return
  }
}

module attributes { transform.with_named_sequence } { 
  transform.named_sequence @__transform_main(%root: !transform.any_op) {
    %loops = transform.structured.match ops{["scf.forall"]} in %root : (!transform.any_op) -> !transform.any_op
    %loop1, %loop2 = transform.split_handle %loops : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
    %loop3 = transform.loop.fuse_sibling %loop1 into %loop2 : (!transform.any_op, !transform.any_op) -> !transform.any_op
    transform.yield
  }
}
```
Running `mlir-opt repro.mlir -transform-interpreter` on this file gives the following output for me:
```
repro.mlir:16:28: error: operand #0 does not dominate this use
      %extracted_slice_0 = tensor.extract_slice %1[%3, 0] [32, 128] [1, 1] : tensor<128x128xf16> to tensor<32x128xf16>
                           ^
repro.mlir:16:28: note: see current operation: %6 = "tensor.extract_slice"(%1, %4) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: 32, 128>, static_strides = array<i64: 1, 1>}> : (tensor<128x128xf16>, index) -> tensor<32x128xf16>
repro.mlir:12:10: note: operand defined here (op in a parent region)
    %1 = tensor.empty() : tensor<128x128xf16>
         ^
```

Dumping the generic IR reveals that the implicitly captured `tensor.empty` is somehow ending up after the fused loop: https://gist.github.com/qedawkins/11a10aeefef6168829f6bffdf949fe1e

I'm guessing there is an insertion point bug introduced somewhere in this refactor. (Initially observed downstream [here](https://github.com/iree-org/iree/pull/17799))
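
For reference, a correct fusion should leave the implicitly captured `tensor.empty` ahead of the fused loop, roughly like the sketch below (value names and the `shared_outs` ordering are illustrative, not the actual transform output):
```mlir
// Sketch of the expected fused result: `tensor.empty` stays before the
// fused loop so its use inside the body is dominated by its definition.
%1 = tensor.empty() : tensor<128x128xf16>
%fused:2 = scf.forall (%arg4) in (4)
    shared_outs(%arg5 = %arg0, %arg6 = %arg1) -> (tensor<128xf32>, tensor<128x128xf16>) {
  %3 = affine.apply #map(%arg4)
  // ... bodies of both original loops, unchanged ...
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %extracted_slice into %arg5[%3] [32] [1] : tensor<32xf32> into tensor<128xf32>
    tensor.parallel_insert_slice %4 into %arg6[%3, 0] [32, 128] [1, 1] : tensor<32x128xf16> into tensor<128x128xf16>
  }
} {mapping = [#gpu.warp<linear_dim_0>]}
```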

https://github.com/llvm/llvm-project/pull/94391
