[Mlir-commits] [mlir] Refactor LoopFuseSiblingOp and support parallel fusion (PR #94391)
Quinn Dawkins
llvmlistbot at llvm.org
Tue Jul 2 20:44:01 PDT 2024
qedawkins wrote:
This patch appears to introduce dominance errors. For example:
```mlir
#map = affine_map<(d0) -> (d0 * 32)>
#map1 = affine_map<(d0, d1) -> (d0, d1)>
module {
  func.func @loop_sibling_fusion(%arg0: tensor<128xf32>, %arg1: tensor<128x128xf16>, %arg2: tensor<128x64xf32>, %arg3: tensor<128x128xf32>) {
    %0 = scf.forall (%arg4) in (4) shared_outs(%arg5 = %arg0) -> (tensor<128xf32>) {
      %3 = affine.apply #map(%arg4)
      %extracted_slice = tensor.extract_slice %arg3[%3, 0] [32, 1] [1, 1] : tensor<128x128xf32> to tensor<32xf32>
      scf.forall.in_parallel {
        tensor.parallel_insert_slice %extracted_slice into %arg5[%3] [32] [1] : tensor<32xf32> into tensor<128xf32>
      }
    } {mapping = [#gpu.warp<linear_dim_0>]}
    %1 = tensor.empty() : tensor<128x128xf16>
    %2 = scf.forall (%arg4) in (4) shared_outs(%arg5 = %arg1) -> (tensor<128x128xf16>) {
      %3 = affine.apply #map(%arg4)
      %extracted_slice = tensor.extract_slice %arg3[%3, 0] [32, 128] [1, 1] : tensor<128x128xf32> to tensor<32x128xf32>
      %extracted_slice_0 = tensor.extract_slice %1[%3, 0] [32, 128] [1, 1] : tensor<128x128xf16> to tensor<32x128xf16>
      %4 = linalg.generic {indexing_maps = [#map1, #map1], iterator_types = ["parallel", "parallel"]} ins(%extracted_slice : tensor<32x128xf32>) outs(%extracted_slice_0 : tensor<32x128xf16>) {
      ^bb0(%in: f32, %out: f16):
        %5 = arith.truncf %in : f32 to f16
        linalg.yield %5 : f16
      } -> tensor<32x128xf16>
      scf.forall.in_parallel {
        tensor.parallel_insert_slice %4 into %arg5[%3, 0] [32, 128] [1, 1] : tensor<32x128xf16> into tensor<128x128xf16>
      }
    } {mapping = [#gpu.warp<linear_dim_0>]}
    return
  }
}
module attributes { transform.with_named_sequence } {
  transform.named_sequence @__transform_main(%root: !transform.any_op) {
    %loops = transform.structured.match ops{["scf.forall"]} in %root : (!transform.any_op) -> !transform.any_op
    %loop1, %loop2 = transform.split_handle %loops : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
    %loop3 = transform.loop.fuse_sibling %loop1 into %loop2 : (!transform.any_op, !transform.any_op) -> !transform.any_op
    transform.yield
  }
}
```
Running `mlir-opt repro.mlir -transform-interpreter` gives the following output for me:
```
repro.mlir:16:28: error: operand #0 does not dominate this use
%extracted_slice_0 = tensor.extract_slice %1[%3, 0] [32, 128] [1, 1] : tensor<128x128xf16> to tensor<32x128xf16>
^
repro.mlir:16:28: note: see current operation: %6 = "tensor.extract_slice"(%1, %4) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: 32, 128>, static_strides = array<i64: 1, 1>}> : (tensor<128x128xf16>, index) -> tensor<32x128xf16>
repro.mlir:12:10: note: operand defined here (op in a parent region)
%1 = tensor.empty() : tensor<128x128xf16>
^
```
Dumping the generic IR reveals that the implicitly captured `tensor.empty` somehow ends up after the fused loop: https://gist.github.com/qedawkins/11a10aeefef6168829f6bffdf949fe1e
I'm guessing an insertion-point bug was introduced somewhere in this refactor. (Initially observed downstream [here](https://github.com/iree-org/iree/pull/17799))
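For reference, a valid fusion of these two loops would have to keep the `tensor.empty` (and any other value the second loop's body captures) defined *before* the fused `scf.forall`. A hand-written sketch of the shape I'd expect, not actual pass output (value names here are illustrative):

```mlir
// Sketch only: the captured %empty must dominate the fused loop.
%empty = tensor.empty() : tensor<128x128xf16>
%fused:2 = scf.forall (%arg4) in (4)
    shared_outs(%o0 = %arg0, %o1 = %arg1) -> (tensor<128xf32>, tensor<128x128xf16>) {
  // ... bodies of both original loops merged here, with %empty
  // captured from above, which is legal since it now dominates ...
} {mapping = [#gpu.warp<linear_dim_0>]}
```

In the gist above, the fused loop instead precedes the `tensor.empty`, which is what trips the verifier.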
https://github.com/llvm/llvm-project/pull/94391