[Mlir-commits] [mlir] [mlir] Canonicalize tensor.extract_slice (linalg.fill) (PR #112619)
Andrzej Warzyński
llvmlistbot at llvm.org
Mon Oct 28 08:13:37 PDT 2024
banach-space wrote:
> @banach-space how about this change for this example
>
> ```
> %empty = tensor.empty
> %extract_slice1 = tensor.extract_slice %empty
> %extract_slice2 = tensor.extract_slice %empty
> %res1 = linalg.fill (...) outs(%extract_slice1)
> %res2 = linalg.fill (...) outs(%extract_slice2)
> ```
>
> That seems like that should fall in the always good to do category. Again I am not advocating one way or another, just thinking aloud.
I see 4 Ops _before_ and 5 Ops _after_, so it's not immediately obvious to me that _after_ is better :) What metric do you use to decide what's better/worse? Not claiming my metric is perfect, just sharing how I reason about these things.
I tried this simple experiment to better understand the tradeoffs (from my biased viewpoint):
```mlir
// 4 Ops - empty + fill + extract_slice + extract_slice
func.func @foo() -> (tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>) {
%c0 = arith.constant 0. : f32
%0 = tensor.empty() : tensor<2x1920x66x66xf32>
%1 = linalg.fill ins(%c0 : f32) outs(%0 : tensor<2x1920x66x66xf32>) -> tensor<2x1920x66x66xf32>
%extracted_slice = tensor.extract_slice %1[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
%extracted_slice_2 = tensor.extract_slice %1[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
  return %extracted_slice, %extracted_slice_2 : tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>
}
// 5 Ops - empty + extract_slice + extract_slice + fill + fill
func.func @bar() -> (tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>) {
%c0 = arith.constant 0. : f32
%0 = tensor.empty() : tensor<2x1920x66x66xf32>
%extracted_slice = tensor.extract_slice %0[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
%extracted_slice_2 = tensor.extract_slice %0[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
%res_1 = linalg.fill ins(%c0 : f32) outs(%extracted_slice : tensor<2x1920x64x66xf32>) -> tensor<2x1920x64x66xf32>
%res_2 = linalg.fill ins(%c0 : f32) outs(%extracted_slice_2 : tensor<2x1920x64x66xf32>) -> tensor<2x1920x64x66xf32>
return %res_1, %res_2 : tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>
}
```
After bufferization (`-one-shot-bufferize="bufferize-function-boundaries"`):
```mlir
func.func @foo() -> (memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>) {
%cst = arith.constant 0.000000e+00 : f32
%alloc = memref.alloc() {alignment = 64 : i64} : memref<2x1920x66x66xf32>
linalg.fill ins(%cst : f32) outs(%alloc : memref<2x1920x66x66xf32>)
%subview = memref.subview %alloc[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>
%subview_0 = memref.subview %alloc[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
%cast = memref.cast %subview : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
%cast_1 = memref.cast %subview_0 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
return %subview, %subview_0 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
}
func.func @bar() -> (memref<2x1920x64x66xf32>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>) {
%cst = arith.constant 0.000000e+00 : f32
%alloc = memref.alloc() {alignment = 64 : i64} : memref<2x1920x66x66xf32>
%subview = memref.subview %alloc[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>
%alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<2x1920x64x66xf32>
%subview_1 = memref.subview %alloc[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
linalg.fill ins(%cst : f32) outs(%alloc_0 : memref<2x1920x64x66xf32>)
linalg.fill ins(%cst : f32) outs(%subview_1 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>)
%cast = memref.cast %alloc_0 : memref<2x1920x64x66xf32> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
%cast_2 = memref.cast %subview_1 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
return %alloc_0, %subview_1 : memref<2x1920x64x66xf32>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
}
```
So, with 5 Ops there's one additional big allocation (`%alloc_0`, a full 2x1920x64x66 buffer) and an additional FillOp, which is not ideal. However, I appreciate that a lot can happen between the proposed canonicalization and bufferization (i.e. my example is a bit academic).
> I think you should avoid using a tensor.empty for the destination of the fill op. It is better just using a slice of a destination.
Could you elaborate? I'd like to better follow the rationale here.
Just to clarify, I am not against this change, and it should be OK to keep it as a canonicalization - no semantic context is lost, and matching should be similar. IIUC, you are trying to get rid of `tensor.extract_slice`? I am just trying to better grasp what it is that we are aiming for 😅
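For concreteness, here is roughly the shape of rewrite I'm assuming this PR implements - just a sketch to make sure we are talking about the same transformation, not the actual patch (the pattern name and the single-use restriction are mine):
```cpp
// Sketch only - swap `tensor.extract_slice(linalg.fill)` into
// `linalg.fill(tensor.extract_slice)` so that only the slice gets filled.
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

struct SwapExtractSliceOfFill final
    : public OpRewritePattern<tensor::ExtractSliceOp> {
  using OpRewritePattern::OpRewritePattern;

  LogicalResult matchAndRewrite(tensor::ExtractSliceOp sliceOp,
                                PatternRewriter &rewriter) const override {
    // Only fire when the slice is the sole user of the fill, so the original
    // (bigger) fill is guaranteed to disappear.
    auto fillOp = sliceOp.getSource().getDefiningOp<linalg::FillOp>();
    if (!fillOp || !fillOp->hasOneUse())
      return failure();

    // Slice the fill's destination instead of its result ...
    auto newSlice = rewriter.create<tensor::ExtractSliceOp>(
        sliceOp.getLoc(), sliceOp.getType(), fillOp.getOutputs().front(),
        sliceOp.getMixedOffsets(), sliceOp.getMixedSizes(),
        sliceOp.getMixedStrides());

    // ... and fill only that (smaller) slice; the fill's result type is
    // inferred from the new output tensor.
    rewriter.replaceOpWithNewOp<linalg::FillOp>(
        sliceOp, fillOp.getInputs(), ValueRange{newSlice.getResult()});
    return success();
  }
};
```
(The single-use check is there because, with other users of the fill around, the rewrite would duplicate work rather than remove it - which is basically the multi-slice case above.)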
https://github.com/llvm/llvm-project/pull/112619