[Mlir-commits] [mlir] [mlir] Canonicalize tensor.extract_slice (linalg.fill) (PR #112619)

Mon Oct 28 08:13:37 PDT 2024

banach-space wrote:

> @banach-space how about this change for this example
> 
> ```
> %empty = tensor.empty
> %extract_slice1 = tensor.extract_slice %empty
> %extract_slice2 = tensor.extract_slice %empty
> %res1 = linalg.fill (...) outs(%extract_slice1)
> %res2 = linalg.fill (...) outs(%extract_slice2)
> ```
> 
> That seems like that should fall in the always good to do category. Again I am not advocating one way or another, just thinking aloud.

I see 4 Ops _before_ and 5 Ops _after_ , so it's not immediately obvious to me that _AFTER_ is better :) What metric do you use to decide what's better/worse? Not claiming my metric is perfect, just sharing how I reason about these things.

I tried this simple experiment to better understand the tradeoffs (from my biased viewpoint):
```mlir
// 4 Ops - empty + fill + extract_slice + extract_slice
func.func @foo() -> (tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>) {
  %c0 = arith.constant 0. : f32
  %0 = tensor.empty() : tensor<2x1920x66x66xf32>
  %1 = linalg.fill ins(%c0 : f32) outs(%0 : tensor<2x1920x66x66xf32>) -> tensor<2x1920x66x66xf32>
  %extracted_slice = tensor.extract_slice %1[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
  %extracted_slice_2 = tensor.extract_slice %1[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
  return %extracted_slice,%extracted_slice_2 : tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>
}

// 5 Ops - empty + extract_slice + extract_slice + fill + fill
func.func @bar() -> (tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>) {
  %c0 = arith.constant 0. : f32
  %0 = tensor.empty() : tensor<2x1920x66x66xf32>
  %extracted_slice = tensor.extract_slice %0[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
  %extracted_slice_2 = tensor.extract_slice %0[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : tensor<2x1920x66x66xf32> to tensor<2x1920x64x66xf32>
  %res_1 = linalg.fill ins(%c0 : f32) outs(%extracted_slice : tensor<2x1920x64x66xf32>) -> tensor<2x1920x64x66xf32>
  %res_2 = linalg.fill ins(%c0 : f32) outs(%extracted_slice_2 : tensor<2x1920x64x66xf32>) -> tensor<2x1920x64x66xf32>
  return %res_1, %res_2 : tensor<2x1920x64x66xf32>, tensor<2x1920x64x66xf32>
}
```

After bufferization (`-one-shot-bufferize="bufferize-function-boundaries"`):
```mlir
  func.func @foo() -> (memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>) {
    %cst = arith.constant 0.000000e+00 : f32
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<2x1920x66x66xf32>
    linalg.fill ins(%cst : f32) outs(%alloc : memref<2x1920x66x66xf32>)
    %subview = memref.subview %alloc[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>
    %subview_0 = memref.subview %alloc[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
    %cast = memref.cast %subview : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
    %cast_1 = memref.cast %subview_0 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
    return %subview, %subview_0 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
  }
  func.func @bar() -> (memref<2x1920x64x66xf32>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>) {
    %cst = arith.constant 0.000000e+00 : f32
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<2x1920x66x66xf32>
    %subview = memref.subview %alloc[0, 0, 0, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1]>>
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<2x1920x64x66xf32>
    %subview_1 = memref.subview %alloc[0, 0, 2, 0] [2, 1920, 64, 66] [1, 1, 1, 1] : memref<2x1920x66x66xf32> to memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
    linalg.fill ins(%cst : f32) outs(%alloc_0 : memref<2x1920x64x66xf32>)
    linalg.fill ins(%cst : f32) outs(%subview_1 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>)
    %cast = memref.cast %alloc_0 : memref<2x1920x64x66xf32> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
    %cast_2 = memref.cast %subview_1 : memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>> to memref<2x1920x64x66xf32, strided<[?, ?, ?, ?], offset: ?>>
    return %alloc_0, %subview_1 : memref<2x1920x64x66xf32>, memref<2x1920x64x66xf32, strided<[8363520, 4356, 66, 1], offset: 132>>
  }
```

So, with 5 Ops there's one additional big allocation and an additional FillOp. Which is not ideal. However, I appreciate that a lot can happen between the proposed canonicalization and bufferization (i.e. my example is a bit academic).

> I think you should avoid using a tensor.empty for the destination of the fill op. It is better just using a slice of a destination.

Could you elaborate? I'd like to better follow the rationale here.

Just to clarify, I am not against this change. And it should be OK to keep it as canonicalization - no semantic context is lost, matching should be similar. IIUC, you are trying to get rid of `tensor.extract_slice`? I am trying to better grasp what is it that we are aiming for 😅 

https://github.com/llvm/llvm-project/pull/112619