[Mlir-commits] [mlir] Revert "[mlir][tensor] Refine the semantics of `createPadHighOp`" (PR #110153)
Han-Chung Wang
llvmlistbot at llvm.org
Thu Sep 26 14:47:27 PDT 2024
hanhanW wrote:
> Looking at the output for this example , what's the point of padding here?
>
> ```mlir
> func.func @batch_matmul_f16(%arg0: tensor<1x?x1281xf16>, %arg1: tensor<1x1281x?xf16>, %arg2: tensor<1x?x?xf16>) -> tensor<1x?x?xf16> {
> %cst = arith.constant 0.000000e+00 : f16
> %c0 = arith.constant 0 : index
> %0 = linalg.fill ins(%cst : f16) outs(%arg2 : tensor<1x?x?xf16>) -> tensor<1x?x?xf16>
> %cst_0 = arith.constant 0.000000e+00 : f16
> %padded = tensor.pad %arg0 nofold low[0, 0, 0] high[0, 0, 0] {
> ^bb0(%arg3: index, %arg4: index, %arg5: index):
> tensor.yield %cst_0 : f16
> } : tensor<1x?x1281xf16> to tensor<1x?x1281xf16>
> %cst_1 = arith.constant 0.000000e+00 : f16
> %padded_2 = tensor.pad %arg1 nofold low[0, 0, 0] high[0, 0, 0] {
> ^bb0(%arg3: index, %arg4: index, %arg5: index):
> tensor.yield %cst_1 : f16
> } : tensor<1x1281x?xf16> to tensor<1x1281x?xf16>
> %c1 = arith.constant 1 : index
> %dim = tensor.dim %arg0, %c1 : tensor<1x?x1281xf16>
> %c2 = arith.constant 2 : index
> %dim_3 = tensor.dim %arg1, %c2 : tensor<1x1281x?xf16>
> %c1_4 = arith.constant 1 : index
> %dim_5 = tensor.dim %0, %c1_4 : tensor<1x?x?xf16>
> %c2_6 = arith.constant 2 : index
> %dim_7 = tensor.dim %0, %c2_6 : tensor<1x?x?xf16>
> %1 = linalg.batch_matmul ins(%padded, %padded_2 : tensor<1x?x1281xf16>, tensor<1x1281x?xf16>) outs(%0 : tensor<1x?x?xf16>) -> tensor<1x?x?xf16>
> %extracted_slice = tensor.extract_slice %1[0, 0, 0] [1, %dim, %dim_3] [1, 1, 1] : tensor<1x?x?xf16> to tensor<1x?x?xf16>
> %2 = bufferization.materialize_in_destination %extracted_slice in %0 : (tensor<1x?x?xf16>, tensor<1x?x?xf16>) -> tensor<1x?x?xf16>
> return %2 : tensor<1x?x?xf16>
> }
> ```
>
> In this particular case, `tensor.pad` is a NOP - that's because both dynamic dimensions (input and output) are assumed identical. But is that a safe/valid assumption? IMO, it's not. IIUC, `tensor.pad` shouldn't be generated at all. WDYT?
It is technically a NOP, but we still need to generate the pad op because of the `nofold` attribute, which is controlled by `pack_paddings` in the transform op. One scenario for using this transformation is to hint bufferization to create an intermediate buffer (e.g., a stack buffer on CPU), as a preparation step for vectorization. In one of our use cases, we `tile parallel loops -> pad parallel dimensions -> tile reduction loops -> pad reduction dimensions` to enable vectorization for dynamic shapes. The resulting chain of `extract_slice -> pad (foldable) -> extract_slice -> pad (nofold)` can be folded into a single pad op by the [FoldOrthogonalPaddings pattern](https://github.com/llvm/llvm-project/blob/c11722223bacf604e60414542743d021a9f13aee/mlir/lib/Dialect/Tensor/IR/TensorOps.cpp#L3147-L3183). I think we want to keep that working, so we don't want to break `linalg::padAndHoistLinalgOp`'s behavior.
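For context, here is a minimal sketch (not taken from this PR) of the kind of transform script that drives this: `transform.structured.pad` takes a `pack_paddings` array, and every operand whose entry is 1 gets a `nofold` pad even when the padding amount turns out to be zero. The matched op and the attribute values below are illustrative assumptions, not the exact script used here.

```mlir
// Hypothetical transform script; attribute values are illustrative.
// pack_paddings = [1, 1, 0] marks the two inputs as nofold, so their
// tensor.pad ops are kept even when they pad by zero.
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg0: !transform.any_op {transform.readonly}) {
    %matmul = transform.structured.match ops{["linalg.batch_matmul"]} in %arg0
      : (!transform.any_op) -> !transform.any_op
    %padded, %pad, %copy_back = transform.structured.pad %matmul {
      padding_values = [0.0 : f16, 0.0 : f16, 0.0 : f16],
      padding_dimensions = [0, 1, 2],
      pack_paddings = [1, 1, 0]
    } : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op)
    transform.yield
  }
}
```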
https://github.com/llvm/llvm-project/pull/110153