[Mlir-commits] [mlir] [mlir][linalg] Add extra_pad_tiles to linalg.pack & unpack (PR #189049)

Tue Mar 31 04:37:45 PDT 2026

fabrizio-indirli wrote:

Hi, thanks for taking a look. I'll try and answer your questions below:

> I need more context to provide guidance. What are you trying to solve?

There can be cases where you pack some input data and then feed it to a compute kernel that **consumes multiple tiles of packed data per iteration**. AFAIK this is a common technique to improve data reuse (and performance) in GEMM kernels. However, you may not want to change the packing layout, for example because the tile size matches some hardware property (vector load size, cache line size, ...). So you may want to pack your input `HxW` tensor into a `NumTilesH x NumTilesW x TileSzH x TileSzW` tensor, and then have each iteration of your kernel read a `KerTilesH x KerTilesW x TileSzH x TileSzW` slice. If `KerTilesH >= 2` or `KerTilesW >= 2`, you may need to add extra pad tiles to avoid out-of-bounds accesses (assuming you can't use masked loads).

For example, a 15x15 input tensor can be packed with `TileSzH==TileSzW==5` as a 3x3x5x5 packed tensor; but if the consuming kernel reads 2x2 tiles (i.e. a 2x2x5x5 slice) for each iteration, then we'd need to add an extra pad tile in each dimension and obtain a 4x4x5x5 tensor to avoid out-of-bounds accesses.
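
The shape arithmetic in this example can be sketched in a few lines of Python. This is only an illustration, not code from the PR: the helper names `extra_pad_tiles` and `packed_shape` are hypothetical, and the sketch assumes the kernel advances by `KerTiles` tiles per iteration, so the outer dimension has to be rounded up to a multiple of it.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive b."""
    return -(-a // b)


def extra_pad_tiles(dim: int, tile: int, ker_tiles: int) -> int:
    """Extra outer tiles needed so a kernel reading `ker_tiles` tiles per
    iteration never reads past the end of the packed tensor.

    Assumption (not stated in the PR): the kernel steps by `ker_tiles`
    tiles each iteration, so the outer dim must be a multiple of it.
    """
    num_tiles = ceil_div(dim, tile)  # tiles that hold real data
    padded_tiles = ceil_div(num_tiles, ker_tiles) * ker_tiles
    return padded_tiles - num_tiles


def packed_shape(shape, tiles, ker_tiles):
    """Packed shape [outer dims..., tile dims...] including extra pad tiles."""
    outer = [
        ceil_div(d, t) + extra_pad_tiles(d, t, k)
        for d, t, k in zip(shape, tiles, ker_tiles)
    ]
    return outer + list(tiles)


print(packed_shape([15, 15], [5, 5], [1, 1]))  # [3, 3, 5, 5]
print(packed_shape([15, 15], [5, 5], [2, 2]))  # [4, 4, 5, 5]
```

With one tile per iteration no extra tiles are needed; with 2x2 tiles per iteration each outer dimension grows from 3 to 4, matching the 4x4x5x5 tensor above.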

> This seems like it is overloading the semantics of the op too much. Can this be handled through separate ops.
> The first question popped out in my mind is: what is the issue of using "pack -> pad" and "extract_slice -> unpack"?

It can be handled through separate ops, but having the attribute on the op itself centralizes the handling of padding in `linalg.pack` for all cases, which seems natural given that `linalg.pack` can already handle intra-tile padding. Most lowerings of `linalg.pack` that support intra-tile padding should also be able to support the extra tiles without materializing additional ops. It also makes sense from a performance perspective: it avoids dispatching additional operations when the entire padding can be handled by the same op.
For example, with the upstream lowering:
```
// Lowering with intra-tile padding (as already supported by the op)
// SOURCE:
%pack = linalg.pack %arg0 padding_value(%cst_0 : f32) inner_dims_pos = [0, 1] inner_tiles = [5, 5] into %arg1 : tensor<15x14xf32> -> tensor<3x3x5x5xf32>
// LOWERED:
%padded = tensor.pad %arg0 low[0, 0] high[0, 1] {
    ^bb0(%arg2: index, %arg3: index):
      tensor.yield %cst : f32
    } : tensor<15x14xf32> to tensor<15x15xf32>
%expanded = tensor.expand_shape %padded [[0, 1], [2, 3]] output_shape [3, 5, 3, 5] : tensor<15x15xf32> into tensor<3x5x3x5xf32>
%transposed = linalg.transpose ins(%expanded : tensor<3x5x3x5xf32>) outs(%arg1 : tensor<3x3x5x5xf32>) permutation = [0, 2, 1, 3]

// Lowering with extra-tile padding (added by this PR)
// SOURCE:
%pack = linalg.pack %arg0 padding_value(%cst_0 : f32) inner_dims_pos = [0, 1] inner_tiles = [5, 5] extra_pad_tiles = [1, 1] into %arg1 : tensor<15x14xf32> -> tensor<4x4x5x5xf32>
// LOWERED
%padded = tensor.pad %arg0 low[0, 0] high[5, 6] {
    ^bb0(%arg2: index, %arg3: index):
      tensor.yield %cst : f32
    } : tensor<15x14xf32> to tensor<20x20xf32>
%expanded = tensor.expand_shape %padded [[0, 1], [2, 3]] output_shape [4, 5, 4, 5] : tensor<20x20xf32> into tensor<4x5x4x5xf32>
%transposed = linalg.transpose ins(%expanded : tensor<4x5x4x5xf32>) outs(%arg1 : tensor<4x4x5x5xf32>) permutation = [0, 2, 1, 3]
```

or, with a custom lowering to linalg + scf, one could have:
```
// Lowering with intra-tile padding (as already supported by the op)
// SOURCE:
%pack = linalg.pack %arg0 padding_value(%cst_0 : f32) inner_dims_pos = [0, 1] inner_tiles = [5, 5] into %arg1 : tensor<15x14xf32> -> tensor<3x3x5x5xf32>
// LOWERED:
%0 = linalg.generic {indexing_maps = [#map1], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} outs(%arg1 : tensor<3x3x5x5xf32>) {
    ^bb0(%out: f32):
      %1 = linalg.index 0 : index
      %2 = linalg.index 1 : index
      %3 = linalg.index 2 : index
      %4 = linalg.index 3 : index
      %5 = affine.apply affine_map<(d0, d1)[s0] -> (d0 * s0 + d1)>(%1, %3)[%c5]
      %6 = affine.apply affine_map<(d0, d1)[s0] -> (d0 * s0 + d1)>(%2, %4)[%c5]
      %7 = arith.cmpi uge, %6, %c14 : index
      %8 = scf.if %7 -> (f32) {
        scf.yield %cst : f32
      } else {
        %extracted = tensor.extract %arg0[%5, %6] : tensor<15x14xf32>
        scf.yield %extracted : f32
      }
      linalg.yield %8 : f32
    } -> tensor<3x3x5x5xf32>

// Lowering with extra-tile padding (added by this PR)
// SOURCE:
%pack = linalg.pack %arg0 padding_value(%cst_0 : f32) inner_dims_pos = [0, 1] inner_tiles = [5, 5] extra_pad_tiles = [1, 1] into %arg1 : tensor<15x14xf32> -> tensor<4x4x5x5xf32>
// LOWERED
%0 = linalg.generic {indexing_maps = [#map1], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} outs(%arg1 : tensor<4x4x5x5xf32>) {
    ^bb0(%out: f32):
      %1 = linalg.index 0 : index
      %2 = linalg.index 1 : index
      %3 = linalg.index 2 : index
      %4 = linalg.index 3 : index
      %5 = affine.apply affine_map<(d0, d1)[s0] -> (d0 * s0 + d1)>(%1, %3)[%c5]
      %6 = affine.apply affine_map<(d0, d1)[s0] -> (d0 * s0 + d1)>(%2, %4)[%c5]
      %7 = arith.cmpi uge, %5, %c15 : index
      %8 = arith.cmpi uge, %6, %c14 : index
      %9 = arith.ori %7, %8 : i1
      %10 = scf.if %9 -> (f32) {
        scf.yield %cst : f32
      } else {
        %extracted = tensor.extract %arg0[%5, %6] : tensor<15x14xf32>
        scf.yield %extracted : f32
      }
      linalg.yield %10 : f32
    } -> tensor<4x4x5x5xf32>
```
In both cases, the extra pad tiles are handled by the same padding mechanism that we already materialize for intra-tile padding.
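
To make the "same mechanism" point concrete, here is a minimal Python sketch of the guarded element-wise pack that the custom lowering above expresses with `scf.if`. It is an illustration under assumptions, not the PR's implementation: `pack_2d` and its `extra_h` / `extra_w` parameters are hypothetical names, and it only covers the 2-D case with `inner_dims_pos = [0, 1]`.

```python
def pack_2d(src, tile_h, tile_w, extra_h=0, extra_w=0, pad=0.0):
    """Pack a 2-D list into [outer_h][outer_w][tile_h][tile_w].

    Any coordinate outside the source yields `pad`, whether it falls in a
    partially filled tile or in an extra pad tile: one guard covers both.
    """
    h, w = len(src), len(src[0])
    outer_h = -(-h // tile_h) + extra_h  # ceil(h / tile_h) plus extra tiles
    outer_w = -(-w // tile_w) + extra_w
    packed = []
    for i in range(outer_h):
        tiles_in_row = []
        for j in range(outer_w):
            tile = []
            for k in range(tile_h):
                r = i * tile_h + k  # source row, as in the affine.apply
                line = []
                for l in range(tile_w):
                    c = j * tile_w + l  # source column
                    in_bounds = r < h and c < w
                    line.append(src[r][c] if in_bounds else pad)
                tile.append(line)
            tiles_in_row.append(tile)
        packed.append(tiles_in_row)
    return packed


# 15x14 source, 5x5 tiles, as in the IR examples above.
src = [[float(r * 14 + c) for c in range(14)] for r in range(15)]
plain = pack_2d(src, 5, 5, pad=-1.0)                        # 3x3x5x5
extra = pack_2d(src, 5, 5, extra_h=1, extra_w=1, pad=-1.0)  # 4x4x5x5
```

The only difference between the two calls is the loop bounds; the bounds check that fills the last column of the 15x14 input also fills the extra tiles.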

You ask why this can't be handled through separate ops (e.g. a pad next to the pack), but by the same logic, why do we need intra-tile padding in `linalg.pack` at all if we could simply pad the input before packing? These IRs should be equivalent:
```
// with intra-tile padding done by linalg.pack
%pack = linalg.pack %arg0 padding_value(%cst_0 : f32) inner_dims_pos = [0, 1] inner_tiles = [5, 5] into %arg1 : tensor<15x14xf32> -> tensor<3x3x5x5xf32>

// with intra-tile padding done separately
%padded = tensor.pad %arg0 low[0, 0] high[0, 1] {
    ^bb0(%arg2: index, %arg3: index):
      tensor.yield %cst : f32
    } : tensor<15x14xf32> to tensor<15x15xf32>
%pack = linalg.pack %padded inner_dims_pos = [0, 1] inner_tiles = [5, 5] into %arg1 : tensor<15x15xf32> -> tensor<3x3x5x5xf32>
```

We can produce extra-tile padding in exactly the same way:
```
%padded = tensor.pad %arg0 low[0, 0] high[5, 6] {
    ^bb0(%arg2: index, %arg3: index):
      tensor.yield %cst : f32
    } : tensor<15x14xf32> to tensor<20x20xf32>
%pack = linalg.pack %padded inner_dims_pos = [0, 1] inner_tiles = [5, 5] into %arg1 : tensor<20x20xf32> -> tensor<4x4x5x5xf32>
```
but we have no way of expressing it directly in the `linalg.pack` op. Why would we want to keep such a limitation? I would argue that the current handling of padding in `linalg.pack` is incomplete, and this PR tries to extend it to all cases.
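
For reference, the `high` amounts used in the `tensor.pad` ops above follow directly from the tile sizes and the number of extra tiles. A small Python sketch of that arithmetic (the helper name `pad_high` is hypothetical, not something from the PR):

```python
def pad_high(shape, tiles, extra_tiles):
    """High-side padding per dimension for a pad-before-pack rewrite.

    Each dimension is rounded up to a whole number of tiles, then
    `extra_tiles` full tiles are appended.
    """
    return [
        (-(-d // t) + e) * t - d  # (ceil(d / t) + e) * t - d
        for d, t, e in zip(shape, tiles, extra_tiles)
    ]


print(pad_high([15, 14], [5, 5], [0, 0]))  # [0, 1]
print(pad_high([15, 14], [5, 5], [1, 1]))  # [5, 6]
```

The first result corresponds to the intra-tile-only case (`high[0, 1]`), the second to the case with one extra tile per dimension (`high[5, 6]`).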

https://github.com/llvm/llvm-project/pull/189049