[Mlir-commits] [mlir] [mlir][linalg] Add extra_pad_tiles to linalg.pack & unpack (PR #189049)

Mon Apr 13 06:18:19 PDT 2026

fabrizio-indirli wrote:

Thanks for your detailed answer and for taking the time to provide suggestions! :)

I pushed a new simplified version that gets rid of the extra attributes, and simply allows having a bigger output shape (with extra tiles) by relaxing the checks in the verifier. I also restricted this new behavior only to the cases where the padded outer dim is known statically. This also allowed to simplify the code and reduce the size of this patch.

> The other point is that the PR will make pack/unpack asymmetry, which sounds not good to me.

Good point; I believe that this is now solved by removing the custom attributes.

> The restriction was not just a semantic cleanup — it enabled concrete simplifications in key transformations like tiling, fusion and vectorization.
If the restriction no longer holds, it is not possible to tile and fuse the op in some cases.

Initially I had forgotten to look the the Tiling and Vectorization paths, sorry for that. 
In my new version of the patch I extended the **Tiling** flow to support also `linalg.pack` and `.unpack` with extra tiles when the padded dim is not tiled (the slice has to cover the full source dimension for which we are adding extra tiles), and I added some relevant test cases.

I understand this might look like a strong limitation, but it affects only the dim(s) for which extra tiles are being added (if any). There were already similar restrictions in place already for the tiling of `linalg.pack`: for example the incoming producer tile size must be a multiple of the inner tile size (which means that even in the existing implementation "_it is not possible to tile and fuse the op in some cases_"). Thus, for the extra pad tiles case, I don't see why can't we add a similar check in the tiling implementation (as I did in my new patch), instead of preventing extra tiles altogether. Besides, not everyone uses the tile-and-fuse facilities.

For **Vectorization**, it seems to me that the current implementation is already compatible with my modifications, and I added a couple test cases to test it. Not sure if I'm missing something: if you believe there are cases in which the vectorization would not work, could you provide a small example?

> The "no masking support" argument doesn't hold to me. There is an existing lowering pipeline handles this

Your argument assumes that everyone is (and will be) always lowering `linalg.pack` to `vector` in their compilers. But IMHO we shouldn't make this assumption: one of the nice features of MLIR is that you can mix and match dialects and lowering paths as you please. For example, in our compiler we don't lower linalg.pack to vector. 
In some cases it may not even be an option, if the target hardware doesn't directly support vectors or other required facilities. For example, in your answer you say that "_[VectorEmulateMaskedLoadStore](https://github.com/llvm/llvm-project/blob/c1ebd2f1c07f1dc51df1c1a56891fe564cbadc19/mlir/lib/Dialect/Vector/Transforms/VectorEmulateMaskedLoadStore.cpp#L22-L96) lowers vector.maskedload to per-element scf.if + memref.load for targets without native mask support_", but what if the target hardware is a fixed-function accelerator that doesn't support control flow?
Moreover, even when we're allowed to use this lowering flow, you said yourself that without masking we would end up with a very slow implementation, unless we did some clean-up afterwards. But cleaning up bad code after it's been produced is IMHO a worse approach than producing efficient code in the first place, where possible.

More in general, I think that linalg is a high-level generic dialect and, in my opinion, it should not be tied too much with lower-level implementation details such as the lowering to `vector`.

>  When you talk about the decomposition, [...], how about you swap pad and pack op? It is easy to write a pattern to bubble up such pad op [...]. This way, you'll see contiguous tensor.pad op if you run lowerPackOp and it is easy to collapse them into a single tensor.pad op, which results in the IR you want.

I suppose it might work, but then again we'd be producing IR with redundant operations (two `tensor.pad` operations in the to pad the same tensor) and then hoping to clean it up with additional bubble-up and fusion passes. I know that this pattern can happen in compilers, but where possible I'd prefer producing clean IR in the first place. 
The reason I'm proposing this patch is that we can reuse the existing padding facilities of `linalg.pack` (already used to support intra-tile padding) to also handle extra-tile padding, without introducing additional redundant ops, and this works both at the `tensor` and `vector` level (take a look at the provided test cases). And with `memref` it's even more impactful, since padding memrefs requires longer chains of operations that we'd rather not duplicate...

https://github.com/llvm/llvm-project/pull/189049