[Mlir-commits] [mlir] [mlir][linalg] Implement Winograd Conv2D. (PR #94470)

llvmlistbot at llvm.org
Fri Jun 14 12:25:10 PDT 2024


https://github.com/Max191 requested changes to this pull request.

Thanks for working on adding these ops upstream, with more options for kernel size and input tile size than what we have in IREE! Just some higher-level op semantics comments for now.

The main difference I see between this implementation and the IREE implementation is that the dimensionality of the transforms is different. In this implementation, the 4D input tensor is transformed into another 4D tensor, while the IREE implementation produces a 6D tensor. The extra dimensionality comes from expanding `(alpha)x(alpha)` dimensions out of the `H` and `W` dimensions. In other words, with an `NHWC` layout, the IREE input transform would produce `tensor<TxTxNxceil(H/T)xceil(W/T)xC>`.
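To make the comparison concrete, here is a rough sketch of the two transformed types. Shapes are illustrative only (I am assuming `T = 6`, i.e. `m = 4` and `r = 3`, with a small `tensor<2x10x10x5xf32>` input), not the exact output of either implementation:

```mlir
// Illustrative shapes only, NHWC input with input tile size T = 6 (alpha):
//   input:              tensor<2x10x10x5xf32>
//   4D form (this PR):  something like tensor<2x12x12x5xf32>,
//                       i.e. the TxT tile grid folded back into H and W
//   6D form (IREE):     tensor<6x6x2x2x2x5xf32>,
//                       i.e. TxTxNxceil(H/T)xceil(W/T)xC
```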

This extra dimensionality is very useful for tiling, since the iteration space is actually over the innermost of those 6 dimensions. Writing a tiled implementation when some of those dimensions are collapsed together is tricky.
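For example, with the 6D form, extracting a single input tile is a plain rank-reducing slice over the loop dimensions. A minimal sketch, assuming `%t6` is the 6D transformed tensor from the sketch above and `%n`, `%th`, `%tw`, `%c` are hypothetical loop induction variables:

```mlir
// Each TxT tile is addressable directly; no collapse/expand reshapes needed.
%tile = tensor.extract_slice %t6[0, 0, %n, %th, %tw, %c] [6, 6, 1, 1, 1, 1] [1, 1, 1, 1, 1, 1]
    : tensor<6x6x2x2x2x5xf32> to tensor<6x6xf32>
```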

I would suggest reworking the op semantics a bit to allow for this extra dimensionality. Then, the tiled implementation from IREE should also just drop right in here. Another benefit of having separate input tile dimensions is that they can be made innermost if desired, which can potentially improve performance by giving the winograd ops contiguous accesses. We have experimented with this in IREE, but have not merged the change yet.
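For reference, the innermost-tile variant would look something like this (again illustrative only; `%t6i` is a hypothetical transformed tensor in that layout):

```mlir
// Tile dims innermost: Nxceil(H/T)xceil(W/T)xCxTxT.
%tile = tensor.extract_slice %t6i[%n, %th, %tw, %c, 0, 0] [1, 1, 1, 1, 6, 6] [1, 1, 1, 1, 1, 1]
    : tensor<2x2x2x5x6x6xf32> to tensor<6x6xf32>
// Each 6x6 tile is now contiguous in the default row-major
// layout after bufferization.
```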

The other difference I notice is that there is no implicit padding in the input transform. This means that for shapes that are unaligned with the input tiles, the input transform will not capture the last partial tile of the input tensor. Instead, the input transform should extract the last partial tile and pad the slice with zeros up to the input tile size (alpha).
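A minimal sketch of the suggested padding behavior, using `tensor.extract_slice` plus `tensor.pad` (all SSA names are hypothetical; `%h`/`%w` are the remaining extents of the partial tile and `%ph`/`%pw` are the zero-padding amounts that bring it up to TxT):

```mlir
// Extract the trailing partial tile from the NHWC input...
%slice = tensor.extract_slice %input[%n, %ho, %wo, %c] [1, %h, %w, 1] [1, 1, 1, 1]
    : tensor<2x10x10x5xf32> to tensor<?x?xf32>
// ...and zero-pad it up to the full TxT input tile (T = 6 here).
%zero = arith.constant 0.0 : f32
%padded = tensor.pad %slice low[0, 0] high[%ph, %pw] {
^bb0(%i: index, %j: index):
  tensor.yield %zero : f32
} : tensor<?x?xf32> to tensor<6x6xf32>
```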

The last comment is that the implementation here has explicit I64Attrs for the output height and width. These should not be necessary if the ops have padding semantics as suggested above, since neither the input transform nor the filter transform needs to know the output shape in that case. Having these attributes restricts the ops to static image sizes (making them operands could allow some dynamic support, but it is better not to rely on having these sizes at all, since they are not necessary).

https://github.com/llvm/llvm-project/pull/94470

