[Mlir-commits] [mlir] [mlir][vector] Add vector.transpose with unit-dim to vector.shape_cast pattern (PR #72105)
Quinn Dawkins
llvmlistbot at llvm.org
Thu Nov 23 03:56:36 PST 2023
qedawkins wrote:
> * Is `<Nx1>` -> `<1xN>` really a transpose? I can see that you can describe the transformation on the shape as a transpose, but that seems fairly arbitrary to me to try to stick to "transpose" instead of, for example, contract_shape / expand_shape (or whatever doing `<Nx1>` -> `<N>` -> `<1xN>` would be named).
> Is what you'd want in this case an `affine_shape_cast`, where the cast isn't "arbitrary" (not that it is here either) but encoded with an affine map? That is, something like `%1 = vector.reshape : vector<1x4xf32> with (d0, d1) -> (d1, d0) to vector<4x1xf32>`? The map can be arbitrary as long as it is consistent with the "no data movement" aspect.
> (now by your argument, a map is also somehow "unstructured" compared to a "transpose" ;) ).
Yes, I think we're finally narrowing it down to the crux of the issue. How about this: I want to add a pattern that folds `vector.transpose` into `vector.contract`. This runs relatively early on, before I've tried to do any sort of unit dim folding (and would run on IR without the same unit dims):
```mlir
%0 = vector.transpose %arg1, [1, 0] : vector<4x1xf32> to vector<1x4xf32>
%1 = vector.contract {
  indexing_maps = [
    affine_map<(d0, d1) -> (d0, d1)>,
    affine_map<(d0, d1) -> (d0, d1)>,
    affine_map<(d0, d1) -> (d0)>],
  iterator_types = ["parallel", "reduction"],
  kind = #vector.kind<add>}
  %arg0, %0, %arg2 : vector<1x4xf32>, vector<1x4xf32> into vector<1xf32>
```
becomes
```mlir
%1 = vector.contract {
  indexing_maps = [
    affine_map<(d0, d1) -> (d0, d1)>,
    affine_map<(d0, d1) -> (d1, d0)>,
    affine_map<(d0, d1) -> (d0)>],
  iterator_types = ["parallel", "reduction"],
  kind = #vector.kind<add>}
  %arg0, %arg1, %arg2 : vector<1x4xf32>, vector<4x1xf32> into vector<1xf32>
```
However, if this is a global canonicalization, I would see a shape cast instead of a transpose:
```mlir
%0 = vector.shape_cast %arg1 : vector<4x1xf32> to vector<1x4xf32>
%1 = vector.contract {
  indexing_maps = [
    affine_map<(d0, d1) -> (d0, d1)>,
    affine_map<(d0, d1) -> (d0, d1)>,
    affine_map<(d0, d1) -> (d0)>],
  iterator_types = ["parallel", "reduction"],
  kind = #vector.kind<add>}
  %arg0, %0, %arg2 : vector<1x4xf32>, vector<1x4xf32> into vector<1xf32>
```
There is no general pattern for folding shape_cast into a contraction, because the maps of the contraction can only be projected permutations: https://github.com/llvm/llvm-project/blob/0d1b220f368f1c03e4d509efe36f94098f6489c7/mlir/lib/Dialect/Vector/IR/VectorOps.cpp#L946
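To make the constraint concrete, here is a minimal sketch (the shapes and `%arg1` are made up for illustration): folding a general shape_cast into a consumer contraction would require the operand's indexing map to describe the reshape itself, which is not a projected permutation; only the unit-dim case happens to reduce to a legal permutation like the `(d0, d1) -> (d1, d0)` above.
```mlir
// Hypothetical non-unit-dim case: absorbing this shape_cast into a consuming
// vector.contract would need an operand map along the lines of
// (d0, d1) -> (d0 * 3 + d1), which is not a projected permutation, so the
// general fold is not expressible. Dropping a unit dim, by contrast, reduces
// to a plain permutation of the remaining dims.
%0 = vector.shape_cast %arg1 : vector<2x3xf32> to vector<6xf32>
```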
So if I want to recover the same IR, I have to write a transformation that special-cases shape_casts of unit dimensions (because I know the pattern happens to work out in that case). Otherwise, I am _forced_ to do unit dim folding to get to the same state, and even if unit dim folding eventually gets me where I want, I am fighting the canonicalization patterns to get there. Or I might want to target some kind of matrix multiply intrinsic that expects the redundant unit dimension and is styled like cooperative_matrix/wmma, which requires opaque types that extend to the loads: https://mlir.llvm.org/docs/Dialects/GPU/#gpusubgroup_mma_load_matrix-gpusubgroupmmaloadmatrixop
All it requires is a vendor to add a matvec variant of this in Vulkan and we'll want to do exactly that: https://mlir.llvm.org/docs/Dialects/SPIR-V/#spirvkhrcooperativematrixmuladd-spirvkhrcooperativematrixmuladdop
> I can see that you can describe the transformation on the shape as a transpose, but that seems fairly arbitrary to me to try to stick to "transpose" instead of, for example, contract_shape / expand_shape (or whatever doing <Nx1> -> <N> -> <1xN> would be named).
100% agree, and I think this is where the disconnect has been coming from. Transpose is a fairly arbitrary choice that happens to work better for some patterns/analyses. I am definitely not arguing it should be the general choice here, but rather that the same is true of shape_cast. The vector dialect, by virtue of representing virtualized "super vectors," represents looped computations (transfer_read, transfer_write, contract), and the extra permutation info on transpose is convenient when adjacent to such ops. When adjacent to vector.load and otherwise mostly unrolled, "lowered" vector code, it isn't really useful anymore (fwiw, at a glance the SPIR-V failures looked more like the latter to me and should be addressed).
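As a hedged sketch of what "convenient when adjacent to such ops" means (the `%src`, `%c0`, and `%pad` values are made up, and this isn't claiming a specific upstream canonicalization), a transpose next to a transfer_read can be absorbed into the read's `permutation_map`, while a shape_cast leaves nothing structured to absorb:
```mlir
// Before: a transfer_read followed by a unit-dim transpose.
%v = vector.transfer_read %src[%c0, %c0], %pad
    {in_bounds = [true, true],
     permutation_map = affine_map<(d0, d1) -> (d0, d1)>}
    : memref<4x1xf32>, vector<4x1xf32>
%t = vector.transpose %v, [1, 0] : vector<4x1xf32> to vector<1x4xf32>

// After: the permutation is carried by the transfer_read itself.
%t2 = vector.transfer_read %src[%c0, %c0], %pad
    {in_bounds = [true, true],
     permutation_map = affine_map<(d0, d1) -> (d1, d0)>}
    : memref<4x1xf32>, vector<1x4xf32>
```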
So then my conclusion is that what really needs addressing is better patterns and/or ops for folding away unit vector dims. I think we're actually all trying to get to similar IR in the end (mostly 1-D vector code outside of specific cases). SPIR-V found ways to do it out of necessity, and now others are trying to do the same with LLVM, which happened to interfere with the way SPIR-V did it. If we had a good way to, like Diego said, unify handling of unit dims across `vector.shape_cast`, `vector.broadcast`, `vector.extract/insert`, `vector.extract_element/insert_element` and `vector.transpose`, that would be the best end state.
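For illustration only (the `%x` and `%y` values are hypothetical), these are the kinds of unit-dim-only rewrites such unified handling would have to treat as equivalent:
```mlir
// %x : vector<4x1xf32>, %y : vector<4xf32> (hypothetical values).
// All three results hold the same four elements; only the unit dim differs.
%a = vector.shape_cast %x : vector<4x1xf32> to vector<1x4xf32>
%b = vector.transpose %x, [1, 0] : vector<4x1xf32> to vector<1x4xf32>
%c = vector.broadcast %y : vector<4xf32> to vector<1x4xf32>
```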
https://github.com/llvm/llvm-project/pull/72105