[Mlir-commits] [mlir] [mlir][ArmSME] Support vertical layout in load and store ops (PR #66758)

Cullen Rhodes llvmlistbot at llvm.org
Wed Sep 20 03:06:16 PDT 2023


c-rhodes wrote:

> Quite a large change, but very well executed and makes a lot of sense. Btw, what logic would be deciding to generate vertical instead of horizontal loads/stores?

Thanks for reviewing. For `vector.load` and `vector.store` I don't believe there is any way to express this, but perhaps a canonicalization could be added that replaces a transpose of a load/store with a load/store in the opposite direction.
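
For illustration, such a canonicalization would match IR along these lines (a sketch only, using standard `vector` ops; `%mem` and `%c0` are just placeholders):

```
// A transpose of a (horizontal) load like this could be rewritten as a
// single load in the vertical direction, and similarly for stores.
%0 = vector.load %mem[%c0, %c0] : memref<?x?xf32>, vector<4x4xf32>
%1 = vector.transpose %0, [1, 0] : vector<4x4xf32> to vector<4x4xf32>
```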

`vector.transfer_read` and `vector.transfer_write`, however, take an affine map that can express a transpose. Here's an example I just created based on `mlir/test/Integration/Dialect/Vector/CPU/test-transfer-to-loops.mlir`:

```
#transpose_map = affine_map<(d0, d1) -> (d1, d0)>

func.func private @printMemrefF32(memref<*xf32>)

func.func @alloc_2d_filled_f32(%arg0: index, %arg1: index) -> memref<?x?xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %0 = memref.alloc(%arg0, %arg1) : memref<?x?xf32>
  scf.for %arg5 = %c0 to %arg0 step %c1 {
    scf.for %arg6 = %c0 to %arg1 step %c1 {
      %tmp2 = arith.index_cast %arg6: index to i32
      %tmp3 = arith.sitofp %tmp2 : i32 to f32
      memref.store %tmp3, %0[%arg5, %arg6] : memref<?x?xf32>
    }
  }
  return %0 : memref<?x?xf32>
}

func.func @main() {
  %c0 = arith.constant 0 : index
  %c4 = arith.constant 4 : index
  %cst = arith.constant -4.2e+01 : f32

  %0 = call @alloc_2d_filled_f32(%c4, %c4) : (index, index) -> memref<?x?xf32>
  %converted = memref.cast %0 : memref<?x?xf32> to memref<*xf32>
  call @printMemrefF32(%converted): (memref<*xf32>) -> ()

  %1 = vector.transfer_read %0[%c0, %c0], %cst {permutation_map = #transpose_map} : memref<?x?xf32>, vector<4x4xf32>
  vector.transfer_write %1, %0[%c0, %c0] : vector<4x4xf32>, memref<?x?xf32>
  call @printMemrefF32(%converted): (memref<*xf32>) -> ()

  memref.dealloc %0 : memref<?x?xf32>
  return
}
```


run:

```
build/bin/mlir-opt vector_transfer_vertical.mlir \
  -pass-pipeline="builtin.module(func.func(convert-vector-to-scf,lower-affine,convert-scf-to-cf),convert-vector-to-llvm,finalize-memref-to-llvm,convert-func-to-llvm,reconcile-unrealized-casts)" | \
  /home/culrho01/llvm-project/build/bin/mlir-cpu-runner -e main -entry-point-result=void \
    -shared-libs=/home/culrho01/llvm-project/build/lib/libmlir_runner_utils.so,/home/culrho01/llvm-project/build/lib/libmlir_c_runner_utils.so
Unranked Memref base@ = 0x216f7190 rank = 2 offset = 0 sizes = [4, 4] strides = [4, 1] data =
[[0,   1,   2,   3],
 [0,   1,   2,   3],
 [0,   1,   2,   3],
 [0,   1,   2,   3]]
Unranked Memref base@ = 0x216f7190 rank = 2 offset = 0 sizes = [4, 4] strides = [4, 1] data =
[[0,   0,   0,   0],
 [1,   1,   1,   1],
 [2,   2,   2,   2],
 [3,   3,   3,   3]]
```

We can extend the lowering to SME to support this.
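
For instance, with the layout attribute this PR adds, a transposing `transfer_read` could lower to something roughly like the following (a sketch only, I haven't wired this up; `%mem` and `%c0` are placeholders, and the SME path operates on scalable vectors):

```
// An f32 SME tile is vector<[4]x[4]xf32> (scalable in both dims); a vertical
// tile load reads the tile slices column-wise instead of row-wise.
%tile = arm_sme.tile_load %mem[%c0, %c0] layout<vertical> : memref<?x?xf32>, vector<[4]x[4]xf32>
```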

https://github.com/llvm/llvm-project/pull/66758

