[Mlir-commits] [mlir] [mlir][ArmSME] Lower multi-tile stores to a single loop (PR #96187)

Fri Jun 21 08:23:04 PDT 2024

================
@@ -373,6 +374,130 @@ struct LegalizeTransferWriteOpsByDecomposition
   }
 };
 
+/// Legalize a multi-tile transfer_write as a single store loop. This is done as
+/// part of type decomposition as at this level we know each tile write is
+/// disjoint, but that information is lost after decomposition (without
+/// static analysis).
+///
+/// Example (in pseudo-MLIR):
+///
+/// ```
+/// vector.transfer_write vector, dest[x, y], mask
+///   : vector<[16]x[4]xf32>, memref<?x?xf32>
+/// ```
+/// Is rewritten to:
+/// ```
+/// for i in range (0, 4 * vscale) {
+///   let sliceRow = i + tile_n.row * vscale;              ─┐
+///   let sliceCol = tile_n.col * vscale;                   |
+///   slice = vector.extract tile_n[i]                      |
+///     : vector<[4]xf32> from vector<[16]x[4]xf32>         |
+///   slice_mask = vector.extract mask[sliceRow]            |- Repeated 4x for
+///     : vector<[4]xi1> from vector<[16]x[4]xi1>           |  all tiles in
+///   vector.transfer_write                                 |  [16]x[4]
+///     slice, dest[x + sliceRow, y + sliceCol], slice_mask |
+///     : vector<[4]xf32>, memref<?x?xf32>                  ┘
+/// }
----------------
banach-space wrote:

I'm finding this rather tricky to follow 😅 I think it would be easier if you:
* used e.g. `i16` instead of `f32` (so that there are 2 tiles)
* presented a full example rather than writing `Repeated 4x for all tiles`

Let me also share some more specific suggestions:
* `for i in range ()` -> `for %row_idx in range()` (i.e. avoid enigmatic `i`)
* `tile_n` -> `src_tile`? (what's `_n` meant to represent?)
* what's `tile_n.col` and `tile_n.row`?

IIUC, for `[16] x [4]` there are 4 vertically stacked tiles, so `tile_n.col` would always be 0?
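
To make the question concrete, here is a minimal Python sketch of how I read the tile offsets. It is not taken from the PR: `Tile`, `decompose` and `single_loop_stores` are hypothetical names. It assumes that `row`/`col` are offsets in unscaled elements (matching the `tile_n.row * vscale` in the comment) and that an SME tile is `[4]x[4]` for `f32` and `[8]x[8]` for `i16`.

```python
from typing import List, NamedTuple, Tuple


class Tile(NamedTuple):
    row: int  # row offset of the tile, in unscaled elements (assumption)
    col: int  # column offset of the tile, in unscaled elements (assumption)


def decompose(rows: int, cols: int, tile_rows: int, tile_cols: int) -> List[Tile]:
    """Split a [rows]x[cols] scalable vector into [tile_rows]x[tile_cols] tiles."""
    assert rows % tile_rows == 0 and cols % tile_cols == 0
    return [Tile(r, c)
            for r in range(0, rows, tile_rows)
            for c in range(0, cols, tile_cols)]


def single_loop_stores(tiles: List[Tile], tile_rows: int,
                       vscale: int) -> List[Tuple[int, int]]:
    """Return (sliceRow, sliceCol) for every slice store, in loop order."""
    stores = []
    for row_idx in range(tile_rows * vscale):  # the single store loop
        for tile in tiles:                     # body repeated once per tile
            stores.append((row_idx + tile.row * vscale, tile.col * vscale))
    return stores


# vector<[16]x[4]xf32>: four [4]x[4] tiles stacked vertically.
f32_tiles = decompose(16, 4, tile_rows=4, tile_cols=4)
# vector<[16]x[8]xi16>: two [8]x[8] tiles stacked vertically.
i16_tiles = decompose(16, 8, tile_rows=8, tile_cols=8)

for name, tiles in (("f32", f32_tiles), ("i16", i16_tiles)):
    print(name, [(t.row, t.col) for t in tiles])
```

Under these assumptions every tile has `col == 0` in both cases, and each loop iteration stores to a distinct row per tile, so the writes never overlap.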

https://github.com/llvm/llvm-project/pull/96187