[Mlir-commits] [mlir] [mlir][vector] Document `ConvertVectorStore` + unify var names (nfc) (PR #126422)

Fri Feb 14 01:34:16 PST 2025

================
@@ -432,7 +432,86 @@ namespace {
 // ConvertVectorStore
 //===----------------------------------------------------------------------===//
 
-// TODO: Document-me
+// Emulate vector.store using a multi-byte container type
+//
+// The container type is obtained through the Op adaptor and would normally be
+// generated via `NarrowTypeEmulationConverter`.
+//
+// EXAMPLE 1
+// (aligned store of i4, emulated using i8)
+//
+//      vector.store %src, %dest[%idx_1, %idx_2] : memref<4x8xi4>, vector<8xi4>
+//
+// is rewritten as:
+//
+//      %src_bitcast = vector.bitcast %src : vector<8xi4> to vector<4xi8>
+//      vector.store %src_bitcast, %dest_bitcast[%idx]
+//        : memref<16xi8>, vector<4xi8>
+//
+// EXAMPLE 2
+// (unaligned store of i2, emulated using i8, non-atomic)
+//
+//    vector.store %src, %dest[%c2, %c0] : memref<3x3xi2>, vector<3xi2>
+//
+// The i2 store is emulated through 2 x RMW sequences. The destination i2 memref
+// is modelled using 3 bytes:
+//
+//    Byte 0     Byte 1     Byte 2
+// +----------+----------+----------+
+// | oooooooo | ooooNNNN | NNoooooo |
+// +----------+----------+----------+
+//
+// N - (N)ew entries (i.e. to be overwritten by vector.store)
+// o - (o)ld entries (to be preserved)
+//
+// The following 2 RMW sequences will be generated:
+//
+//    %init = arith.constant dense<0> : vector<4xi2>
+//
+//    (RMW sequence for Byte 1)
+//    (Mask for 4 x i2 elements, i.e. a byte)
+//    %mask_1 = arith.constant dense<[false, false, true, true]>
+//    %src_slice_1 = vector.extract_strided_slice %src
+//      {offsets = [0], sizes = [2], strides = [1]}
+//      : vector<3xi2> to vector<2xi2>
+//    %init_with_slice_1 = vector.insert_strided_slice %src_slice_1, %init
+//      {offsets = [2], strides = [1]}
+//      : vector<2xi2> into vector<4xi2>
+//    %dest_byte_1 = vector.load %dest[%c1]
+//    %dest_byte_1_as_i2 = vector.bitcast %dest_byte_1
+//      : vector<1xi8> to vector<4xi2>
+//    %res_byte_1 = arith.select %mask_1, %init_with_slice_1, %dest_byte_1_as_i2
+//    %res_byte_1_as_i8 = vector.bitcast %res_byte_1
+//    vector.store %res_byte_1_as_i8, %dest[%c1]
+
+//    (RMW sequence for Byte 2)
+//    (Mask for 4 x i2 elements, i.e. a byte)
+//    %mask_2 = arith.constant dense<[true, false, false, false]>
+//    %src_slice_2 = vector.extract_strided_slice %src
+//      {offsets = [2], sizes = [1], strides = [1]}
+//      : vector<3xi2> to vector<1xi2>
+//    %init_with_slice_2 = vector.insert_strided_slice %src_slice_2, %init
+//      {offsets = [0], strides = [1]}
+//      : vector<1xi2> into vector<4xi2>
+//    %dest_byte_2 = vector.load %dest[%c2]
----------------
banach-space wrote:

Thanks for the suggestion! I was wondering how to avoid this long comment and your suggestion is exactly what we should be doing! 🙏🏻 

As this example is taken from "vector-emulate-narrow-type-unaligned-non-atomic.mlir", that's the test file that I've updated to help here. Please check the latest update.

Note, I've made quite a few changes:
* Extended comments.
* Fixed `DOWNCAST` vs `UPCAST`.
* Renamed some variables to avoid generic names (e.g. `%arg0` -> `%src`, `%0` -> `%dest`).
* Added more `CHECK-LINES`, e.g. `// CHECK-SAME:    : vector<1xi8> to vector<4xi2>` to make sure that the right casting is generated.
* Followed formatting style from [vectorize-convolution.mlir](https://github.com/llvm/llvm-project/blob/main/mlir/test/Dialect/Linalg/vectorize-convolution.mlir). IMHO it's a very "readable" style that's particularly handy for complex tests like these ones.

I appreciate that these are quite intrusive changes, but since it's meant as documentation, it felt like the right thing to do. But I am happy to adapt/revert if you feel that this is too much.
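
For anyone who wants to sanity-check the byte layout and masks in the quoted comment, here is a minimal standalone C++ sketch. It is not code from this PR: `ByteMask` and `computeStoreMasks` are hypothetical names, and the sketch assumes an i8 container, an element width that divides 8, and a row-major linearized element index. It only computes which container bytes a store touches and which narrow elements in each byte get overwritten.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type (not part of MLIR): describes which narrow
// elements of one container byte are overwritten by a store.
struct ByteMask {
  std::size_t byteIndex;  // index into the emulated i8 memref
  std::vector<bool> mask; // one flag per narrow element in that byte
};

// Computes one mask per container byte touched when storing `numElems`
// elements of `elemBits` bits, starting at linearized element index
// `linearIdx`. Assumption: 8-bit containers and `elemBits` divides 8.
std::vector<ByteMask> computeStoreMasks(std::size_t linearIdx,
                                        std::size_t numElems,
                                        std::size_t elemBits) {
  const std::size_t elemsPerByte = 8 / elemBits;
  std::vector<ByteMask> result;
  if (numElems == 0)
    return result;

  const std::size_t firstByte = linearIdx / elemsPerByte;
  const std::size_t lastByte = (linearIdx + numElems - 1) / elemsPerByte;
  for (std::size_t b = firstByte; b <= lastByte; ++b) {
    ByteMask bm{b, std::vector<bool>(elemsPerByte, false)};
    for (std::size_t e = 0; e < elemsPerByte; ++e) {
      // Global index of the e-th narrow element stored in byte `b`.
      const std::size_t elem = b * elemsPerByte + e;
      bm.mask[e] = elem >= linearIdx && elem < linearIdx + numElems;
    }
    result.push_back(bm);
  }
  return result;
}
```

For Example 2 (`memref<3x3xi2>`, store of `vector<3xi2>` at `[2, 0]`), the linearized index is 2 * 3 + 0 = 6, so `computeStoreMasks(6, 3, 2)` yields byte 1 with `[false, false, true, true]` and byte 2 with `[true, false, false, false]`, matching `%mask_1` and `%mask_2`. In the aligned case of Example 1, every mask is all-true, which is why no RMW sequence is needed there.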

Thanks for reviewing!

https://github.com/llvm/llvm-project/pull/126422