[Mlir-commits] [mlir] [mlir][vector] Fix incorrect byte-alignment assumption in ConvertVectorStore (PR #189235)

Sun Mar 29 06:01:53 PDT 2026

https://github.com/joker-eph created https://github.com/llvm/llvm-project/pull/189235

When `ConvertVectorStore` emits the narrow-type emulation for a `vector.store` into a 2-D memref, it previously assumed that if the trailing dimension of the memref exactly matches the vector size (`trailingDimsMatch`), then the last-dimension index must be zero and no sub-byte alignment adjustment is needed. This assumption is wrong: a valid store such as

  vector.store %v, %src[%c0, %c1] : memref<3x4xi2>, vector<4xi2>

has a non-zero column index (%c1 == 1) even though trailingDim (4) equals the vector size (4). The incorrect shortcut caused the pattern to fall into the "aligned" path and emit a plain bitcast + store at byte offset 0, silently dropping elements [1], [2], [3] of the first byte and overwriting the wrong memory.

Fix: prefer `linearizedInfo.intraDataOffset` (which gives the exact sub-element offset for any constant-index store) and only fall back to the old `0` assumption when the indices are fully dynamic (i.e., `intraDataOffset` cannot be folded to a constant) **and** `isDivisibleInSize && trailingDimsMatch` still holds. This preserves the existing behaviour for dynamic-index stores while fixing the constant-index case.

Fixes #131528

Assisted-by: Claude Code

>From 32b2fcb2ddf6ba7fea559813f816e98854e1c970 Mon Sep 17 00:00:00 2001
From: Mehdi Amini <joker.eph at gmail.com>
Date: Sun, 29 Mar 2026 05:17:59 -0700
Subject: [PATCH] [mlir][vector] Fix incorrect byte-alignment assumption in
 ConvertVectorStore

When `ConvertVectorStore` emits the narrow-type emulation for a
`vector.store` into a 2-D memref, it previously assumed that if the
trailing dimension of the memref exactly matches the vector size
(`trailingDimsMatch`), then the last-dimension index must be zero and
no sub-byte alignment adjustment is needed. This assumption is wrong:
a valid store such as

  vector.store %v, %src[%c0, %c1] : memref<3x4xi2>, vector<4xi2>

has a non-zero column index (%c1 == 1) even though trailingDim (4)
equals the vector size (4). The incorrect shortcut caused the pattern
to fall into the "aligned" path and emit a plain bitcast + store at
byte offset 0, silently dropping elements [1], [2], [3] of the first
byte and overwriting the wrong memory.

Fix: prefer `linearizedInfo.intraDataOffset` (which gives the exact
sub-element offset for any constant-index store) and only fall back to
the old `0` assumption when the indices are fully dynamic (i.e.,
`intraDataOffset` cannot be folded to a constant) **and**
`isDivisibleInSize && trailingDimsMatch` still holds. This preserves
the existing behaviour for dynamic-index stores while fixing the
constant-index case.

A regression test with `memref<3x4xi2>` / `vector<4xi2>` at `[0, 1]`
is added to `vector-emulate-narrow-type-unaligned.mlir`; the fixed
output emits two `memref.generic_atomic_rmw` operations covering the
byte boundary correctly.

Fixes https://github.com/llvm/llvm-project/issues/131528

Assisted-by: Claude Code
---
 .../Transforms/VectorEmulateNarrowType.cpp    | 26 ++++++++++++-------
 .../vector-emulate-narrow-type-unaligned.mlir | 24 +++++++++++++++++
 2 files changed, 41 insertions(+), 9 deletions(-)

diff --git a/mlir/lib/Dialect/Vector/Transforms/VectorEmulateNarrowType.cpp b/mlir/lib/Dialect/Vector/Transforms/VectorEmulateNarrowType.cpp
index 583cda7ac2810..60161cdf15dfd 100644
--- a/mlir/lib/Dialect/Vector/Transforms/VectorEmulateNarrowType.cpp
+++ b/mlir/lib/Dialect/Vector/Transforms/VectorEmulateNarrowType.cpp
@@ -635,10 +635,13 @@ struct ConvertVectorStore final : OpConversionPattern<vector::StoreOp> {
       return success();
     }
 
-    // Do the trailing dim for source and destination match? If yes, then the
-    // corresponding index must be 0.
-    // FIXME: There's no way to tell for dynamic shapes, so we should bail out.
-    // However, that makes some tests fail, so we need to audit first.
+    // Do the trailing dim for source and destination match? If yes, and if the
+    // access indices are not all constant, then assume the last index is 0
+    // (byte-aligned). Note: for constant indices, the intraDataOffset computed
+    // below will give the exact value, so the trailingDimsMatch shortcut is
+    // not used in that case.
+    // FIXME: For dynamic indices where trailingDimsMatch, the assumption that
+    // the last index is 0 (byte-aligned) may be incorrect. See issue #131528.
     auto trailingDim = op.getBase().getType().getShape().back();
     bool trailingDimsMatch =
         ShapedType::isDynamic(trailingDim) || trailingDim == origElements;
@@ -646,8 +649,6 @@ struct ConvertVectorStore final : OpConversionPattern<vector::StoreOp> {
     auto stridedMetadata =
         memref::ExtractStridedMetadataOp::create(rewriter, loc, op.getBase());
 
-    // FIXME: ATM, we do not test cases where offsets, sizes, or strides are
-    // non-zero. As such, this is not needed.
     OpFoldResult linearizedIndices;
     memref::LinearizedMemRefInfo linearizedInfo;
     std::tie(linearizedInfo, linearizedIndices) =
@@ -658,10 +659,17 @@ struct ConvertVectorStore final : OpConversionPattern<vector::StoreOp> {
             stridedMetadata.getConstifiedMixedStrides(),
             getAsOpFoldResult(adaptor.getIndices()));
 
+    // Prefer the exact intraDataOffset when it can be folded (e.g. all-constant
+    // indices). Fall back to 0 only when the trailing dimension exactly matches
+    // the vector size (trailingDimsMatch), because in that case a dynamic last
+    // index implies byte-alignment (the caller is responsible for passing a
+    // valid, aligned index). If neither condition holds, bail out.
     std::optional<int64_t> foldedNumFrontPadElems =
-        (isDivisibleInSize && trailingDimsMatch)
-            ? 0
-            : getConstantIntValue(linearizedInfo.intraDataOffset);
+        getConstantIntValue(linearizedInfo.intraDataOffset);
+    if (!foldedNumFrontPadElems) {
+      if (isDivisibleInSize && trailingDimsMatch)
+        foldedNumFrontPadElems = 0;
+    }
 
     if (!foldedNumFrontPadElems) {
       return rewriter.notifyMatchFailure(
diff --git a/mlir/test/Dialect/Vector/vector-emulate-narrow-type-unaligned.mlir b/mlir/test/Dialect/Vector/vector-emulate-narrow-type-unaligned.mlir
index 21f073efc49b2..bec7736b90973 100644
--- a/mlir/test/Dialect/Vector/vector-emulate-narrow-type-unaligned.mlir
+++ b/mlir/test/Dialect/Vector/vector-emulate-narrow-type-unaligned.mlir
@@ -562,3 +562,27 @@ func.func @vector_store_i2_const_index_one_partial_store(%arg0: vector<1xi2>) {
 // CHECK: %[[BITCAST2:.+]] = vector.bitcast %[[SELECT]] : vector<4xi2> to vector<1xi8>
 // CHECK: %[[EXTRACT2:.+]] = vector.extract %[[BITCAST2]][0] : i8 from vector<1xi8>
 // CHECK: memref.atomic_yield %[[EXTRACT2]] : i8
+
+// -----
+
+// Regression test for https://github.com/llvm/llvm-project/issues/131528.
+// A vector.store with a non-zero constant column index on a 2D memref where the
+// trailing dimension matches the vector size must NOT be treated as
+// byte-aligned. Instead it must emit two partial (RMW) stores since the
+// 4-element i2 vector starting at column 1 crosses a byte boundary.
+func.func @vector_store_i2_2d_const_nonzero_col(%arg0: vector<4xi2>) {
+  %src = memref.alloc() : memref<3x4xi2>
+  %c0 = arith.constant 0 : index
+  %c1 = arith.constant 1 : index
+  vector.store %arg0, %src[%c0, %c1] : memref<3x4xi2>, vector<4xi2>
+  return
+}
+
+// CHECK-LABEL: func @vector_store_i2_2d_const_nonzero_col(
+// CHECK-SAME: %[[ARG0:.+]]: vector<4xi2>)
+// CHECK: %[[ALLOC:.+]] = memref.alloc() : memref<3xi8>
+// CHECK: %[[C0:.+]] = arith.constant 0 : index
+// Emits two partial atomic RMWs: one for byte 0 (elements at positions [1..3])
+// and one for byte 1 (element at position [0]).
+// CHECK: memref.generic_atomic_rmw %[[ALLOC]][%[[C0]]]
+// CHECK: memref.generic_atomic_rmw %[[ALLOC]][{{.+}}]