[Mlir-commits] [mlir] [mlir][linalg] Add support for masked vectorization of `tensor.insert_slice` (1/N) (PR #122927)

Andrzej Warzyński llvmlistbot at llvm.org
Thu Jan 23 03:54:08 PST 2025


================
@@ -59,6 +59,38 @@ vectorizeConvolution(RewriterBase &rewriter, LinalgOp convOp,
                      ArrayRef<bool> inputVecScalableFlags = {},
                      bool flatten1DDepthwiseConv = false);
 
+/// Vectorize tensor::InsertSliceOp with:
+///   * vector::TransferReadOp + vector::TransferWriteOp
+/// The vector sizes are either:
+///   * user-provided in `inputVectorSizes`, or
+///   * inferred from the static dims in the input and output tensors.
+/// Bails out if:
+///   * vector sizes are not user-provided, and
+///   * at least one dim is dynamic (in both the input and output tensors),
+///   bails out.
+///
+/// Before:
+///     !t_in_type = tensor<1x2x3xf32>
+///     !t_out_type = tensor<9x8x7x1x2x3xf32>
+///     !v_type = vector<1x2x3xf32>
+///     %inserted_slice = tensor.insert_slice %src into %dest ... : !t_in_type
+///     into !t_out_type
+/// After:
+///     %read = vector.transfer_read %src[...], %pad ... : !t_in_type, !v_type
+///     %write = vector.transfer_write %read, %dest ... : !v_type, !t_out_type
----------------
banach-space wrote:

@hanhanW , these are great questions - those things made me wonder as well at some point :) 

> I can't follow why it becomes vector.transfer_read + write pair now. Can't it be a single vector.transfer_write %src, %dest ....?

The most intuitive way to see this is by noting that:
*  `tensor.insert_slice` operates on tensors (i.e. the Op requires **two tensors**), whereas 
* `vector.transfer_write` requires a vector to write  (i.e. the Op requires **one tensor** + **one vector**). 

Now, here's an important question:
*  where is the vector (that `vector.transfer_write` would "write") meant to come from? 

Ah! From `vector.transfer_read` 😅 

So, going back to the example from the comment, note that with `tensor.insert_slice`, all operands are of `TensorType`. The source is `!t_in_type`:
```cpp
///     %inserted_slice = tensor.insert_slice %src into %dest ... : !t_in_type
```
However, in order to operate on vectors, we need to ... switch to `VectorType`. That's `!v_type`, the type of the value being read/written:
```mlir
// Read from the "source" tensor into a vector. Now we are in the "land of vectors" :)
%read = vector.transfer_read %src[...], %pad ... : !t_in_type, !v_type
// Write the vector to the destination tensor, i.e. switch back to the "land of tensors" :)
%write = vector.transfer_write %read, %dest ... : !v_type, !t_out_type
```

I think that this example also demonstrates quite nicely that indeed `vector.transfer_read` and `vector.transfer_write` are **boundary Ops** (as in, boundary between `TensorType` and `VectorType`).
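To make that concrete, here's a self-contained sketch with the type aliases inlined. The function name, the zero indices, the `in_bounds` flags and the pad value are illustrative assumptions; the actual lowering derives the write indices from the slice's offsets:
```mlir
// A minimal sketch, assuming static shapes, zero offsets and a zero pad value.
func.func @insert_slice_as_read_write(%src: tensor<1x2x3xf32>,
                                      %dest: tensor<9x8x7x1x2x3xf32>)
    -> tensor<9x8x7x1x2x3xf32> {
  %c0 = arith.constant 0 : index
  %pad = arith.constant 0.0 : f32
  // Tensor -> vector: enter the "land of vectors".
  %read = vector.transfer_read %src[%c0, %c0, %c0], %pad
    {in_bounds = [true, true, true]} : tensor<1x2x3xf32>, vector<1x2x3xf32>
  // Vector -> tensor: write the vector back into the destination tensor.
  %write = vector.transfer_write %read, %dest[%c0, %c0, %c0, %c0, %c0, %c0]
    {in_bounds = [true, true, true]} : vector<1x2x3xf32>, tensor<9x8x7x1x2x3xf32>
  return %write : tensor<9x8x7x1x2x3xf32>
}
```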

I hope that this makes sense. Clearly the comment could benefit from an update.

--- 

**2nd thread**

> My understanding is that the inputVectorSizes is somewhat tied to iteration domain, which is related to their TilingInterface implementation. 

That's one option, yes. But what if you're attempting to vectorize something that hasn't been tiled? See e.g. the `tensor.insert_slice` examples added in this PR.

When building a full compilation flow, you'd want the tile sizes and vector sizes to match, yes. But that's not a requirement. 

> However, it looks like there are two masks in the vectorization result (which I don't understand now).

Note that this patch adds only one mask. The 2nd mask is added here:
* https://github.com/llvm/llvm-project/pull/123031

(trying to split my PRs into smaller patches to make reviewing easier).

>  I wonder this inputVectorSizes is for read mask or write mask, and the reason behind the decision.

In fact, it's tied to `tensor.insert_slice` itself, so one mask is used for both. The key assumption is that the underlying `tensor.insert_slice` Op is **valid**. Given that, if we calculate the mask based on the sizes of the value to be stored:
```cpp
// Taken from the newly added mask calculation in vectorizeAsInsertSliceOp
LogicalResult status =
    cast<ReifyRankedShapedTypeOpInterface>(srcDefOp).reifyResultShapes(
        rewriter, reifiedSrcSizes);
// (...)
Value maskOp = rewriter.create<vector::CreateMaskOp>(
    sliceOp.getLoc(), readMaskType, reifiedSrcSizes[0]);
```
then it should be safe to use it for both `vector.transfer_read` and `vector.transfer_write`.
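In IR terms, the result looks roughly like this (a hedged sketch that also shows the write-side mask from the follow-up PR; `%sz0`/`%sz1`/`%sz2` stand for the reified source sizes, and the zero indices and `%src`/`%dest` types are illustrative, with `%src` dynamically shaped since that's when masking matters):
```mlir
%mask = vector.create_mask %sz0, %sz1, %sz2 : vector<1x2x3xi1>
// Masked read: lanes beyond the reified source sizes are not read.
%read = vector.mask %mask {
  vector.transfer_read %src[%c0, %c0, %c0], %pad
    : tensor<?x?x?xf32>, vector<1x2x3xf32>
} : vector<1x2x3xi1> -> vector<1x2x3xf32>
// Masked write: the very same lanes are written into the destination.
%write = vector.mask %mask {
  vector.transfer_write %read, %dest[%c0, %c0, %c0, %c0, %c0, %c0]
    : vector<1x2x3xf32>, tensor<9x8x7x1x2x3xf32>
} : vector<1x2x3xi1> -> tensor<9x8x7x1x2x3xf32>
```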

> we can infer the vector sizes from the tiling artifacts IR (i.e., affine_min/apply ops) automatically

Going back to my earlier point: in theory, vector sizes can indeed be inferred from various artefacts, and that's what we should be aiming for. However, the vectorizer should work independently of those artefacts, right? Put differently:
* If the vector sizes can be inferred, let's use that.
* If not, we need a fallback, i.e. the vectorizer should be happy with any arbitrary (within reason) sizes - see the sketch below.
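For example, here's a sketch of how one might drive this with user-provided sizes via the transform dialect (the sizes here are arbitrary, just to illustrate the fallback path):
```mlir
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg0: !transform.any_op {transform.readonly}) {
    %0 = transform.structured.match ops{["tensor.insert_slice"]} in %arg0
      : (!transform.any_op) -> !transform.any_op
    // Arbitrary, user-provided vector sizes - no tiling artefacts required.
    transform.structured.vectorize %0 vector_sizes [8, 16, 4] : !transform.any_op
    transform.yield
  }
}
```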

--- 
I hope that this makes sense 😅 Btw, perhaps it would help if I moved the mask calculation to a separate patch?

-Andrzej

https://github.com/llvm/llvm-project/pull/122927

