[Mlir-commits] [mlir] [MLIR][Vector] Add warp distribution for `vector.step` op (PR #155425)

Wed Aug 27 02:26:39 PDT 2025

================
@@ -705,6 +705,45 @@ struct WarpOpConstant : public WarpDistributionPattern {
   }
 };
 
+/// Sink out step op feeding into a warp op yield.
+/// Vector step op is treated similar to arith.constant, apart from
+/// the result that represents a sequence [0, vec_size).
+/// The sequence is semantically equivalent to warp's threads/lanes indices.
+/// ```
+/// %0 = gpu.warp_execute_on_lane_0(%arg0) -> (vector<1xindex>) {
+///   ...
+///   %cst = vector.step : vector<32xindex>
+///   gpu.yield %cst : vector<1xindex>
+/// }
+/// ```
+/// To
+/// ```
+/// gpu.warp_execute_on_lane_0(%arg0) {
+///   ...
+/// }
+/// %lane_id_vec = vector.broadcast %arg0 : index to vector<1xindex>
+struct WarpOpStep final : public WarpDistributionPattern {
+  using Base::Base;
+  LogicalResult matchAndRewrite(WarpExecuteOnLane0Op warpOp,
+                                PatternRewriter &rewriter) const override {
+    OpOperand *yieldOperand =
+        getWarpResult(warpOp, llvm::IsaPred<vector::StepOp>);
+    if (!yieldOperand)
+      return failure();
+    auto stepOp = yieldOperand->get().getDefiningOp<vector::StepOp>();
+    VectorType resTy = stepOp.getResult().getType();
+    rewriter.startOpModification(warpOp);
+    rewriter.setInsertionPointAfter(warpOp);
+    Value laneIdVec = vector::BroadcastOp::create(
----------------
akroviakov wrote:

> supporting multiples of sg size may need the layout of the output vector.

I think supporting the step op for multiples of the warp size will be a bit tricky.
For example, a 64-element `vector.step` is divided into 2 elements per lane, so each lane ends up with a `[0,1]` result, and additional math (likely more complex than a plain `+ lane_id`) is needed to recover the original sequence. In that setting, the per-lane `[0,1]` step result carries no useful information and can be omitted altogether.

What benefit do we get from preserving the step op itself (especially in this PR) if its result is likely to be discarded or folded away by the actual sequence distribution logic?

For this PR, with the vector size restricted to the SG size, we can cleanly fold the step op into a lane_id broadcast `index -> vector<1xindex>`. We can skip speculating on how best to serve the not-yet-existing distribution for multiples of the SG size, since whatever we pick now is likely to be overwritten anyway.
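
To make the two cases concrete, here is a rough host-side sketch of which step values a lane would own. It is not code from the PR: `laneSlice` and the `Layout` enum are hypothetical names, and the two layouts (contiguous and strided) are only assumptions about how a larger vector might be distributed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical layouts, for illustration only; the PR does not define them.
enum class Layout { Contiguous, Strided };

// Values of `vector.step : vector<vecSize x index>` that lane `laneId` would
// own if the vector were distributed across `warpSize` lanes.
std::vector<int64_t> laneSlice(int64_t laneId, int64_t warpSize,
                               int64_t vecSize, Layout layout) {
  assert(vecSize % warpSize == 0 && "vector must be a multiple of warp size");
  assert(laneId >= 0 && laneId < warpSize && "lane id out of range");
  int64_t perLane = vecSize / warpSize;
  std::vector<int64_t> slice(perLane);
  for (int64_t i = 0; i < perLane; ++i) {
    // `i` is all that a naive per-lane `vector.step` of size perLane yields;
    // the rest of the expression is the extra, layout-dependent math.
    slice[i] = layout == Layout::Contiguous ? laneId * perLane + i
                                            : laneId + i * warpSize;
  }
  return slice;
}
```

With `vecSize == warpSize` both layouts collapse to the single value `laneId`, which is the broadcast case handled in this PR. With 64 elements over 32 lanes, lane 5 would own `[10, 11]` under the contiguous assumption and `[5, 37]` under the strided one, so the required math depends on a layout that is not available yet.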

What do you think? @charithaintc @adam-smnk 

https://github.com/llvm/llvm-project/pull/155425