[Mlir-commits] [mlir] [mlir][vector] Fix for WarpOpScfForOp failure when scf.for has results that are unused. (PR #141853)

Fri May 30 07:29:53 PDT 2025

================
@@ -1554,22 +1554,37 @@ struct WarpOpScfForOp : public WarpDistributionPattern {
     llvm::SmallSetVector<Value, 32> escapingValues;
     SmallVector<Type> inputTypes;
     SmallVector<Type> distTypes;
+    auto collectEscapingValues = [&](Value value) {
+      if (!escapingValues.insert(value))
+        return;
+      Type distType = value.getType();
+      if (auto vecType = dyn_cast<VectorType>(distType)) {
+        AffineMap map = distributionMapFn(value);
+        distType = getDistributedType(vecType, map, warpOp.getWarpSize());
+      }
+      inputTypes.push_back(value.getType());
+      distTypes.push_back(distType);
+    };
+
     mlir::visitUsedValuesDefinedAbove(
         forOp.getBodyRegion(), [&](OpOperand *operand) {
           Operation *parent = operand->get().getParentRegion()->getParentOp();
           if (warpOp->isAncestor(parent)) {
-            if (!escapingValues.insert(operand->get()))
-              return;
-            Type distType = operand->get().getType();
-            if (auto vecType = dyn_cast<VectorType>(distType)) {
-              AffineMap map = distributionMapFn(operand->get());
-              distType = getDistributedType(vecType, map, warpOp.getWarpSize());
-            }
-            inputTypes.push_back(operand->get().getType());
-            distTypes.push_back(distType);
+            collectEscapingValues(operand->get());
           }
         });
 
+    // Any forOp result that is not already yielded by the warpOp
+    // region is also considered escaping and must be returned by the
+    // original warpOp.
+    for (OpResult forResult : forOp.getResults()) {
+      // Check if this forResult is already yielded by the yield op.
+      if (llvm::is_contained(yield->getOperands(), forResult)) {
+        continue;
+      }
+      collectEscapingValues(forResult);
+    }
+
----------------
kurapov-peter wrote:

Ok, I had to play with it for a bit to understand how it works. I think adding results of the `ForOp` to the yielded values as the first step is pretty confusing since the resulting warp op after transformation should not have those. I've modified your lit to make an example that would probably be helpful to others to understand it better.
Initial IR:
```mlir
func.func @warp_scf_for_unused_yield(%arg0: index) {
  %c128 = arith.constant 128 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %0:2 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<4xf32>, vector<4xf32>) {
    %ini = "some_def"() : () -> (vector<128xf32>)
    %ini1 = "some_def"() : () -> (vector<128xf32>)
    %other = "other_def"() : () -> (vector<128xf32>)
    %3:3 = scf.for %arg3 = %c0 to %c128 step %c1 iter_args(%arg4 = %ini, %arg5 = %ini1, %other_2 = %other) -> (vector<128xf32>, vector<128xf32>, vector<128xf32>) {
      %add = arith.addi %arg3, %c1 : index
      %1  = "some_def"(%arg5, %add) : (vector<128xf32>, index) -> (vector<128xf32>)
      %acc = "some_def"(%add, %arg4, %1) : (index, vector<128xf32>, vector<128xf32>) -> (vector<128xf32>)
      %other2 = "other2_def"(%arg4) : (vector<128xf32>) -> (vector<128xf32>)
      scf.yield %acc, %1, %other2 : vector<128xf32>, vector<128xf32>, vector<128xf32>
    }
    gpu.yield %3#0, %other : vector<128xf32>, vector<128xf32>
  }
  "some_use"(%0#0) : (vector<4xf32>) -> ()
  "other_use"(%0#1) : (vector<4xf32>) -> ()
  return
}
```

Right after the `newWarpOp` creation the warp op looks like:
```mlir
%0:4 = gpu.warp_execute_on_lane_0(%arg0)[32] -> (vector<4xf32>, vector<4xf32>, vector<4xf32>, vector<4xf32>) {
  %1 = "some_def"() : () -> vector<128xf32>
  %2 = "some_def"() : () -> vector<128xf32>
  %3 = "other_def"() : () -> vector<128xf32>
  %4:3 = scf.for %arg1 = %c0 to %c128 step %c1 iter_args(%arg2 = %1, %arg3 = %2, %arg4 = %3) -> (vector<128xf32>, vector<128xf32>, vector<128xf32>) {
    %5 = arith.addi %arg1, %c1 : index
    %6 = "some_def"(%arg3, %5) : (vector<128xf32>, index) -> vector<128xf32>
    %7 = "some_def"(%5, %arg2, %6) : (index, vector<128xf32>, vector<128xf32>) -> vector<128xf32>
    %8 = "other2_def"(%arg2) : (vector<128xf32>) -> vector<128xf32>
    scf.yield %7, %6, %8 : vector<128xf32>, vector<128xf32>, vector<128xf32>
  }
  gpu.yield %4#0, %3, %4#1, %4#2 : vector<128xf32>, vector<128xf32>, vector<128xf32>, vector<128xf32>
}
```
The results of the `ForOp` are now returned even though the op itself will be sinked through `gpu.yield`. These are patched up later again to produce the valid result. This is a bit unexpected after you read the commend for the transformation (and the reason it caught my attention).

I think this is somewhat brittle. If it is done to preserve some existing code I'd reconsider in favor of simplicity.

https://github.com/llvm/llvm-project/pull/141853