[llvm] [VPlan] Optimize FindLast of (binop %IV, live-in) by sinking. (PR #183911)

Tue Mar 31 05:22:41 PDT 2026

================
@@ -5695,25 +5695,64 @@ void VPlanTransforms::addExitUsersForFirstOrderRecurrences(VPlan &Plan,
   }
 }
 
+/// Check if \p V is a binary expression of a widened IV and a loop-invariant
+/// value. Returns the widened IV if found, nullptr otherwise.
+static VPWidenIntOrFpInductionRecipe *getExpressionIV(VPValue *V) {
+  auto *BinOp = dyn_cast<VPWidenRecipe>(V);
+  if (!BinOp || !Instruction::isBinaryOp(BinOp->getOpcode()))
+    return nullptr;
+
+  VPValue *WidenIVCandidate = BinOp->getOperand(0);
+  VPValue *InvariantCandidate = BinOp->getOperand(1);
+  if (!isa<VPWidenIntOrFpInductionRecipe>(WidenIVCandidate))
+    std::swap(WidenIVCandidate, InvariantCandidate);
+
+  if (!InvariantCandidate->isDefinedOutsideLoopRegions())
+    return nullptr;
+
+  return dyn_cast<VPWidenIntOrFpInductionRecipe>(WidenIVCandidate);
+}
+
+/// Create a scalar version of \p BinOp and place it after \p ScalarIV's
+/// defining recipe, replacing \p WidenIV with \p ScalarIV.
+static VPValue *cloneBinOpForScalarIV(VPWidenRecipe *BinOp, VPValue *ScalarIV,
+                                      VPWidenIntOrFpInductionRecipe *WidenIV) {
+  assert(Instruction::isBinaryOp(BinOp->getOpcode()) &&
+         BinOp->getNumOperands() == 2 && "BinOp must have 2 operands");
+  auto *ClonedOp = BinOp->clone();
+  if (ClonedOp->getOperand(0) == WidenIV) {
+    ClonedOp->setOperand(0, ScalarIV);
+  } else {
+    assert(ClonedOp->getOperand(1) == WidenIV && "one operand must be WideIV");
+    ClonedOp->setOperand(1, ScalarIV);
+  }
+  ClonedOp->insertAfter(ScalarIV->getDefiningRecipe());
+  return ClonedOp;
+}
+
 void VPlanTransforms::optimizeFindIVReductions(VPlan &Plan,
                                                PredicatedScalarEvolution &PSE,
                                                Loop &L) {
   ScalarEvolution &SE = *PSE.getSE();
   VPRegionBlock *VectorLoopRegion = Plan.getVectorLoopRegion();
 
-  // Helper lambda to check if the IV range excludes the sentinel value.
-  auto CheckSentinel = [&SE](const SCEV *IVSCEV, bool UseMax,
-                             bool Signed) -> std::optional<APInt> {
+  // Helper lambda to check if the IV range excludes the sentinel value. Try
+  // signed first, then unsigned.
+  auto CheckSentinel =
+      [&SE](const SCEV *IVSCEV,
+            bool UseMax) -> std::optional<std::pair<APInt, bool>> {
     unsigned BW = IVSCEV->getType()->getScalarSizeInBits();
-    APInt Sentinel =
-        UseMax
-            ? (Signed ? APInt::getSignedMinValue(BW) : APInt::getMinValue(BW))
-            : (Signed ? APInt::getSignedMaxValue(BW) : APInt::getMaxValue(BW));
-
-    ConstantRange IVRange =
-        Signed ? SE.getSignedRange(IVSCEV) : SE.getUnsignedRange(IVSCEV);
-    if (!IVRange.contains(Sentinel))
-      return Sentinel;
+    for (bool Signed : {true, false}) {
+      APInt Sentinel = UseMax ? (Signed ? APInt::getSignedMinValue(BW)
+                                        : APInt::getMinValue(BW))
+                              : (Signed ? APInt::getSignedMaxValue(BW)
+                                        : APInt::getMaxValue(BW));
+
+      ConstantRange IVRange =
----------------
ayalz wrote:

Ah, this use of min/max is in the final reduction of partial LastIV's, one per lane, after the loop.
For this min/max to work, the IV recorded inside the loop must not wrap.
If the largest/smallest value works as a sentinel, then this IV doesn't wrap.
If the IV wraps yet its IVRange isn't full, one could instead use a "canonical" IV which starts at 1 and will not wrap (unsigned). This resembles using a binary expression "IV + offset" that turns a non-wrapping IV into a wrapping one, i.e., a case where a sentinel can be found for the original IV but not for the expression. It is also possible that such an expression turns a wrapping IV into a non-wrapping one. Perhaps such cases should best be canonicalized/normalized to use a "canonical" IV (rather than derivatives thereof) which starts at 1 (rather than 0) and bumps by 1 (rather than VFxUF). Such an IV never wraps, even if the original/vector IV may because VFxUF > 1. Derivative expressions would then be sunk out of the loop, potentially hoisting back "shrinking" expressions that may use an IV of smaller bitwidth, if possible and profitable.
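
To make the wrapping concern concrete, here is a minimal scalar model of the two strategies. It is only a sketch of one reading of the suggestion above, not VPlan code: `VF`, `findLastExprInLoop`, and `findLastExprSunk` are hypothetical names, lanes are simulated with `i % VF`, and the expression is assumed to be an 8-bit `IV + offset`.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr std::size_t VF = 4; // illustrative vector width (assumption)

// Strategy A (hypothetical): record the 8-bit expression (i + offset) in the
// loop and max-reduce the per-lane partial results afterwards. The sentinel 0
// (unsigned minimum) is only sound if the expression never wraps and never
// evaluates to 0.
std::optional<uint8_t> findLastExprInLoop(const std::vector<int> &data,
                                          int key, uint8_t offset) {
  std::array<uint8_t, VF> partial{}; // every lane starts at the sentinel 0
  for (std::size_t i = 0; i < data.size(); ++i)
    if (data[i] == key)
      partial[i % VF] = static_cast<uint8_t>(i + offset); // may wrap mod 256
  uint8_t best = *std::max_element(partial.begin(), partial.end());
  if (best == 0)
    return std::nullopt;
  return best;
}

// Strategy B (hypothetical): record a "canonical" IV that starts at 1 and
// steps by 1, so 0 is always free as a sentinel, then apply the expression
// once after the reduction. Assumes data.size() fits in 32 bits.
std::optional<uint8_t> findLastExprSunk(const std::vector<int> &data, int key,
                                        uint8_t offset) {
  std::array<uint32_t, VF> partial{}; // every lane starts at the sentinel 0
  for (std::size_t i = 0; i < data.size(); ++i)
    if (data[i] == key)
      partial[i % VF] = static_cast<uint32_t>(i) + 1; // canonical IV value
  uint32_t best = *std::max_element(partial.begin(), partial.end());
  if (best == 0)
    return std::nullopt;
  // Sunk expression: evaluated once, on the index of the last match.
  return static_cast<uint8_t>((best - 1) + offset);
}
```

With `data = {7,0,0,0,0,7,0,0,0,7}`, `key = 7` and `offset = 250`, the matches are at indices 0, 5 and 9, where the 8-bit expression evaluates to 250, 255 and 3. Strategy A returns 250 because the max reduction prefers the value recorded before the wrap, while strategy B returns 3, the expression's value at the last match.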

https://github.com/llvm/llvm-project/pull/183911