[llvm] [LV, VP]VP intrinsics support for the Loop Vectorizer + adding new tail-folding mode using EVL. (PR #76172)

Sun Mar 24 06:40:00 PDT 2024

================
@@ -1201,20 +1240,75 @@ void VPlanTransforms::addActiveLaneMask(
   // Walk users of WideCanonicalIV and replace all compares of the form
   // (ICMP_ULE, WideCanonicalIV, backedge-taken-count) with an
   // active-lane-mask.
-  VPValue *BTC = Plan.getOrCreateBackedgeTakenCount();
-  for (VPUser *U : SmallVector<VPUser *>(WideCanonicalIV->users())) {
-    auto *CompareToReplace = dyn_cast<VPInstruction>(U);
-    if (!CompareToReplace ||
-        CompareToReplace->getOpcode() != Instruction::ICmp ||
-        CompareToReplace->getPredicate() != CmpInst::ICMP_ULE ||
-        CompareToReplace->getOperand(1) != BTC)
-      continue;
+  replaceHeaderPredicateWithIdiom(Plan, *LaneMask);
+}
 
-    assert(CompareToReplace->getOperand(0) == WideCanonicalIV &&
-           "WidenCanonicalIV must be the first operand of the compare");
-    CompareToReplace->replaceAllUsesWith(LaneMask);
-    CompareToReplace->eraseFromParent();
+/// Add a VPEVLBasedIVPHIRecipe and related recipes to \p Plan and
+/// replaces all uses except the canonical IV increment of
+/// VPCanonicalIVPHIRecipe with a VPEVLBasedIVPHIRecipe. VPCanonicalIVPHIRecipe
+/// is used only for loop iterations counting after this transformation.
+///
+/// The function uses the following definitions:
+///  %StartV is the canonical induction start value.
+///
+/// The function adds the following recipes:
+///
+/// vector.ph:
+/// ...
+///
+/// vector.body:
+/// ...
+/// %P = EXPLICIT-VECTOR-LENGTH-BASED-IV-PHI [ %StartV, %vector.ph ], [
+/// %NextEVL, %vector.body ] %EVL = EXPLICIT-VECTOR-LENGTH %P, original TC
+/// ...
+/// %NextEVL = EXPLICIT-VECTOR-LENGTH + %P, %EVL
+/// ...
+///
+void VPlanTransforms::addExplicitVectorLength(VPlan &Plan) {
+  VPBasicBlock *Header = Plan.getVectorLoopRegion()->getEntryBasicBlock();
+  auto *CanonicalIVPHI = Plan.getCanonicalIV();
+  VPValue *StartV = CanonicalIVPHI->getStartValue();
+
+  // Walk users of WideCanonicalIV and replace all compares of the form
+  // (ICMP_ULE, WideCanonicalIV, backedge-taken-count) with an
+  // all-true-mask.
+  Value *TrueMask =
+      ConstantInt::getTrue(CanonicalIVPHI->getScalarType()->getContext());
+  VPValue *VPTrueMask = Plan.getVPValueOrAddLiveIn(TrueMask);
+  replaceHeaderPredicateWithIdiom(Plan, *VPTrueMask, [](VPUser &U, unsigned) {
+    return isa<VPWidenMemoryInstructionRecipe>(U);
+  });
+  // Now create the ExplicitVectorLengthPhi recipe in the main loop.
+  auto *EVLPhi = new VPEVLBasedIVPHIRecipe(StartV, DebugLoc());
+  EVLPhi->insertAfter(CanonicalIVPHI);
+  auto *VPEVL = new VPInstruction(VPInstruction::ExplicitVectorLength,
+                                  {EVLPhi, Plan.getTripCount()});
+  VPEVL->insertBefore(*Header, Header->getFirstNonPhi());
+
+  auto *CanonicalIVIncrement =
+      cast<VPInstruction>(CanonicalIVPHI->getBackedgeValue());
+  VPSingleDefRecipe *OpVPEVL = VPEVL;
+  if (CanonicalIVPHI->getScalarType() !=
+      IntegerType::get(CanonicalIVPHI->getScalarType()->getContext(),
+                       /*NumBits=*/32)) {
+    OpVPEVL = new VPScalarCastRecipe(Instruction::ZExt, OpVPEVL,
+                                     CanonicalIVPHI->getScalarType());
+    OpVPEVL->insertBefore(CanonicalIVIncrement);
   }
+  auto *NextEVLIV =
+      new VPInstruction(Instruction::Add, {OpVPEVL, EVLPhi},
+                        {CanonicalIVIncrement->hasNoUnsignedWrap(),
+                         CanonicalIVIncrement->hasNoSignedWrap()},
----------------
ayalz wrote:

There are two potential cases where the canonical IV increment may wrap: if the original TC=BTC+1 wraps, and if rounding-up TC to a multiple of VF*UF as part of tail folding wraps. When using EVL the computation BTC+1 must not wrap because EVL uses TC explicitly, and EVL avoids any rounding-up of TC. So does OpVPEVL+EVLPhi not wrap, regardless of CanonicalIVIncrement?
In any case, emitIterationCountCheck() seems to currently ensure that neither the canonical IV increment nor the EVL increment may wrap, when tail folding with EVL.

>> Why depend on CanonicalIVIncrement, can simply place the cast and the bump after VPEVL in the header?
>
> The increment is placed after CanonicalIVIncrement, so just want to keep ext close to next evl.

It would be good for EVL to be independent of CanonicalIVIncrement, e.g., allowing the latter to be introduced late in the VPlan-to-VPlan pipeline (see https://github.com/llvm/llvm-project/pull/82270).

https://github.com/llvm/llvm-project/pull/76172