[PATCH] D99750: [LV, VP]VP intrinsics support for the Loop Vectorizer

Mon Oct 16 03:33:19 PDT 2023

fhahn added a comment.

Thanks for the latest update! It looks like some of the unit tests (`check-llvm-unit`) are failing (maybe related to the verifier changes?), would be good to check and update them.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5732

+  // Do not interleave if VP intrinsics are preferred and no User IC is
+  // specified.
----------------
nit: if EVL is preferred?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9619

+  auto MaskValue = [&](unsigned Part, ElementCount EC) -> Value * {
+    if (isMaskRequired)
----------------
nit: `EC` unused?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9653
+        if (Value *EVLPart = State.EVL ? State.get(State.EVL, Part) : nullptr) {
+          // if EVLPart is not null, we can vectorize using predicated
+          // intrinsic.
----------------
nit: `If EVLPart`

================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.h:2144
+  VPExplicitVectorLengthExitPHIRecipe(VPValue *StartMask, DebugLoc DL)
+      : VPHeaderPHIRecipe(VPDef::VPActiveLaneMaskPHISC, nullptr, StartMask,
+                          DL) {}
----------------
`VPExplicitVectorLengthExitPHISC`?

================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.h:2155
+
+  /// Generate the active lane mask phi of the vector loop.
+  void execute(VPTransformState &State) override;
----------------
nit: active lane mask phi?

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp:1725
+
+// TODO: It would be good to use the existing VPWidenPHIRecipe instead and
+// remove VPExplicitvectorLengthPHIRecipe.
----------------
nit: VPWidenPHIRecipe implies vector phi, but this is a scalar phi.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:966
+/// the given idion \p Idiom.
+static void replaceExitConditionWithIdiom(VPlan &Plan, VPValue &Idiom) {
+  auto *FoundWidenCanonicalIVUser =
----------------
I think using `Exit` condition may be confusing, it replaces the predicate for the header mask.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:1026
+void VPlanTransforms::addVectorPredication(VPlan &Plan) {
+  VPBasicBlock *Entry = Plan.getVectorLoopRegion()->getEntryBasicBlock();
+  // Now create the ExplicitVectorLengthExitPhi recipe in the main loop using
----------------
nit: `Entry->Header`

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:1028
+  // Now create the ExplicitVectorLengthExitPhi recipe in the main loop using
+  // the preheader ActiveLaneMask instruction.
+  auto *EVLExitPhi = new VPExplicitVectorLengthExitPHIRecipe(
----------------
How ActiveLaneMask factor in here?

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:1037
+  VPBasicBlock *EB = Plan.getVectorLoopRegion()->getExitingBasicBlock();
+  VPRecipeBase *Term = EB->getTerminator();
+  auto EndIter = Term ? Term->getIterator() : EB->end();
----------------
Would be good to describe here what recipes are added and what's changed.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:1039
+  auto EndIter = Term ? Term->getIterator() : EB->end();
+  for (VPRecipeBase &Ingredient :
+       make_early_inc_range(make_range(EB->begin(), EndIter))) {
----------------
Simpler to first get the canonical IV via `getCanonicalIV` and use it to access its increment?

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:1055
+    EVLExitPhi->addOperand(I);
+    I->setOperand(0, EVLExitPhi->getVPSingleValue());
   }
----------------
IIUC this introduces `EVLExitPhi` and uses it for the canonical IV increment only, which controls the exit and adjust the canonical IV. Would it be possible to do it the other way around, i.e. keep the canonical induction incremented by the canonical IV increment (thus keeping it canonical), and instead have `EVLExitPhi` updated by `ExplicitVectorLengthIVIncrement`, and updating the relevant users (all except the canonical increment?) to use that one?

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:832
+      continue;
+    auto *NewInst =
+        new VPInstruction(VPInstruction::ExplicitVectorLengthIVIncrement,
----------------
craig.topper wrote:
> ABataev wrote:
> > fhahn wrote:
> > > ABataev wrote:
> > > > fhahn wrote:
> > > > > ABataev wrote:
> > > > > > fhahn wrote:
> > > > > > > ABataev wrote:
> > > > > > > > fhahn wrote:
> > > > > > > > > I think turning the step of the canonical induction non-loop-invariant technically turns the canonical IV into a phi that's not a canonical IV any more (which is guaranteed to step the same amount each iteration). Would it work to keep the increment unchanged and keep rounding up the trip count was with regular tail folding initially? Further down the line, the canonical IV issue may be resolved by also replacing the canonical IV node with a regular scalar phi when doing the replacement here.
> > > > > > > > I'll try to improve this.
> > > > > > > Did you get a chance to try this out yet? 
> > > > > > > 
> > > > > > > 97687b7aea17 landed, it would probably be good to also remove the header mask from load/store recipes here, to make clear that this optimizes the tail-folded loop?
> > > > > > Already did. The loop is countable, adding a new phi won't give anything, just some extra work without any effect.
> > > > > Oh right I missed that, sorry! 
> > > > > 
> > > > > Does the latest version actually have to update the canonical IV increment? 
> > > > > 
> > > > > I might be missing something, but shouldn't the exit condition now use the rounded up version (a multiple of the VF) of the trip count for the compare, so if we increment by EVL then the we might not reach the exit condition?
> > > > There are 2 increments now: 2 than feeds into PHI and another one used for exit condition. It uses rounded version of trip count for comparison.
> > > Right, but why both are needed? Can there be more than 1 iteration where EVL < VF?
> > The first one is used to increment IV, another one - to check for number of iterations. We need the first one (which is vsetvli dependable) to correctly count IV.
> > But looks like I need to adjust the IV here, because otherwise we may have wrong comparison. I'll think more about it.
> > Right, but why both are needed? Can there be more than 1 iteration where EVL < VF?
> 
> Yes the last 2 iterations can have an EVL less than VL for RISC-V. The vsetvli instruction on RISC-V takes an input called AVL that contains the number of values to process and returns VL subject to the following constraints:
> 
> 1. vl = AVL if AVL ≤ VLMAX
> 2. ceil(AVL / 2) ≤ vl ≤ VLMAX if AVL < (2 * VLMAX)
> 3. vl = VLMAX if AVL ≥ (2 * VLMAX)
> 
> Bullet 2 there allows the AVL between VLMAX and 2*VLMAX to be split over the last 2 iterations. Not all microarchitectures implement this.
> 
> On each iteration VL cannot be larger than the VL on the previous iteration.
Thanks for clarifying! Does the current codegen in the patch work correctly for cases where we execute more than 1 iteration for  `EVL < VF`? IIUC the current approach with rounding up the trip count and using VF as increment assumes only one extra iteration.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99750/new/

https://reviews.llvm.org/D99750