[PATCH] D99750: [LV, VP]VP intrinsics support for the Loop Vectorizer

Mon Oct 16 04:08:46 PDT 2023

ABataev added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:832
+      continue;
+    auto *NewInst =
+        new VPInstruction(VPInstruction::ExplicitVectorLengthIVIncrement,
----------------
fhahn wrote:
> craig.topper wrote:
> > ABataev wrote:
> > > fhahn wrote:
> > > > ABataev wrote:
> > > > > fhahn wrote:
> > > > > > ABataev wrote:
> > > > > > > fhahn wrote:
> > > > > > > > ABataev wrote:
> > > > > > > > > fhahn wrote:
> > > > > > > > > > I think turning the step of the canonical induction non-loop-invariant technically turns the canonical IV into a phi that's not a canonical IV any more (which is guaranteed to step the same amount each iteration). Would it work to keep the increment unchanged and keep rounding up the trip count, as with regular tail folding initially? Further down the line, the canonical IV issue may be resolved by also replacing the canonical IV node with a regular scalar phi when doing the replacement here.
> > > > > > > > > I'll try to improve this.
> > > > > > > > Did you get a chance to try this out yet? 
> > > > > > > > 
> > > > > > > > 97687b7aea17 landed, it would probably be good to also remove the header mask from load/store recipes here, to make clear that this optimizes the tail-folded loop?
> > > > > > > Already did. The loop is countable, adding a new phi won't give anything, just some extra work without any effect.
> > > > > > Oh right I missed that, sorry! 
> > > > > > 
> > > > > > Does the latest version actually have to update the canonical IV increment? 
> > > > > > 
> > > > > > I might be missing something, but shouldn't the exit condition now use the rounded-up version (a multiple of the VF) of the trip count for the compare, so if we increment by EVL then we might not reach the exit condition?
> > > > > There are 2 increments now: one that feeds into the PHI and another one used for the exit condition. The latter uses the rounded-up version of the trip count for the comparison.
> > > > Right, but why are both needed? Can there be more than 1 iteration where EVL < VF?
> > > The first one is used to increment the IV, the other one to check the number of iterations. We need the first one (which depends on vsetvli) to count the IV correctly.
> > > But it looks like I need to adjust the IV here, because otherwise the comparison may be wrong. I'll think more about it.
> > > Right, but why are both needed? Can there be more than 1 iteration where EVL < VF?
> > 
> > Yes the last 2 iterations can have an EVL less than VL for RISC-V. The vsetvli instruction on RISC-V takes an input called AVL that contains the number of values to process and returns VL subject to the following constraints:
> > 
> > 1. vl = AVL if AVL ≤ VLMAX
> > 2. ceil(AVL / 2) ≤ vl ≤ VLMAX if AVL < (2 * VLMAX)
> > 3. vl = VLMAX if AVL ≥ (2 * VLMAX)
> > 
> > Bullet 2 there allows the AVL between VLMAX and 2*VLMAX to be split over the last 2 iterations. Not all microarchitectures implement this.
> > 
> > On each iteration VL cannot be larger than the VL on the previous iteration.
> Thanks for clarifying! Does the current codegen in the patch work correctly for cases where we execute more than 1 iteration with `EVL < VF`? IIUC the current approach of rounding up the trip count and using VF as the increment assumes only one extra iteration.
The total number of iterations is the same; only the vector length changes, balancing the remaining elements across iterations. If AVL is at most VLMAX, AVL is used as the vector length. Only if VLMAX < AVL < 2 * VLMAX may some magic happen, i.e. in the last 2 (vectorized) iterations.
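
To illustrate the iteration-count argument, here is a minimal sketch that models the three vsetvli constraints quoted above and records the EVL of each vector iteration. It is not code from the patch: `modelVsetvli`, `evlPerIteration` and the `splitEvenly` flag are hypothetical names, and the sketch assumes VF == VLMAX and that a splitting implementation picks the lower bound `ceil(AVL / 2)` of the range allowed by constraint 2.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the vl that vsetvli returns for a given AVL.
// splitEvenly selects the optional behaviour permitted by constraint 2
// (assumption: the implementation picks the lower bound ceil(AVL / 2)).
uint64_t modelVsetvli(uint64_t avl, uint64_t vlmax, bool splitEvenly) {
  assert(vlmax > 0 && "VLMAX must be non-zero");
  if (avl <= vlmax)
    return avl;   // constraint 1: vl = AVL
  if (avl >= 2 * vlmax)
    return vlmax; // constraint 3: vl = VLMAX
  // VLMAX < AVL < 2 * VLMAX: any vl in [ceil(AVL / 2), VLMAX] is allowed.
  return splitEvenly ? (avl + 1) / 2 : vlmax;
}

// Returns the EVL used on each vector iteration of a loop that processes
// tripCount elements, with the IV advancing by EVL rather than by VF.
std::vector<uint64_t> evlPerIteration(uint64_t tripCount, uint64_t vlmax,
                                      bool splitEvenly) {
  std::vector<uint64_t> evls;
  for (uint64_t done = 0; done < tripCount;) {
    uint64_t evl = modelVsetvli(tripCount - done, vlmax, splitEvenly);
    evls.push_back(evl);
    done += evl; // EVL is non-zero whenever elements remain
  }
  return evls;
}
```

Under these assumptions, `evlPerIteration(10, 4, true)` yields EVLs 4, 3, 3 and `evlPerIteration(10, 4, false)` yields 4, 4, 2. Both take 3 iterations, which equals ceil(10 / 4), but in the first case the running sum of EVLs is not a multiple of VF until the final iteration.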

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99750/new/

https://reviews.llvm.org/D99750