[PATCH] D99750: [LV, VP]VP intrinsics support for the Loop Vectorizer

Fri Aug 11 14:35:15 PDT 2023

fhahn added a comment.

Thanks for the update! @Ayal & I discussed some potential cleanups and improvements in the VPlan infrastructure that could help to reduce the need to add some extra code to existing recipes to perform simplifications and for explicit EVL handling.

One motivating thought was to think of tail folding and EVL introduction as gradual refinement/optimizations that restrict the  number of ‚active‘ vector elements of the last iteration(s). Hence we could start out with an initial plan that doesn’t have EVL/tail folded and introducing EVL/tail folding as optimization of the original plan.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9845
+    if (auto *IMask = dyn_cast<VPInstruction>(Mask))
+      if (IMask->getOpcode() == VPInstruction::ICmpULE)
+        return Builder.getTrueVector(EC);
----------------
ABataev wrote:
> fhahn wrote:
> > ABataev wrote:
> > > fhahn wrote:
> > > > It's not clear to me what the reasoning is here to simplify this to the all-true mask without checking the operands of the compare. If the compare can be simplified to the all-true mask, then this would be suitable to do as VPlan-to-VPlan simplification instead of during codegen.
> > > Added extra check.
> > Thanks for the update. Looking at this again, it seems like it would be better to either perform this simplification when creating the tail-folding mask or optimize as VPlan-to-VPlan transform, as presumably this applies to all users of the top-level mask? Might be good to tie in with D157037 which would always create the top-level mask in the beginning when tail-folding.
> If I understand your proposal correctly, it will require adding special VPValue like AllTrueVPValue. This lambda returns Value *, no VPValue, so it cannot be done directly without adding new live-in(?).
I think it would be preferable to handle this more generically during VPlan optimizations.

I’ve been working on moving tail folding to a Vallant transform instead of combining it with regular block mask creation (D157713)

With that, this simplification could be handled generically by not adding the header mask during the tail folding transform.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.h:2065
+/// for current tail-folding implementation.
+/// 2. The second way is to insert instructions to compute min(VF, trip_count -
+/// index) for each vector iteration.
----------------
nikolaypanchenko wrote:
> ABataev wrote:
> > fhahn wrote:
> > > ABataev wrote:
> > > > fhahn wrote:
> > > > > ABataev wrote:
> > > > > > fhahn wrote:
> > > > > > > IIUC this is what the recipe actually computes at the moment, right?
> > > > > > No, currently it emits number 3. Other are not fully supported, they will require some extra work.
> > > > > Ah I see, would be good to update
> > > > Will update the comment
> > > Thanks for the update! IIUC options 1) and 2) don't need to be handled by `VPEVLRecipe`, 1) would just be a live-in VPValue and 2) could be a set VPInsructions to compute it. Would it make sense to only focus on 3) for EVL recipe.
> > > 
> > > It would be good to mention that the recipe effectively sets the EVL for the remainder of the region.
> > 1. Will remove 1 and 2 and will focus only on 3.
> > 2. It is not quite correct. For RISC-V it may return different values (< VLMAX) for 2 last iterations, not only for the remainder.
> To second @ABataev comment on 2): VL csr is controlled by `vset*vl*` instruction that has following behavior https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#63-constraints-on-setting-vl. Being more precise, with VLMAX=8 and AVL=9, it may either do (8, 1) or (5, 4).
Right, I meant it sets it for the recipes in the remainder of the region. It can set a different VL on each iteration of the region

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp:292
+    Value *EVL = HasEVL ? State.get(getOperand(1), Part) : nullptr;
+    if (Part == 0 || EVL) {
       bool IsNUW = getOpcode() == VPInstruction::CanonicalIVIncrementNUW;
----------------
fhahn wrote:
> IIUC the `EVL` check here is not needed? Can have an asset that `EVL` implies `Part == 0`. IIUC Part > 0 could be a problem is that there could be different EVL's for different parts?
I spent some time thinking on how EVL could
be modeled more explicitly here and shared D157322 which adds an argument for the increment which could be set to EVL when using it, removing the need to modify the recipe itself here

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99750/new/

https://reviews.llvm.org/D99750