[PATCH] D99750: [LV, VP]VP intrinsics support for the Loop Vectorizer

Tue Aug 15 08:48:51 PDT 2023

Ayal added a comment.

Joining this review late admittedly, adding comments relative to recent version.

What loops are amenable to use EVL by this patch should be clearly documented, tested, and potentially VPlanVerified:
Innermost loops, must not be unrolled, must not use masked interleave groups, must produce tail - which must be folded.
What about loops processing elements of multiple/mixed sizes?
What about reductions, fixed-order recurrences?

What is the benefit of using EVL on top of scalable vectors - only replacing the active-lane-mask / tail-folding compare setting the header mask? Can that be done as a VPlan-to-VPlan transform?

Should the title be more accurate, as it **Introduces** VP Intrinsics, or rather Introduces **EVL**? Seems also possible to first introduce VP Intrinsics while setting EVL to a maximum default, i.e., w/o actually introducing EVL, hence would be good to emphasize in the title.

================
Comment at: llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h:84
+  bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
+                             Align Alignment) const;
+
----------------
Better called `hasEffectiveVectorLength()` as in hasEVL - for consistency?
>From the summary: "We use active vector length and explicit vector length interchangeably" - why? Could only EVL be used, consistently? (RISC-V documentation of setvl has AVL as a parameter and [E]VL as return/set value.)

Add Opcode/DataType/Alignment parameters later - when used?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5229
+                             "try to generate VP Intrinsics if the target "
+                             "support vector length predication.\n");
+      } else {
----------------
May as well also inform if the target supports it or not?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8254
   if (OrigLoop->getHeader() == BB) {
-    if (!CM.blockNeedsPredicationForAnyReason(BB))
+    if (!CM.blockNeedsPredicationForAnyReason(BB) || CM.useVPVectorization())
       return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.
----------------
If `useVPVectorization()` is another reason to predicate the header BB, it should be included in `blockNeedsPredicationForAnyReason()`. But since the former requires tail folding, is it really another reason, or can it be dropped?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9841
+        // vectorization.
+        if (Value *EVLPart = State.EVL ? State.get(State.EVL, Part) : nullptr) {
+          // if EVLPart is not null, we can vectorize using predicated
----------------
nit: `if (State.EVL) { Value *EVLPart = State.get(State.EVL, Part); ...`?

Why deal with Parts when EVL mandates UF=1?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:253

+// Option prefer-predicate-with-vp-intrinsics is an experimental switch to
+// indicate that the loop vectorizer should try to generate VP intrinsics if
----------------
Drop "experimental"?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:264
+// 2. force-explicit-vector-length-support: This is purely an
+// experimental/testing option which will be removed in future. It forces the
+// loop vectorizer to assume that the target supports vector length predication.
----------------
Drop "experimental" and "which will be removed in [the] future"?

Or drop altogether - is a switch enabling EVL for default target needed or suffice to test for concrete RISC-V EVL target?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:277
+    cl::Hidden,
+    cl::desc("When vectorizing with tail-folding, generate vector predication "
+             "intrinsics."),
----------------
Description talks about tail-folding, name of variable and command-line argument do not.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:2898
+           "Unexpected scalar epilogue.");
+    return VectorTripCount = TC;
+  }
----------------
What is the meaning of setting VectorTripCount = TC?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5215
+                           "not generate VP intrinsics since interleave count "
+                           "specified is greater than 1.\n");
+      return MaxFactors;
----------------
It may be confusing to keep `PreferVPIntrinsics = false` when "Preference for VP intrinsics indicated".

The (names and) relation between `PreferPredicateWithVPIntrinsics` and `PreferVPIntrinsics` should be clarified.

Separate the part that sets PreferVPIntrinsics, possible switching on PreferPredicateWithVPIntrinsics?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5222
+      if (PreferPredicateWithVPIntrinsics == EVLOption::IfEVLSupported) {
+        // FIXME: use actual opcode/data type for analysis here.
+        PreferVPIntrinsics = TTI.hasActiveVectorLength(0, nullptr, Align());
----------------
Add expected fail tests to cover opcode/data types that are not supported nor checked?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9203
   addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), DebugLoc(),
-                        CM.getTailFoldingStyle());
+                        CM.getTailFoldingStyle(), CM.useVPVectorization());
   return Plan;
----------------
Restrict EVL to inner loop vectorization only, for now?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9734

 void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
   VPValue *StoredValue = isStore() ? getStoredValue() : nullptr;
----------------
Recipes should strive to have straightforward code-gen as much as possible (contrary to "smart vector instructions/vp intrinsics emission" of the 2nd bullet in the summary's Tentative Development Roadmap).
This is already challenged by the existing (non-EVL) VPWidenMemoryInstructionRecipe::execute.

Design dedicated recipe(s) for widening memory instructions under EVL, and introduce them instead of the existing non-EVL recipes, preferably as a VPlan-to-VPlan transformation, rather than try to fit everything here, and potentially elsewhere?
Also discussed below in https://reviews.llvm.org/D99750#inline-967127, and iinm in earlier revisions.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9880
+        // intrinsic.
+        if (EVLPart) {
+          auto *StoredValTy = cast<VectorType>(StoredVal->getType());
----------------
ABataev wrote:
> fhahn wrote:
> > For correctness, would we also need special handling for interleave groups?
> if foldTailByMasking() is true (for VP support it is true), interleave groups are invalidated.
> if foldTailByMasking() is true (for VP support it is true), interleave groups are invalidated.

unless useMaskedInterleavedAccesses() is also true?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9071
+        assert(isa<TruncInst>(Instr) ||
+               isa<VPEVLRecipe>(HeaderVPBB->getFirstNonPhi()));
         Recipe->insertBefore(*HeaderVPBB, HeaderVPBB->getFirstNonPhi());
----------------
fhahn wrote:
> ABataev wrote:
> > fhahn wrote:
> > > Is this still needed? The latest version only introduces `VPEVLRecipe` in `addCanonicalIVRecipes`.
> > Still needed. VPWidenIntOrFpInductionRecipe is emitted after addCanonicalIVRecipes and needs to be inserted as a PHI recipe. I'm not sure about TODO, I think it should be removed. But we still need to insert VPWidenIntOrFpInductionRecipe before VPEVLRecipe, which is emitted immediately after canonical IV recipe.
> I think the TODO is not really needed. As the EVL recipe needs to be created before other recipe construction, there's not much we can do about it.
(Another) Premature creation of a recipe out of place/time?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:2886
+    Value *TC = getTripCount();
+    // Loop has multiple exits. Make sure scalar remainder executes at least 1
+    // scalar iteration to perform correct jump.
----------------
ABataev wrote:
> ABataev wrote:
> > ABataev wrote:
> > > fhahn wrote:
> > > > ABataev wrote:
> > > > > fhahn wrote:
> > > > > > Is there a test with multiple exits?
> > > > > It currently not supported by this patch. This is just an inital patch, it won't handle all the corner cases.
> > > > Reading this back now, the check below is for loops requiring scalar epilogues, not necessarily multiple exits. Would be good to update the comment and also add a test to make sure this is captured. AFAICT this is something that the patch already handles and should be tested (e.g. a test with an interleave group that requires scalar epilogue)
> > > Will add a test
> > Will add a test for ultiple exits only, interleave groups are invalidated if foldTailByMasking() is true.
> Double checked it, currently we cannot have this situation at all, so converting it to assert.
> interleave groups are invalidated if foldTailByMasking() is true
unless useMaskedInterleavedAccesses() is also true?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5244
+
+      if (PreferVPIntrinsics)
+        MaxFactors.FixedVF = ElementCount::getFixed(1);
----------------
ABataev wrote:
> fhahn wrote:
> > Would be good to document what is going on here (effectively disabling fixed with vectorization and why)
> Ok, will add
So EVL is applied to scalable VFs only, right?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:265-271
+namespace {
+enum class EVLOption {
+  NoPredication = 0,
+  IfEVLSupported,
+  ForceEVLSupport
+};
+} // namespace
----------------
ABataev wrote:
> zlxu-rivai wrote:
> > ABataev wrote:
> > > zlxu-rivai wrote:
> > > > IMHO, the vector length predication could be regarded as an another form of tail-folding, but not like ARM SVE which is based on `active.lane.mask` but based on an integer as vector length.
> > > > 
> > > > Based on this notion of unification, it would be better to augment the enum TailFoldingStyle, for example, add new entries like `TailFoldingStyle::DataWithEVL` (serving the purpose of `EVLOption::IfEVLSupport`) and the existent `TailFoldingStyle::None` can be reused as `EVLOption::NoPredication`.
> > > Not necessarily. Generally speaking, we would like to support vectorized loop remainder in the future as one of the alternative solutions, so better to keep them separate.
> > Forgive my ignorance, is there any scenario that epilogue vectorization is profitable when EVL is support? since in my understanding, eliminate loop epilogue is an important point of EVL.
> It may be, at least for some time. Some of the optimizations work better for countable loops. It requires some extra work and time, before that, at least, vectorized loop remainder may be faster in some cases.
Do you mean to say "Some optimizations work better for an invariant VF", i.e., that of VL0, rather than "for countable loops"?
Vectorizing tail with EVL may still produce countable vector loops, as noted above.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:1646
+  /// Returns true if VP intrinsics should be generated in the tail folded loop.
+  bool preferVPIntrinsics() const {
+    return foldTailByMasking() && PreferVPIntrinsics;
----------------
hiraditya wrote:
> nit: IMO the function name doesn't imply the semantics precisely.
> 
> nit: reorder to aviod call if possible
> ```cpp
> PreferVPIntrinsics && foldTailByMasking();
> ```
+1

The acronym `VP` is overloaded here and mainly short for "VPlan", as in VPInstruction.
Perhaps `VPI` would be clearer, short for VP Intrinsics.
Seems to mean "foldTailByEVL()" or "foldTailByVPI()"?

================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.h:2095
+/// intrinsic set_vector_length, that can be lowered to architecture specific
+/// instruction(s) to compute EVL.
+struct VPEVLRecipe : public VPRecipeBase, public VPValue {
----------------
But VPEVLRecipe seems to compute min(VF,TC-I) aka "the second way" in the above summary, rather than RISC-V's  special setvl(TC-I,SEW,LMUL) instruction aka third bullet there?

The use of "VF" here is confusing, as it is effectively used to compute VF... should it be MinVF or MaxVF instead? 

================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.h:2067
+  /// Return VPValue representing vector trip count.
+  VPValue *getVectorTripCount() const { return getOperand(1); }
+
----------------
VPEVLRecipe holds no state, so deserves an opcode in VPInstruction rather than a dedicated recipe?

Regarding the two operands of VPEVLRecipe - strictly speaking neither an Induction Variable nor a Vector Trip Count seem applicable, following the above?
The current 2nd operand is received in the constructor as `TC`, but retrieved as `VTC`?

================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.h:2065
+/// for current tail-folding implementation.
+/// 2. The second way is to insert instructions to compute min(VF, trip_count -
+/// index) for each vector iteration.
----------------
nikolaypanchenko wrote:
> ABataev wrote:
> > fhahn wrote:
> > > ABataev wrote:
> > > > fhahn wrote:
> > > > > ABataev wrote:
> > > > > > fhahn wrote:
> > > > > > > IIUC this is what the recipe actually computes at the moment, right?
> > > > > > No, currently it emits number 3. Other are not fully supported, they will require some extra work.
> > > > > Ah I see, would be good to update
> > > > Will update the comment
> > > Thanks for the update! IIUC options 1) and 2) don't need to be handled by `VPEVLRecipe`, 1) would just be a live-in VPValue and 2) could be a set VPInsructions to compute it. Would it make sense to only focus on 3) for EVL recipe.
> > > 
> > > It would be good to mention that the recipe effectively sets the EVL for the remainder of the region.
> > 1. Will remove 1 and 2 and will focus only on 3.
> > 2. It is not quite correct. For RISC-V it may return different values (< VLMAX) for 2 last iterations, not only for the remainder.
> To second @ABataev comment on 2): VL csr is controlled by `vset*vl*` instruction that has following behavior https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#63-constraints-on-setting-vl. Being more precise, with VLMAX=8 and AVL=9, it may either do (8, 1) or (5, 4).
Good clarification!

Would be good to explain at the outset "what the recipe actually computes"? (aka "3" - whose definition was lost somewhere?).

Trying to reason about this further: an original scalar loop
```
  for(I=0; I<TC; ++I) {
    ...
  }
```
processing Standard Elements of Width SEW can be vectorized, along with tail if needed, by doing
```
  for (I=0, N=TC; I<TC; EVL=vsetlv(N, SEW), I+=EVL, N-=EVL) {
    ...
  }
```
The spec and comments above imply that, strictly speaking:
(a) `I` may advance in non-fixed increments, in the end, and thus is **not** an **Induction Variable** in general.
(b) the vector loop produced as such is **uncountable** in general - does not have a predetermined vector-trip-count.
(c) a countable vector loop can be produced by considering the VL of the first vector iteration, e.g.,:
```
  VL0 = vsetlv(TC, SEW);
  VTC = ceil(TC/VL0);
  for (I=0, N=TC, VIV=0; VIV<VTC; ++VIV, EVL=vsetlv(N, SEW), I+=EVL, N-=EVL) {
    ...
  }
```
where `VIV` is an Induction Variable counting Vector Iterations and is separate from `I` which specifies the lanes processed in each vector iteration. A countable vector loop may be amenable to subsequent optimizations such as unrolling, modulo-scheduling, translation to hardware-loop.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp:1177
+    Value *IV = State.get(getIV(), Part);
+    Value *TripCount = State.get(getVectorTripCount(), Part);
+    Value *EVL = State.Builder.CreateSub(TripCount, IV);
----------------
This indeed looks like the original `TripCount` rather than a `VectorTripCount`?

The Sub can be taken care of by a separate VPInstruction, which should feed VPEVLRecipe, rather than having the latter depend on IV and [Vector]TripCount.

The result EVL produced by VPEVLRecipe should feed its dependent recipes directly, rather than via State.EVL?

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp:1157
+  // TODO: Restructure this code with an explicit remainder loop, vsetvli can be
+  // outside of the main loop that allows to use interleave or unroll in the
+  // main loop.
----------------
Placing vsetvli outside the main loop will compute VL0 as above. Idea of the TODO is to use VL0 repeatedly for all floor(TC/VL0) iterations excluding tail, thereby producing "explicit remainder loop" aka tail, and a countable main vector loop facilitating UF>1?

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp:1159
+  // main loop.
+  assert(State.UF < 2 && "Neither unrolling, nor interleaving is supported by "
+                         "predicated vectorization.");
----------------
"interleave or unroll", "Neither unrolling, nor interleaving" - what's the distinction between unrolling and interleaving?

If UF must be 1 then the loop over UF should be omitted.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp:292
+    Value *EVL = HasEVL ? State.get(getOperand(1), Part) : nullptr;
+    if (Part == 0 || EVL) {
       bool IsNUW = getOpcode() == VPInstruction::CanonicalIVIncrementNUW;
----------------
fhahn wrote:
> fhahn wrote:
> > IIUC the `EVL` check here is not needed? Can have an asset that `EVL` implies `Part == 0`. IIUC Part > 0 could be a problem is that there could be different EVL's for different parts?
> I spent some time thinking on how EVL could
> be modeled more explicitly here and shared D157322 which adds an argument for the increment which could be set to EVL when using it, removing the need to modify the recipe itself here
Or keep CanonicalIVIncrement intact using an implicit VF*UF step, and introduce separate recipe(s) to take care of controlling the loop and/or computing the lanes each vector iteration, under EVL?

nit: consider checking asserts in VPlanVerifier instead.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanValue.h:353
     VPWidenSelectSC,
+    VPEVLSC,
     // START: Phi-like recipes. Need to be kept together.
----------------
nit: lex order.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp:205

+  // Check if VPEVLRecipe exists only in Entry block and only once.
+  bool VPEVLFound = false;
----------------
nit: retrieve Entry from the loop region enclosing VPBB, rather than pass it in.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp:208
+  auto CheckEVLRecipies = [&](const VPRecipeBase *R) {
+    if (isa<VPEVLRecipe>(R)) {
+      if (VPBB != Entry || VPEVLFound)
----------------
nit: may as well dump the error message here.

A loop region under EVL may deserve its own verification to check its various constraints, e.g., no inner replicate regions, no replicate recipes, UF=1 only - should also be indicated when printed.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99750/new/

https://reviews.llvm.org/D99750