[PATCH] D99750: [LV, VP]VP intrinsics support for the Loop Vectorizer

Tue Aug 15 14:04:53 PDT 2023

ABataev marked 5 inline comments as done.
ABataev added inline comments.

================
Comment at: llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h:84
+  bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
+                             Align Alignment) const;
+
----------------
Ayal wrote:
> Better called `hasEffectiveVectorLength()` as in hasEVL - for consistency?
> From the summary: "We use active vector length and explicit vector length interchangeably" - why? Could only EVL be used, consistently? (RISC-V documentation of setvl has AVL as a parameter and [E]VL as return/set value.)
> 
> Add Opcode/DataType/Alignment parameters later - when used?
This is separate. This function is already introduced in TTI, the purpose of the patch is not related to TTI directly.
As to params, it is done in the implementation.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8254
   if (OrigLoop->getHeader() == BB) {
-    if (!CM.blockNeedsPredicationForAnyReason(BB))
+    if (!CM.blockNeedsPredicationForAnyReason(BB) || CM.useVPVectorization())
       return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.
----------------
Ayal wrote:
> If `useVPVectorization()` is another reason to predicate the header BB, it should be included in `blockNeedsPredicationForAnyReason()`. But since the former requires tail folding, is it really another reason, or can it be dropped?
It is another reason and cannot be dropped

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9841
+        // vectorization.
+        if (Value *EVLPart = State.EVL ? State.get(State.EVL, Part) : nullptr) {
+          // if EVLPart is not null, we can vectorize using predicated
----------------
Ayal wrote:
> nit: `if (State.EVL) { Value *EVLPart = State.get(State.EVL, Part); ...`?
> 
> Why deal with Parts when EVL mandates UF=1?
The loop executes to State.UF (1 here, yes), so everything is fine here

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:2898
+           "Unexpected scalar epilogue.");
+    return VectorTripCount = TC;
+  }
----------------
Ayal wrote:
> What is the meaning of setting VectorTripCount = TC?
We don't need to round vector trip count, so just set it to original trip count value (EVL is adjusted automatically to be not larger than trip count).

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5215
+                           "not generate VP intrinsics since interleave count "
+                           "specified is greater than 1.\n");
+      return MaxFactors;
----------------
Ayal wrote:
> It may be confusing to keep `PreferVPIntrinsics = false` when "Preference for VP intrinsics indicated".
> 
> The (names and) relation between `PreferPredicateWithVPIntrinsics` and `PreferVPIntrinsics` should be clarified.
> 
> Separate the part that sets PreferVPIntrinsics, possible switching on PreferPredicateWithVPIntrinsics?
PreferPredicateWithVPIntrinsics is just the option, that controls, whether we can emit vp intrinsics at all. If later we see, that it is not possible for some reason, we do not do it to still be able to produce correct code. I don't see what else should be adjusted here

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5222
+      if (PreferPredicateWithVPIntrinsics == EVLOption::IfEVLSupported) {
+        // FIXME: use actual opcode/data type for analysis here.
+        PreferVPIntrinsics = TTI.hasActiveVectorLength(0, nullptr, Align());
----------------
Ayal wrote:
> Add expected fail tests to cover opcode/data types that are not supported nor checked?
Currently there are no such test, since we support only loads/stores, which are supported by all potential targets

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9734

 void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
   VPValue *StoredValue = isStore() ? getStoredValue() : nullptr;
----------------
Ayal wrote:
> Recipes should strive to have straightforward code-gen as much as possible (contrary to "smart vector instructions/vp intrinsics emission" of the 2nd bullet in the summary's Tentative Development Roadmap).
> This is already challenged by the existing (non-EVL) VPWidenMemoryInstructionRecipe::execute.
> 
> Design dedicated recipe(s) for widening memory instructions under EVL, and introduce them instead of the existing non-EVL recipes, preferably as a VPlan-to-VPlan transformation, rather than try to fit everything here, and potentially elsewhere?
> Also discussed below in https://reviews.llvm.org/D99750#inline-967127, and iinm in earlier revisions.
Why? I t already handles masking, why it should not be extended with the handling of EVL? New recipe will not add anything new here, just will be copy/paste of the existing recipes.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:2886
+    Value *TC = getTripCount();
+    // Loop has multiple exits. Make sure scalar remainder executes at least 1
+    // scalar iteration to perform correct jump.
----------------
Ayal wrote:
> ABataev wrote:
> > ABataev wrote:
> > > ABataev wrote:
> > > > fhahn wrote:
> > > > > ABataev wrote:
> > > > > > fhahn wrote:
> > > > > > > Is there a test with multiple exits?
> > > > > > It currently not supported by this patch. This is just an inital patch, it won't handle all the corner cases.
> > > > > Reading this back now, the check below is for loops requiring scalar epilogues, not necessarily multiple exits. Would be good to update the comment and also add a test to make sure this is captured. AFAICT this is something that the patch already handles and should be tested (e.g. a test with an interleave group that requires scalar epilogue)
> > > > Will add a test
> > > Will add a test for ultiple exits only, interleave groups are invalidated if foldTailByMasking() is true.
> > Double checked it, currently we cannot have this situation at all, so converting it to assert.
> > interleave groups are invalidated if foldTailByMasking() is true
> unless useMaskedInterleavedAccesses() is also true?
I can return check back, but currently cannot provide a test, since the target invalidates the costs for interleaved groups when making cost-based decisions.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5244
+
+      if (PreferVPIntrinsics)
+        MaxFactors.FixedVF = ElementCount::getFixed(1);
----------------
Ayal wrote:
> ABataev wrote:
> > fhahn wrote:
> > > Would be good to document what is going on here (effectively disabling fixed with vectorization and why)
> > Ok, will add
> So EVL is applied to scalable VFs only, right?
For now - yes.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:265-271
+namespace {
+enum class EVLOption {
+  NoPredication = 0,
+  IfEVLSupported,
+  ForceEVLSupport
+};
+} // namespace
----------------
Ayal wrote:
> ABataev wrote:
> > zlxu-rivai wrote:
> > > ABataev wrote:
> > > > zlxu-rivai wrote:
> > > > > IMHO, the vector length predication could be regarded as an another form of tail-folding, but not like ARM SVE which is based on `active.lane.mask` but based on an integer as vector length.
> > > > > 
> > > > > Based on this notion of unification, it would be better to augment the enum TailFoldingStyle, for example, add new entries like `TailFoldingStyle::DataWithEVL` (serving the purpose of `EVLOption::IfEVLSupport`) and the existent `TailFoldingStyle::None` can be reused as `EVLOption::NoPredication`.
> > > > Not necessarily. Generally speaking, we would like to support vectorized loop remainder in the future as one of the alternative solutions, so better to keep them separate.
> > > Forgive my ignorance, is there any scenario that epilogue vectorization is profitable when EVL is support? since in my understanding, eliminate loop epilogue is an important point of EVL.
> > It may be, at least for some time. Some of the optimizations work better for countable loops. It requires some extra work and time, before that, at least, vectorized loop remainder may be faster in some cases.
> Do you mean to say "Some optimizations work better for an invariant VF", i.e., that of VL0, rather than "for countable loops"?
> Vectorizing tail with EVL may still produce countable vector loops, as noted above.
Currently, vectorization with EVL will result in an uncountable loop. Tail vectorization is not always the best option for some targets

================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.h:2095
+/// intrinsic set_vector_length, that can be lowered to architecture specific
+/// instruction(s) to compute EVL.
+struct VPEVLRecipe : public VPRecipeBase, public VPValue {
----------------
Ayal wrote:
> But VPEVLRecipe seems to compute min(VF,TC-I) aka "the second way" in the above summary, rather than RISC-V's  special setvl(TC-I,SEW,LMUL) instruction aka third bullet there?
> 
> The use of "VF" here is confusing, as it is effectively used to compute VF... should it be MinVF or MaxVF instead? 
1. No, it emits setvl
2. There is no VF.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.h:2065
+/// for current tail-folding implementation.
+/// 2. The second way is to insert instructions to compute min(VF, trip_count -
+/// index) for each vector iteration.
----------------
fhahn wrote:
> Ayal wrote:
> > nikolaypanchenko wrote:
> > > ABataev wrote:
> > > > fhahn wrote:
> > > > > ABataev wrote:
> > > > > > fhahn wrote:
> > > > > > > ABataev wrote:
> > > > > > > > fhahn wrote:
> > > > > > > > > IIUC this is what the recipe actually computes at the moment, right?
> > > > > > > > No, currently it emits number 3. Other are not fully supported, they will require some extra work.
> > > > > > > Ah I see, would be good to update
> > > > > > Will update the comment
> > > > > Thanks for the update! IIUC options 1) and 2) don't need to be handled by `VPEVLRecipe`, 1) would just be a live-in VPValue and 2) could be a set VPInsructions to compute it. Would it make sense to only focus on 3) for EVL recipe.
> > > > > 
> > > > > It would be good to mention that the recipe effectively sets the EVL for the remainder of the region.
> > > > 1. Will remove 1 and 2 and will focus only on 3.
> > > > 2. It is not quite correct. For RISC-V it may return different values (< VLMAX) for 2 last iterations, not only for the remainder.
> > > To second @ABataev comment on 2): VL csr is controlled by `vset*vl*` instruction that has following behavior https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#63-constraints-on-setting-vl. Being more precise, with VLMAX=8 and AVL=9, it may either do (8, 1) or (5, 4).
> > Good clarification!
> > 
> > Would be good to explain at the outset "what the recipe actually computes"? (aka "3" - whose definition was lost somewhere?).
> > 
> > Trying to reason about this further: an original scalar loop
> > ```
> >   for(I=0; I<TC; ++I) {
> >     ...
> >   }
> > ```
> > processing Standard Elements of Width SEW can be vectorized, along with tail if needed, by doing
> > ```
> >   for (I=0, N=TC; I<TC; EVL=vsetlv(N, SEW), I+=EVL, N-=EVL) {
> >     ...
> >   }
> > ```
> > The spec and comments above imply that, strictly speaking:
> > (a) `I` may advance in non-fixed increments, in the end, and thus is **not** an **Induction Variable** in general.
> > (b) the vector loop produced as such is **uncountable** in general - does not have a predetermined vector-trip-count.
> > (c) a countable vector loop can be produced by considering the VL of the first vector iteration, e.g.,:
> > ```
> >   VL0 = vsetlv(TC, SEW);
> >   VTC = ceil(TC/VL0);
> >   for (I=0, N=TC, VIV=0; VIV<VTC; ++VIV, EVL=vsetlv(N, SEW), I+=EVL, N-=EVL) {
> >     ...
> >   }
> > ```
> > where `VIV` is an Induction Variable counting Vector Iterations and is separate from `I` which specifies the lanes processed in each vector iteration. A countable vector loop may be amenable to subsequent optimizations such as unrolling, modulo-scheduling, translation to hardware-loop.
> Right, I meant it sets it for the recipes in the remainder of the region. It can set a different VL on each iteration of the region
> Good clarification!
> 
> Would be good to explain at the outset "what the recipe actually computes"? (aka "3" - whose definition was lost somewhere?).
> 
> Trying to reason about this further: an original scalar loop
> ```
>   for(I=0; I<TC; ++I) {
>     ...
>   }
> ```
> processing Standard Elements of Width SEW can be vectorized, along with tail if needed, by doing
> ```
>   for (I=0, N=TC; I<TC; EVL=vsetlv(N, SEW), I+=EVL, N-=EVL) {
>     ...
>   }
> ```
> The spec and comments above imply that, strictly speaking:
> (a) `I` may advance in non-fixed increments, in the end, and thus is **not** an **Induction Variable** in general.
> (b) the vector loop produced as such is **uncountable** in general - does not have a predetermined vector-trip-count.

And I described this already.

> (c) a countable vector loop can be produced by considering the VL of the first vector iteration, e.g.,:
> ```
>   VL0 = vsetlv(TC, SEW);
>   VTC = ceil(TC/VL0);
>   for (I=0, N=TC, VIV=0; VIV<VTC; ++VIV, EVL=vsetlv(N, SEW), I+=EVL, N-=EVL) {
>     ...
>   }
> ```
> where `VIV` is an Induction Variable counting Vector Iterations and is separate from `I` which specifies the lanes processed in each vector iteration. A countable vector loop may be amenable to subsequent optimizations such as unrolling, modulo-scheduling, translation to hardware-loop.

Yes, we're going to implement this later.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp:1157
+  // TODO: Restructure this code with an explicit remainder loop, vsetvli can be
+  // outside of the main loop that allows to use interleave or unroll in the
+  // main loop.
----------------
Ayal wrote:
> Placing vsetvli outside the main loop will compute VL0 as above. Idea of the TODO is to use VL0 repeatedly for all floor(TC/VL0) iterations excluding tail, thereby producing "explicit remainder loop" aka tail, and a countable main vector loop facilitating UF>1?
It is not possible, generally speaking, since not only tail might be affected

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99750/new/

https://reviews.llvm.org/D99750