[PATCH] D75069: [LoopVectorizer] Inloop vector reductions

Mon Jul 6 13:15:58 PDT 2020

Ayal added inline comments.

================
Comment at: llvm/lib/Analysis/IVDescriptors.cpp:439
     //       model could then apply the recurrence type to these instructions,
     //       without needing a white list of instructions to ignore.
     collectCastsToIgnore(TheLoop, ExitInstruction, RecurrenceType, CastInsts);
----------------
dmgreen wrote:
> Ayal wrote:
> > Perhaps the above "better way" would also help recognize and record horizontal reductions?
> Hmm. The reason I kept them separate was that this method is already pretty complex. I was trying to keep thing simpler. Adding the ability to detect a single chain of operations from Phi to LoopExitValue that can be used for horizontal reductions looks.. difficult. And error prone. :) If you think it's worth it then I can certainly give it a go! I like the separation of concerns in keeping them separate though.
> 
> The extra things that AddReductionVar will detect that getReductionOpChain will not are:
>  - Phi/select predicated reductions like in if-conversion-reductions.ll and if-reduction.ll. These would need some form of predicated reduction intrinsic.
>  - Narrow bitwidths. This one I could add.
>  - Subs/FSubs are treated like Adds/FAdds for vertical reductions.
Would be good to see if above TODO can be addressed - providing the set of all instructions that take part in the reduction. This set could then be used for checking in-loop reductions. Hopefully this could help simplify both, and keep them in some sync. But could be done later, possibly with another TODO..

================
Comment at: llvm/lib/Analysis/IVDescriptors.cpp:848
+
+  if (!isCorrectOpcode(Cur))
+    return {};
----------------
dmgreen wrote:
> Ayal wrote:
> > `|| !Cur->hasNUses(ExpectedUses)` ?
> > 
> > 
> > nit: can alternatively let getNextInstruction check its result and return only valid ones, e.g.:
> > 
> > ```
> > bool RedOpIsCmp = (RedOp == Instruction::ICmp || RedOp == Instruction::FCmp);
> > unsigned ExpectedUses = RedOpIsCmp ? 2 : 1;
> > 
> > auto getNextInstruction = [&](Instruction *Cur) {
> >   if (!Cur->hasNUses(ExpectedUses))
> >     return nullptr;
> >   auto *FirstUser = Cur->user_begin();
> >   if (!RedOpIsCmp)
> >     return FirstUser->getOpcode() == RedOp ? FirstUser : nullptr;
> >   // Handle cmp/select pair:
> >   auto *Sel = dyn_cast<SelectInst>(*FirstUser) ||
> >     dyn_cast<SelectInst>(*std::next(FirstUser));
> >   if (SelectPatternResult::isMinOrMax(matchSelectPattern(Sel, LHS, RHS).Flavor))
> >     return Sel;
> >   return nullptr;
> > }
> > 
> > for (auto *Cur = getNextInstruction(Phi); Cur && Cur != LoopExitInstr; Cur = getNextInstruction(Cur))
> >   ReductionOperations.push_back(Cur);
> > ```
> This is the loop exit instr, so can have as many uses as it likes I believe.
Ahh, ok. (It should have ExpectedUses+1 users being in lcssa.)

"instruciton"

================
Comment at: llvm/lib/Analysis/IVDescriptors.cpp:856
+  // The loop exit instruction also needs to be the same opcode. We dont allow
+  // them to be Subs.
+  if (!isCorrectOpcode(Cur))
----------------
Is Subs the only issue?.
Can check this earlier, before traversing the chain, although it is pushed back last, here.

================
Comment at: llvm/lib/Analysis/IVDescriptors.cpp:812
+    if (LHS->getOpcode() == Opcode && L->contains(LHS->getParent()) &&
+        LHS->hasOneUse() &&
+        findPathToPhi(LHS, ReductionOperations, Opcode, Phi, L)) {
----------------
dmgreen wrote:
> fhahn wrote:
> > Ayal wrote:
> > > Looking for a chain of hasOneUse op's would be easier starting from the Phi and going downwards, until reaching LoopExitInstr?
> > > 
> > > Note that when extended to handle reductions with conditional bumps, some ops will have more than one use.
> > Instead of doing a recursive traversal, would it be simpler to just do the traversal iteratively, at least as long as we are only using at  a single use chain?
> Yeah, that direction makes it a lot simpler. Thanks.
Is treating sub as an add reduction something in-loop reduction could support as a future extension?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7516
+    VPlanPtr &Plan, VPRecipeBuilder &RecipeBuilder) {
+  for (auto &Reduction : Legal->getReductionVars()) {
+    PHINode *Phi = Reduction.first;
----------------
This is the other potential use of for (auto &Reduction : CM.getInloopReductions()).

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:1041
+  /// outside.
+  void categorizeReductions();
+
----------------
Ayal wrote:
> collectInLoopReductions()? Perhaps worth holding a map of in loop reduction phi's with their chains.
Thanks. Worth updating the comment.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3769
     // MinMax reduction have the start value as their identify.
-    if (VF == 1) {
+    if (VF == 1 || UseInloopReductions) {
       VectorStart = Identity = ReductionStartValue;
----------------
dmgreen wrote:
> Ayal wrote:
> > dmgreen wrote:
> > > Ayal wrote:
> > > > This is dead code if cmp/select chains are not recognized yet, as noted above.
> > > I've added the code to handle minmax too (but not tested it a lot yet. I will try that now).
> > > 
> > > MVE has instructions for integer min/max reductions, but they can be slow enough to make them not worth using over a normal vmin/vmax. Adds are always not-slower-enough to warrant the inloop reduction (and have other advantages like handling higher type sizes and folding in more instructions.)
> > > 
> > > My point is that min/max, like some of the other fadd/mul/and/etc might not be used by MVE yet. If you think the code is more hassle than it deserves, then we could take them out for the time being. I'd like to leave them in for consistency though, even if it's not used straight away.
> > Would be good to make sure code is being exercised and tested. Could inloop min/max (and/or other reductions) help reduce code size, and be applied when vectorizing under optsize?
> -Os sounds like a good plan. It will take some backend work to make it efficient enough first though. And predicated reductions?
Hoisting the horizontal reduction from the middle block into the loop could potentially eliminate the middle block (as in tests below), so could presumably lead to code of smaller size? At-least for in-loop chains of a single link.

> And predicated reductions?
These are yet to be handled in-loop, right?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:6537
+    // want to record it as such.
+    if (!ForceInloopReductions)
+      continue;
----------------
dmgreen wrote:
> Ayal wrote:
> > Move this invariant check out as another early-exit?
> This does look a little strange here on it's own. The followup patch to add the TTI hook makes it look like:
>   if (!PreferInloopReductions &&
>         !TTI.preferInloopReduction(Opcode, Phi->getType(),
>                                    TargetTransformInfo::ReductionFlags()))
>       continue;
Then better placed above right after defining Phi?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7286
+    RecurrenceDescriptor &RdxDesc = Reduction.second;
+    if (CM.useInloopReductions(Reduction.first)) {
+      PHINode *Phi = Reduction.first;
----------------
dmgreen wrote:
> Ayal wrote:
> > dmgreen wrote:
> > > Ayal wrote:
> > > > Iterate over in loop reductions?
> > > Do you mean adding an iterator for iterating over reductions, stepping over the ones not inloop?
> > > 
> > > It would seem like it's similar to the existing code, but as a new iterator class. My gut says the current code is simpler and clearer what is going on?
> > Suggestion was to iterate over the PHIs/elements of InloopReductionChains, rather than over all reduction PHIs of Legal->getReductionVars().
> > 
> > (Better early-exit via "if (!CM.isInLoopReduction(Reduction.first)) continue;")
> I believe that InloopReductionChains would not iterate in a deterministic order, which is why I avoided it.
> 
> Perhaps that would not matter here? The reductions should be independent anyway. Seems safer to try and use deterministic ordering anyway if we can.
Agreed it would be better to use deterministic ordering. How about letting InloopReductionChains be a MapVector and iterate over
for (auto &Reduction : CM.getInloopReductions())?
The number of reductions is expected to be small, w/o removals.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75069/new/

https://reviews.llvm.org/D75069