[PATCH] D132590: [SLP] Try to match reductions first in a vector build sequence.

Thu Aug 25 16:55:29 PDT 2022

vdmitrie added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:11820-11823
+  SmallVector<WeakTrackingVH> PostponedInsts;
+  for (Value *Op : BuildVectorOpds)
+    OpsChanged |=
+        vectorizeHorReduction(nullptr, Op, BB, R, TTI, PostponedInsts);
----------------
ABataev wrote:
> vdmitrie wrote:
> > ABataev wrote:
> > > vdmitrie wrote:
> > > > ABataev wrote:
> > > > > vdmitrie wrote:
> > > > > > ABataev wrote:
> > > > > > > vdmitrie wrote:
> > > > > > > > ABataev wrote:
> > > > > > > > > vdmitrie wrote:
> > > > > > > > > > ABataev wrote:
> > > > > > > > > > > Why do you want still call this after `findBuildAggregate`?
> > > > > > > > > > In order to match hor reduction on build vector operands. Can you offer other approach to do the same?
> > > > > > > > > But vectorizeRootInstruction already does this, why do you want to do it twice, here and in vectorizeRootInstruction? It increases compile time.
> > > > > > > > > Can we just call vectorizeRootInstruction(nullptr, I, BB, R, TTI); and vectorizeBuildVector after? And remove this loop and final OpsChanged |= tryToVectorize(PostponedInsts, R); call?
> > > > > > > > > But vectorizeRootInstruction already does this, 
> > > > > > > > 
> > > > > > > > Nope. This is exactly what does not happen after your commit https://reviews.llvm.org/rG7ea03f0b4e4e
> > > > > > > > Before the commit we traversed through operands of insertelements  when collected instructions for work stack and were able to locate all reductions at once.
> > > > > > > > 
> > > > > > > > > Can we just call vectorizeRootInstruction(nullptr, I, BB, R, TTI); and vectorizeBuildVector after?
> > > > > > > > 
> > > > > > > > Nope. vectorizeRootInstruction may too early create 2-way vectorizations. We only want to call it after trying on with wider VFs with tryToVectorizeList.
> > > > > > > > 
> > > > > > > > 
> > > > > > > Then you don't need to call vectorizeRootInstruction. Also, why do you need to find operands of the buildvector sequence? Just call vectorizeHorReduction for each insert instruction in the loop and in the second loop call findBuildAggregate and all other stuff for buildvector and postponed insts.
> > > > > > > Just call vectorizeHorReduction for each insert ...
> > > > > > 
> > > > > > I don't like this approach. I'd rather enable vectorizeHorReduction to traverse through insertelement operands. BTW you did not explain why you suppressed that.
> > > > > > 
> > > > > To save compile time, to avoid analysis of the same instructions several times.
> > > > > Why you don't like it?
> > > > > To save compile time, to avoid analysis of the same instructions several times.
> > > > > Why you don't like it?
> > > > 
> > > > You are probably talking about https://reviews.llvm.org/D114171 when saying "compile time". But I'm about the different change. Note that  the patch that was last reviewed in https://reviews.llvm.org/D114171
> > > > does not match what was actually committed (with the reference to differential revision). I'm specifically concerned about this piece of code:
> > > > 
> > > > ```
> > > >     // Try to vectorize operands.
> > > >     // Continue analysis for the instruction from the same basic block only to
> > > >     // save compile time.
> > > >     if (++Level < RecursionMaxDepth)
> > > >       for (auto *Op : Inst->operand_values())
> > > >         if (VisitedInstrs.insert(Op).second)
> > > >           if (auto *I = dyn_cast<Instruction>(Op))
> > > >             // Do not try to vectorize CmpInst operands,  this is done
> > > >             // separately.
> > > >             if (!isa<PHINode>(I) && !isa<CmpInst>(I) && !R.isDeleted(I) &&
> > > >                 I->getParent() == BB)
> > > >               Stack.emplace(I, Level);
> > > > ```
> > > > 
> > > > where you changed to avoid traversing through insertelement operands:
> > > > ```
> > > >             if (!isa<PHINode, CmpInst, InsertElementInst, InsertValueInst>(I) &&
> > > > ```
> > > > I'm not convinced that this specific change worth extra compile time.
> > > > 
> > > > I can agree that sticking to buildvector is not the best approach, but not traversing the operands early does not save compile time. It only leads to postponing the action for another visit but with much lower chances for vectorizing it the right way.
> > > > 
> > > > 
> > > It was changed after several comments regarding compile time.
> > > There are 2 problems with the original code.
> > > 1. Compile time in case of unsuccessful attempts. We repeat analysis of same instructions at least twice, if not more.
> > > 2. It does not preseve the analysis order for postponable instructions, some of them gets analyzed too early. It breaks internal logic, makes it harder for the perf abd bug analysis. And may lead to a too early vectorization.
> > > 
> > > The problem that insertelement is also the operand and it maybe a part of the another build vector sequence. In case of too early vectorization attempt we may end up with 2 x vectorization of the operand of the insertelement ibstruction instead of possible wider buildvector sequence.
> > 
> > >  In case of too early vectorization attempt we may end up with 2 x vectorization of the operand of the insertelement ibstruction instead of possible wider buildvector sequence.
> > 
> > Now vectorizeHorRedcution is decoupled from 2-way vectorization. we just need to call vectorizeHorRedcution instead of  vectorizeRootInstruction in order to avoid unwanted 2-way vectors early.
> > 
> And it is good. But then there is a call for vectorizeRootInstruction, which just repeats almost same stuff again. If you want to keep current implementation, I suggest to remove the call for vectorizeRootInstruction and call it if findBuildAggregate fails only, like
> ```
> 
> if (!findBuildAggregate(I, TTI, BuildVectorOpds, BuildVectorInsts))
>     return vectorizeRootInstruction(...);
> 
> ```
Yep. I was thinking about that too.
I also tried to add an extra loop over instructions just targeting reductions in vectorizeSimpleInstructions:

```
   // pass1 - try to vectorize reductions only
   for (auto *I : reverse(Instructions)) {
     if (R.isDeleted(I))
       continue;
     if (isa<CmpInst>(I)) {
       PostponedCmps.push_back(I);
      continue;
     }
     OpsChanged |= vectorizeHorReduction(nullptr, I, BB, R, TTI, PostponedInsts);
   }
   // pass2 - try to match and vectorize a buildvector sequence.
   for (auto *I : reverse(Instructions)) {
    if (R.isDeleted(I) || isa<CmpInst>(I))
       continue;
    if (auto *LastInsertValue = dyn_cast<InsertValueInst>(I)) {
      OpsChanged |= vectorizeInsertValueInst(LastInsertValue, BB, R);
    } else if (auto *LastInsertElem = dyn_cast<InsertElementInst>(I)) {
      OpsChanged |= vectorizeInsertElementInst(LastInsertElem, BB, R);
     }
  }
   // Now try to vectorize postponed instructions.
   OpsChanged |= tryToVectorize(PostponedInsts, R);

```
Here we don't stick to vector build for reductions and have no interface changes.
What's your opinion in this approach?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D132590/new/

https://reviews.llvm.org/D132590