[PATCH] D75069: [LoopVectorizer] Inloop vector reductions

Tue Jul 7 15:05:22 PDT 2020

Ayal added inline comments.

================
Comment at: llvm/lib/Analysis/IVDescriptors.cpp:812
+    if (LHS->getOpcode() == Opcode && L->contains(LHS->getParent()) &&
+        LHS->hasOneUse() &&
+        findPathToPhi(LHS, ReductionOperations, Opcode, Phi, L)) {
----------------
dmgreen wrote:
> Ayal wrote:
> > dmgreen wrote:
> > > Ayal wrote:
> > > > dmgreen wrote:
> > > > > fhahn wrote:
> > > > > > Ayal wrote:
> > > > > > > Looking for a chain of hasOneUse op's would be easier starting from the Phi and going downwards, until reaching LoopExitInstr?
> > > > > > > 
> > > > > > > Note that when extended to handle reductions with conditional bumps, some ops will have more than one use.
> > > > > > Instead of doing a recursive traversal, would it be simpler to just do the traversal iteratively, at least as long as we are only following a single-use chain?
> > > > > Yeah, that direction makes it a lot simpler. Thanks.
> > > > Is treating sub as an add reduction something in-loop reduction could support as a future extension?
> > > Hmm. I don't want to say never. A normal inloop reduction looks like:
> > >   p = PHI(0, a)
> > >   l = VLDR (..)
> > >   a = VADDVA(p, l)
> > > Where the `VADDV` is an across-vector reduction, and the extra `A` means also add p. Reducing a sub would need to become:
> > >   p = PHI(0, a)
> > >   l = VLDR (..)
> > >   a = VADDV(l)
> > >   p = SUB(p, a)
> > > With the SUB as a separate scalar instruction, which would be quite slow on some hardware (getting a value over from the VADDV to the SUB). So this would almost certainly be slower than an out-of-loop reduction.
> > > 
> > > But if we could end up using a higher vector factor for the reduction, or end up vectorizing loops that would previously not be vectorized.. that may lead to a gain overall to overcome the extra cost of adding the sub to the loop. It will require some very careful costing I think. And maybe the ability to create multiple vplans and cost them against one another :)
> > An original sub code, say, acc -= a[i], can be treated as acc += (-a[i]). This could be in-loop reduced by first negating a[i]'s, at LV's LLVM-IR level, presumably lowered later to something like
> > 
> > ```
> > p = PHI(0, a)
> > l = VLDR (..)
> > s = VSUBV (zero, l)
> > a = VADDVA(p, s)
> > ```
> > , right?
> Yep. We would have the option of trading a scalar instruction for a vector instruction + an extra register (to hold the 0, we only have 8 registers!)
> 
> Unfortunately both would be slower than an out-of-loop reduction unless we were vectorizing at a higher factor, though.
OK, so subs can be handled in-loop, but doing so is expected to be more costly than out-of-loop, at least if a horizontal add operation is to be used rather than a horizontal subtract; probably worth a comment.
If a reduction chain has only subs, the negations could all sink: negate the sum once after the loop and use VADDVA inside. Doing so, however, retains the middle block, i.e., it does not decrease code size.
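
To make the two in-loop options above concrete, here is a rough scalar model in plain C++ (not LLVM code and not what the patch generates). The names `VF`, `Vec`, `load`, `horizontalAdd`, `subInLoopNegate` and `subInLoopSunk` are all made up for illustration; `horizontalAdd` merely stands in for a VADDV-style across-vector add, and the scalar tail loops stand in for the remainder handling.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical vector factor, chosen only for this sketch.
constexpr std::size_t VF = 4;
using Vec = std::array<int64_t, VF>;

// Stand-in for an across-vector add (VADDV-like): sums all lanes.
int64_t horizontalAdd(const Vec &V) {
  int64_t Sum = 0;
  for (int64_t Lane : V)
    Sum += Lane;
  return Sum;
}

// Stand-in for a vector load of VF consecutive elements starting at I.
Vec load(const std::vector<int64_t> &A, std::size_t I) {
  Vec V{};
  for (std::size_t L = 0; L < VF; ++L)
    V[L] = A[I + L];
  return V;
}

// Reference: the original scalar loop, acc -= a[i].
int64_t subScalar(int64_t Acc, const std::vector<int64_t> &A) {
  for (int64_t X : A)
    Acc -= X;
  return Acc;
}

// Option 1: treat acc -= a[i] as acc += (-a[i]); negate every vector
// inside the loop, then accumulate with a horizontal add.
int64_t subInLoopNegate(int64_t Acc, const std::vector<int64_t> &A) {
  std::size_t I = 0;
  for (; I + VF <= A.size(); I += VF) {
    Vec V = load(A, I);
    for (int64_t &Lane : V)
      Lane = -Lane;            // models s = VSUBV(zero, l)
    Acc += horizontalAdd(V);   // models a = VADDVA(p, s)
  }
  for (; I < A.size(); ++I)    // scalar remainder
    Acc -= A[I];
  return Acc;
}

// Option 2: the chain has only subs, so sink the negation; accumulate
// a plain sum in the loop and subtract it once after the loop.
int64_t subInLoopSunk(int64_t Acc, const std::vector<int64_t> &A) {
  int64_t Sum = 0;
  std::size_t I = 0;
  for (; I + VF <= A.size(); I += VF)
    Sum += horizontalAdd(load(A, I)); // VADDVA-style accumulate
  for (; I < A.size(); ++I)           // scalar remainder
    Sum += A[I];
  return Acc - Sum;                   // single negation after the loop
}
```

All three functions should return the same value for the same inputs; the sunk variant keeps the loop body free of the extra negate, at the price of the fix-up after the loop.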

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75069/new/

https://reviews.llvm.org/D75069