[PATCH] D75069: [LoopVectorizer] Inloop vector reductions

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jul 7 14:08:11 PDT 2020


dmgreen added inline comments.


================
Comment at: llvm/lib/Analysis/IVDescriptors.cpp:812
+    if (LHS->getOpcode() == Opcode && L->contains(LHS->getParent()) &&
+        LHS->hasOneUse() &&
+        findPathToPhi(LHS, ReductionOperations, Opcode, Phi, L)) {
----------------
Ayal wrote:
> dmgreen wrote:
> > Ayal wrote:
> > > dmgreen wrote:
> > > > fhahn wrote:
> > > > > Ayal wrote:
> > > > > > Looking for a chain of hasOneUse ops would be easier starting from the Phi and going downwards, until reaching LoopExitInstr?
> > > > > > 
> > > > > > Note that when extended to handle reductions with conditional bumps, some ops will have more than one use.
> > > > > Instead of doing a recursive traversal, would it be simpler to just do the traversal iteratively, at least as long as we are only following a single-use chain?
> > > > Yeah, that direction makes it a lot simpler. Thanks.
> > > Is treating sub as an add reduction something in-loop reduction could support as a future extension?
> > Hmm. I don't want to say never. A normal inloop reduction looks like:
> >   p = PHI(0, a)
> >   l = VLDR (..)
> >   a = VADDVA(p, l)
> > Where the `VADDV` is an across-vector reduction, and the extra `A` means also add p. Reducing a sub would need to become:
> >   p = PHI(0, a)
> >   l = VLDR (..)
> >   a = VADDV(l)
> >   p = SUB(p, a)
> > With the SUB as a separate scalar instruction, which would be quite slow on some hardware (getting a value over from the VADDV to the SUB). So this would almost certainly be slower than an out-of-loop reduction.
> > 
> > But if we could end up using a higher vector factor for the reduction, or end up vectorizing loops that would previously not be vectorized, the overall gain may outweigh the extra cost of adding the sub to the loop. It will require some very careful costing, I think. And maybe the ability to create multiple vplans and cost them against one another :)
> An original sub code, say, acc -= a[i], can be treated as acc += (-a[i]). This could be in-loop reduced by first negating a[i]'s, at LV's LLVM-IR level, presumably lowered later to something like
> 
> ```
> p = PHI(0, a)
> l = VLDR (..)
> s = VSUBV (zero, l)
> a = VADDVA(p, s)
> ```
> , right?
Yep. We would have the option of trading a scalar instruction for a vector instruction + an extra register (to hold the 0; we only have 8 registers!)

Unfortunately both would be slower than an out-of-loop reduction unless we were vectorizing at a higher factor, though.
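To make the equivalence concrete, the idea of treating `acc -= a[i]` as `acc += (-a[i])` can be modelled in plain scalar C++ (the function names here are invented for illustration; the real transform would happen at LLVM-IR level and be lowered to the VSUBV/VADDVA sequence sketched above):

```cpp
#include <cstdint>
#include <vector>

// Direct form of the sub reduction: acc -= a[i] on every element.
int32_t subReduceDirect(const std::vector<int32_t> &A, int32_t Acc) {
  for (int32_t V : A)
    Acc -= V;
  return Acc;
}

// Negate-then-add form: acc += (-a[i]). The elements are negated first
// (the VSUBV(zero, l) step in the MVE sketch) and then accumulated with
// an ordinary add reduction (VADDVA).
int32_t subReduceViaNegate(const std::vector<int32_t> &A, int32_t Acc) {
  for (int32_t V : A)
    Acc += -V;
  return Acc;
}
```

Both forms compute the same value; the question is purely which costs more on a given target.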


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3769
     // MinMax reduction have the start value as their identify.
-    if (VF == 1) {
+    if (VF == 1 || UseInloopReductions) {
       VectorStart = Identity = ReductionStartValue;
----------------
Ayal wrote:
> dmgreen wrote:
> > Ayal wrote:
> > > dmgreen wrote:
> > > > Ayal wrote:
> > > > > dmgreen wrote:
> > > > > > Ayal wrote:
> > > > > > > This is dead code if cmp/select chains are not recognized yet, as noted above.
> > > > > > I've added the code to handle minmax too (but not tested it a lot yet. I will try that now).
> > > > > > 
> > > > > > MVE has instructions for integer min/max reductions, but they can be slow enough to make them not worth using over a normal vmin/vmax. Adds are never slow enough to make the in-loop reduction unprofitable (and they have other advantages, like handling larger type sizes and folding in more instructions).
> > > > > > 
> > > > > > My point is that min/max, like some of the other fadd/mul/and/etc might not be used by MVE yet. If you think the code is more hassle than it deserves, then we could take them out for the time being. I'd like to leave them in for consistency though, even if it's not used straight away.
> > > > > Would be good to make sure code is being exercised and tested. Could inloop min/max (and/or other reductions) help reduce code size, and be applied when vectorizing under optsize?
> > > > -Os sounds like a good plan. It will take some backend work to make it efficient enough first though. And predicated reductions?
> > > Hoisting the horizontal reduction from the middle block into the loop could potentially eliminate the middle block (as in tests below), so could presumably lead to smaller code? At least for in-loop chains of a single link.
> > > 
> > > > And predicated reductions?
> > > These are yet to be handled in-loop, right?
> > >> And predicated reductions?
> > > These are yet to be handled in-loop, right?
> > Yep. It will need a predicated reduction intrinsic. A vecreduce that takes a mask. That will allow us to tail-fold the reductions with trip counts that do not divide the vector factor, which will make them look a lot better under -Os. And nice in general I think once it all starts being tail predicated.
> > 
> > The backend work I was mentioning was that we need to more efficiently transform
> >   x = min(vecreduce.min(z), y)
> > into
> >   x = VMINV(y, z)
> > Where y is (confusingly) accumulated in that case (even though the instruction doesn't have an A suffix). We currently generate
> >   x = min(VMINV(UINT_MAX, z), y)
> > 
> > Once that is sorted out then, yep, using these for Os sounds like a good plan.
> Re: predicated reductions - could they be handled by replacing masked-off elements with `Identity` using a select prior to reduction? To be potentially folded later by suitable targets into a predicated reduction operation which they may support.
> Somewhat akin to "passthru" values of masked loads.
Oh. So select
  s = select m, a, 0
  v = vecreduce.add s
to a predicated vaddv?

Yeah, sounds interesting. I'll look into that as an alternative to predicated intrinsics. Nice suggestion.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75069/new/

https://reviews.llvm.org/D75069




