[PATCH] D75069: [LoopVectorizer] Inloop vector reductions

Tue Jul 7 12:33:26 PDT 2020

Ayal added inline comments.

================
Comment at: llvm/lib/Analysis/IVDescriptors.cpp:844
+  // check the opcode is correct (and dont allow them to be Subs) and that they
+  // have expected to have an extra user from the LCSSA phi, but select do not
+  // have a single use from the phi, meaning we should always expect 2 uses.
----------------
rephrase sentence so it parses

================
Comment at: llvm/lib/Analysis/IVDescriptors.cpp:812
+    if (LHS->getOpcode() == Opcode && L->contains(LHS->getParent()) &&
+        LHS->hasOneUse() &&
+        findPathToPhi(LHS, ReductionOperations, Opcode, Phi, L)) {
----------------
dmgreen wrote:
> Ayal wrote:
> > dmgreen wrote:
> > > fhahn wrote:
> > > > Ayal wrote:
> > > > > Looking for a chain of hasOneUse op's would be easier starting from the Phi and going downwards, until reaching LoopExitInstr?
> > > > > 
> > > > > Note that when extended to handle reductions with conditional bumps, some ops will have more than one use.
> > > > Instead of doing a recursive traversal, would it be simpler to just do the traversal iteratively, at least as long as we are only using at  a single use chain?
> > > Yeah, that direction makes it a lot simpler. Thanks.
> > Is treating sub as an add reduction something in-loop reduction could support as a future extension?
> Hmm. I don't want to say never. A normal inloop reduction looks like:
>   p = PHI(0, a)
>   l = VLDR (..)
>   a = VADDVA(p, l)
> Where the `VADDV` is an across-vector reductions, and the extra `A` means also add p. Reducing a sub would need to become:
>   p = PHI(0, a)
>   l = VLDR (..)
>   a = VADDV(l)
>   p = SUB(p, a)
> With the SUB as a separate scalar instruction, which would be quite slow on some hardware (getting a value over from the VADDV to the SUB). So this would almost certainly be slower than a out-of-loop reduction.
> 
> But if we could end up using a higher vector factor for the reduction, or end up vectorizing loops that would previously not be vectorized.. that may lead to a gain overall to overcome the extra cost of adding the sub to the loop. It will require some very careful costing I think. And maybe the ability to create multiple vplans and cost them against one another :)
An original sub code, say, acc -= a[i], can be treated as acc += (-a[i]). This could be in-loop reduced by first negating a[i]'s, at LV's LLVM-IR level, presumably lowered later to something like

```
p = PHI(0, a)
l = VLDR (..)
s = VSUBV (zero, l)
a = VADDVA(p, s)
```
, right?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7348
+      // For min/max reducitons, where we have a pair of icmp/select, we also
+      // need to recode the ICmp recipe, so it can be removed later.
+      if (Kind == RecurrenceDescriptor::RK_IntegerMinMax ||
----------------
recode >> record

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3751
   setDebugLocFromInst(Builder, ReductionStartValue);
+  bool UseInloopReductions = Cost->useInloopReductions(Phi);

----------------
Ayal wrote:
> isInLoopReductionPhi
Ahh, this should actually be capitalized `IsInLoopReductionPhi`

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3769
     // MinMax reduction have the start value as their identify.
-    if (VF == 1) {
+    if (VF == 1 || UseInloopReductions) {
       VectorStart = Identity = ReductionStartValue;
----------------
dmgreen wrote:
> Ayal wrote:
> > dmgreen wrote:
> > > Ayal wrote:
> > > > dmgreen wrote:
> > > > > Ayal wrote:
> > > > > > This is dead code if cmp/select chains are not recognized yet, as noted above.
> > > > > I've added the code to handle minmax too (but not tested it a lot yet. I will try that now).
> > > > > 
> > > > > MVE has instructions for integer min/max reductions, but they can be slow enough to make them not worth using over a normal vmin/vmax. Adds are always not-slower-enough to warrant the inloop reduction (and have other advantages like handling higher type sizes and folding in more instructions.)
> > > > > 
> > > > > My point is that min/max, like some of the other fadd/mul/and/etc might not be used by MVE yet. If you think the code is more hassle than it deserves, then we could take them out for the time being. I'd like to leave them in for consistency though, even if it's not used straight away.
> > > > Would be good to make sure code is being exercised and tested. Could inloop min/max (and/or other reductions) help reduce code size, and be applied when vectorizing under optsize?
> > > -Os sounds like a good plan. It will take some backend work to make it efficient enough first though. And predicated reductions?
> > Hoisting the horizontal reduction from the middle block into the loop could potentially eliminate the middle block (as in tests below), so could presumably lead to code of smaller size? At-least for in-loop chains of a single link.
> > 
> > > And predicated reductions?
> > These are yet to be handled in-loop, right?
> >> And predicated reductions?
> >These are yet to be handled in-loop, right?
> Yep. It will need a predicated reduction intrinsic. A vecreduce that takes a mask. That will allow us to tail-fold the reductions with trip counts that do not divide the vector factor, which will make them look a lot better under -Os. And nice in general I think once it all starts being tail predicated.
> 
> The backend work I was mentioning was that we need to more efficiently transform
>   x = min(vecreduce.min(z), y)
> into
>   x = VMINV(y, z)
> Where y is (confusingly) accumulated in the case (even though the instruction doesn't have an A suffix). We currently generate
>   x = min(VMINV(UINT_MAX, z), y)
> 
> Once that is sorted out then, yep, using these for Os sounds like a good plan.
Re: predicated reductions - could they be handled by replacing masked-off elements with `Identity` using a select prior to reduction? To be potentially folded later by suitable targets into a predicated reduction operation which they may support.
Somewhat akin to "passthru" values of masked loads.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7286
+    RecurrenceDescriptor &RdxDesc = Reduction.second;
+    if (CM.useInloopReductions(Reduction.first)) {
+      PHINode *Phi = Reduction.first;
----------------
dmgreen wrote:
> Ayal wrote:
> > dmgreen wrote:
> > > Ayal wrote:
> > > > dmgreen wrote:
> > > > > Ayal wrote:
> > > > > > Iterate over in loop reductions?
> > > > > Do you mean adding an iterator for iterating over reductions, stepping over the ones not inloop?
> > > > > 
> > > > > It would seem like it's similar to the existing code, but as a new iterator class. My gut says the current code is simpler and clearer what is going on?
> > > > Suggestion was to iterate over the PHIs/elements of InloopReductionChains, rather than over all reduction PHIs of Legal->getReductionVars().
> > > > 
> > > > (Better early-exit via "if (!CM.isInLoopReduction(Reduction.first)) continue;")
> > > I believe that InloopReductionChains would not iterate in a deterministic order, which is why I avoided it.
> > > 
> > > Perhaps that would not matter here? The reductions should be independent anyway. Seems safer to try and use deterministic ordering anyway if we can.
> > Agreed it would be better to use deterministic ordering. How about letting InloopReductionChains be a MapVector and iterate over
> > for (auto &Reduction : CM.getInloopReductions())?
> > The number of reductions is expected to be small, w/o removals.
> MapVector sounds good. I've changed it to use that and tried to use that in a few more places. Let me know what you think.
Uses as MapVector look good to me, thanks. Can also retain isInLoopReduction(PHI).

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75069/new/

https://reviews.llvm.org/D75069