[PATCH] D75069: [LoopVectorizer] Inloop vector reductions

Wed May 20 17:08:01 PDT 2020

dmgreen added inline comments.

================
Comment at: llvm/lib/Analysis/IVDescriptors.cpp:812
+    if (LHS->getOpcode() == Opcode && L->contains(LHS->getParent()) &&
+        LHS->hasOneUse() &&
+        findPathToPhi(LHS, ReductionOperations, Opcode, Phi, L)) {
----------------
fhahn wrote:
> Ayal wrote:
> > Looking for a chain of hasOneUse op's would be easier starting from the Phi and going downwards, until reaching LoopExitInstr?
> > 
> > Note that when extended to handle reductions with conditional bumps, some ops will have more than one use.
> Instead of doing a recursive traversal, would it be simpler to just do the traversal iteratively, at least as long as we are only using at  a single use chain?
Yeah, that direction makes it a lot simpler. Thanks.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3769
     // MinMax reduction have the start value as their identify.
-    if (VF == 1) {
+    if (VF == 1 || UseInloopReductions) {
       VectorStart = Identity = ReductionStartValue;
----------------
Ayal wrote:
> This is dead code if cmp/select chains are not recognized yet, as noted above.
I've added the code to handle minmax too (but not tested it a lot yet. I will try that now).

MVE has instructions for integer min/max reductions, but they can be slow enough to make them not worth using over a normal vmin/vmax. Adds are always not-slower-enough to warrant the inloop reduction (and have other advantages like handling higher type sizes and folding in more instructions.)

My point is that min/max, like some of the other fadd/mul/and/etc might not be used by MVE yet. If you think the code is more hassle than it deserves, then we could take them out for the time being. I'd like to leave them in for consistency though, even if it's not used straight away.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3847
+  if ((VF > 1 && !UseInloopReductions) &&
+      Phi->getType() != RdxDesc.getRecurrenceType()) {
     Type *RdxVecTy = VectorType::get(RdxDesc.getRecurrenceType(), VF);
----------------
Ayal wrote:
> Checking if !UseInLoopReductions is redundant, as current in loop reductions have phi's of the same type as the recurrence; right?
> May leave it as an assert, until in loop reductions handle truncation/extension.
Right. We look from the phi down (now), so can't go through the 'and' we would need to detect the different bitwidth. That would leave 'and' reduction,  which I think would not got through that codepath to create type promoted phis?

I've put an explicit check in for it to be sure, and added some testing for those cases.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:6537
+    // want to record it as such.
+    if (!ForceInloopReductions)
+      continue;
----------------
Ayal wrote:
> Move this invariant check out as another early-exit?
This does look a little strange here on it's own. The followup patch to add the TTI hook makes it look like:
  if (!PreferInloopReductions &&
        !TTI.preferInloopReduction(Opcode, Phi->getType(),
                                   TargetTransformInfo::ReductionFlags()))
      continue;

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:7286
+    RecurrenceDescriptor &RdxDesc = Reduction.second;
+    if (CM.useInloopReductions(Reduction.first)) {
+      PHINode *Phi = Reduction.first;
----------------
Ayal wrote:
> Iterate over in loop reductions?
Do you mean adding an iterator for iterating over reductions, stepping over the ones not inloop?

It would seem like it's similar to the existing code, but as a new iterator class. My gut says the current code is simpler and clearer what is going on?

================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.h:238
   VPTransformState(unsigned VF, unsigned UF, LoopInfo *LI, DominatorTree *DT,
-                   IRBuilder<> &Builder, VectorizerValueMap &ValueMap,
-                   InnerLoopVectorizer *ILV, VPCallback &Callback)
-      : VF(VF), UF(UF), Instance(), LI(LI), DT(DT), Builder(Builder),
+                   const TargetTransformInfo *TTI, IRBuilder<> &Builder,
+                   VectorizerValueMap &ValueMap, InnerLoopVectorizer *ILV,
----------------
Ayal wrote:
> Too bad this requires passing TTI through the State everywhere.
> Perhaps storing TTI in the recipe would be somewhat better.
I've changed it to be stored there. It does mean multiple things are holding TTI. Let me know what you think.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanValue.h:47
   friend class VPSlotTracker;
+  friend class LoopVectorizationPlanner;

----------------
Ayal wrote:
> Why is this friendship needed?
Ah. Old code no longer needed.

================
Comment at: llvm/test/Transforms/LoopVectorize/reduction-inloop.ll:2
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
-; RUN: opt < %s  -loop-vectorize -force-vector-interleave=1 -force-vector-width=4 -force-reduction-intrinsics -dce -instcombine -S | FileCheck %s
+; RUN: opt < %s  -loop-vectorize -force-vector-interleave=1 -force-vector-width=4 -force-inloop-reductions -force-reduction-intrinsics -dce -instcombine -S | FileCheck %s

----------------
Ayal wrote:
> It is indeed good to examine the diff in this review, but existing non -force-inloop-reductions RUN should be retained; e.g., by cloning the file for a separate -force-inloop-reductions RUN. 
> 
> Would be good to also test with -force-vector-interleave greater than 1.
This testcase is new for this patch (although that wasn't obvious from how I presented it). I was just trying to show the diffs here and the option didn't exist before this patch, hence it wasn't already there.

Are you suggesting we should still have these same test for without -force-inloop-reductions? I think those should already be tested elsewhere, but I can add it if you think it's useful.

================
Comment at: llvm/test/Transforms/LoopVectorize/reduction-inloop.ll:6

 define i32 @reduction_sum(i32 %n, i32* noalias nocapture %A, i32* noalias nocapture %B) nounwind uwtable readonly noinline ssp {
 ; CHECK-LABEL: @reduction_sum(
----------------
fhahn wrote:
> I think it would be also good to add a few additional simpler test cases, e.g. reductions with only 1 or 2 reduction steps in the loop, some with constant scalar operands. And with some of the (unnecessary trounces stripped). Also the test diff can be slightly reduced by having constant trip counts. 
> 
> It's good to have the more complex ones, but it would also be good to start with simpler cases. That way it is probably easier to follow what's going on. And it would reduce the diff quite a bit if the tests for the various different opcodes would be a bit more compact.
(unnecessary trounces stripped)?

The const case is going to look a little odd until there is a instcombine type fold for constant reductions. I'll see about adding that.

Otherwise it is now hopefully a bit leaner. Even if I have added a few extra tests.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75069/new/

https://reviews.llvm.org/D75069