[PATCH] D98435: [LoopVectorize] Add strict in-order reduction support for fixed-width vectorization

Fri Mar 19 10:30:02 PDT 2021

kmclaughlin added a comment.

Thanks all for reviewing this!

================
Comment at: llvm/include/llvm/Analysis/IVDescriptors.h:257
   bool IsSigned = false;
+  // If this is an ordered reduction
+  bool IsOrdered = false;
----------------
dmgreen wrote:
> david-arm wrote:
> > david-arm wrote:
> > > spatel wrote:
> > > > dmgreen wrote:
> > > > > Am I correct in saying that Ordered reductions are a subset of floating point reductions that
> > > > > - don't have AllowReassoc on the fadd
> > > > > - have only a single fadd added in each loop iteration (but possibly predicated to be one of several additions). So either fadd(redphi(..)) or phi(fadd(redphi), redphi) or phi(fadd(redphi), fadd(redphi)) ?
> > > > > 
> > > > > And that we can't easily get this info from just the FMF or the ExactFPMathInst?
> > > > I have the same question. I was looking at this code recently ( 36a489d19475 , 1bee549737ac ) -- but I'm still not sure if we are behaving correctly or optimally. 
> > > > 
> > > > The "ExactFPMathInst" seems to provide the same information. I experimented with changing that to a bool. The only reason we appear to save an instruction pointer rather than a bool is so we can provide a prettier optimization remark in the case a loop is not vectorized. Ie, we say something like: "The FP math at line 42 is not associative" rather than "The loop starting at line 39 requires exact FP math".
> > > I think the IsOrdered flag here is more of a convenience so that we can avoid calling the more expensive `checkOrderedReduction` function every time we may want to use strict, in-order reduction intrinsics. The `checkOrderedReduction` does cast the instruction to a `FPMathOperator` and look for the allows-reassoc flag.
> > Hi @spatel, oh I see now what you mean. I didn't realise we now had a ExactFPMathInst in RecurrenceDescriptor. It looks like you're saying instead of setting the IsOrdered flag we can just set ExactFPMathInst to the instruction in ReduxDesc, which then gets passed on to the RecurrenceDescriptor.
> As far as I understand, we still can not vectorize in-order reductions with multiple adds in the loop. Something like https://godbolt.org/z/c9qd1v. The ordering would change if we tried.
> 
> So the flag may still be needed for the second bullet point above, which is a large part of the checkOrderedReduction. Either that or a way to compute that there is only a single fadd (possibly through a select/if block phi).
Hi @dmgreen & @spatel, I also didn't realise that ExactFPMathInst is very similar and already checks the hasAllowReassoc flag. I've left the IsOrdered flag in this patch as I think there is still a need for it in the case mentioned above where there is a chain of fadds (I've added a test for this to `strict-fadd.ll`). I also changed the conditions in checkOrderedReduction slightly, to check if `Exit == ExactFPMathInst`.

================
Comment at: llvm/lib/Analysis/IVDescriptors.cpp:204
+  // If this comes from a Phi node, look through it
+  if (auto *EIP = dyn_cast<PHINode>(Exit)) {
+    if (EIP->getNumIncomingValues() != 2)
----------------
dmgreen wrote:
> I think this may need to check all branches of the phi, if I'm understanding what this is doing exactly. They could have different instructions down each path with different flags (although that would be quite rare I suspect).
> 
> There might also be selects, if this is matching on if-block phis. I'm not sure if that is handled in AddReductionVar though.
I realised whilst trying to add a better test than `@fadd_conditional_rdx` for conditional reductions that with the tests I have written this part of the function is redundant.
I've removed it from this patch for now since I don't think I can write any tests for it yet - I believe in addition to looking through Phi nodes here and checking all branches, I would also need to change `getReductionOpChain` to look through Phis and Selects in order to vectorize with inloop reductions, is this correct?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4361
   {
-    // Floating-point operations should have some FMF to enable the reduction.
-    IRBuilderBase::FastMathFlagGuard FMFG(Builder);
-    Builder.setFastMathFlags(RdxDesc.getFastMathFlags());
-    for (unsigned Part = 1; Part < UF; ++Part) {
-      Value *RdxPart = State.get(LoopExitInstDef, Part);
-      if (Op != Instruction::ICmp && Op != Instruction::FCmp) {
-        ReducedPartRdx = Builder.CreateBinOp(
-            (Instruction::BinaryOps)Op, RdxPart, ReducedPartRdx, "bin.rdx");
-      } else {
-        ReducedPartRdx = createMinMaxOp(Builder, RK, ReducedPartRdx, RdxPart);
+    if (Cost->isInLoopReduction(Phi) && useOrderedReductions(RdxDesc))
+      ReducedPartRdx = State.get(LoopExitInstDef, UF - 1);
----------------
dmgreen wrote:
> Cost->isInLoopReduction(Phi) -> IsInLoopReductionPhi
> 
> Does this work in-order for UF > 1? I feel like the order of the adds will be changed, so that it's no-longer in-order.
> 
> If not, is this code needed, if it can only be using UF == 1 and ReducedPartRdx is set to State.get(LoopExitInstDef, 0) above.
The changes I made to `VPReductionRecipe::execute` should ensure that ordering is enforced when the UF > 1, through the way the chain is constructed for each unrolled part. I was missing a change to fixReduction to correctly set the incoming values of Phi for each unrolled part in my first patch, however I believe this is fixed now and the test for unrolling (`@fadd_strict_unroll`) has been updated accordingly.

================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll:1
+; RUN: opt < %s -loop-vectorize -mtriple aarch64-unknown-linux-gnu -mattr=+sve -enable-strict-reducs -S | FileCheck %s -check-prefix=CHECK
+
----------------
fhahn wrote:
> `+sve` is not needed for the test?
> 
> Can you also add a negative test?
Hi @fhahn, I've added a negative test where we have multiple fadds in the loop (@fadd_multiple) & removed `+sve`

================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll:8
+; CHECK: %[[LOAD:.*]] = load <8 x float>, <8 x float>*
+; CHECK: %[[MASKED_VEC:.*]] = select <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x float> %[[LOAD]], <8 x float> zeroinitializer
+; CHECK: %[[RDX]] = call float @llvm.vector.reduce.fadd.v8f32(float %[[VEC_PHI]], <8 x float> %[[MASKED_VEC]])
----------------
dmgreen wrote:
> Note that the identity value for a fadd (without nsz) should be -0.0, not 0.0. Otherwise 0.0 + n doesn't always equal n. https://alive2.llvm.org/ce/z/LS2RK3
> 
> I believe the select would only be needed if the reduction is predicated though, as mentioned above.
I've created a separate patch which changes the identity value to -0.0 (D98963)

================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd.ll:118
+
+define float @fadd_conditional_rdx(float* noalias nocapture readonly %a, float* noalias nocapture readonly %b, i64 %n) {
+; CHECK-LABEL: @fadd_conditional_rdx
----------------
david-arm wrote:
> fhahn wrote:
> > I don't think this test is testing what you intended it to. The condition `> 0.5f` is in the loop pre-header. I think it should be in the loop body, so it tests generating the correct mask? (The C source here is a bit misleading IMO)
> Hi @fhahn, I suspect it's because when compiled with clang the condition is hoisted out before we reach the vectoriser, but you're right I think that the condition should be in the loop. At the moment this test is really the same as `@fadd_strict`.
I tried to write a test where the condition is in the loop, but I don't think this is possible yet with just the changes in this patch (I believe there would also be some changes needed to getReductionOpChain to try and look through Phis before I could try this).
Since this isn't testing anything that `@fadd_strict` doesn't already, I've removed it and instead added `@fadd_predicated` (which uses loop.vectorize.predicate.enable) and `@fadd_conditional` to test the masking in `VPReductionRecipe::execute` with in-order reductions.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D98435/new/

https://reviews.llvm.org/D98435