[llvm] r236031 - transform fadd chains to increase parallelism

Wed Apr 29 21:12:31 PDT 2015

I’ve reverted this in r236199.  We’ll continue to work on finding a shareable testcase that illustrates the impacts we’re seeing.

—Owen

> On Apr 29, 2015, at 8:32 PM, Sanjay Patel <spatel at rotateright.com> wrote:
> 
> Oh wow, sorry. I knew there was a register pressure risk, but I guessed wrongly that it would be contained by the limited length of the transform. 
> 
> If it's easier, please feel free to revert. I won't be at my dev machine until tomorrow AM.
> 
> I'm not seeing how in-order vs. OOO is a factor?
> 
> On Wednesday, April 29, 2015, Owen Anderson <resistor at mac.com> wrote:
> Woah!  This change is very not OK for in-order targets!  Additionally, we’re observing massive regressions, including catastrophic spilling and catastrophic compile time, resulting from it.  We’re still reducing/sanitizing a testcase we can share, but *at least* this needs to move under some kind of target hook so that in-order targets can opt-out.
> 
> —Owen
> 
> > On Apr 28, 2015, at 2:03 PM, Sanjay Patel <spatel at rotateright.com> wrote:
> >
> > Author: spatel
> > Date: Tue Apr 28 16:03:22 2015
> > New Revision: 236031
> >
> > URL: http://llvm.org/viewvc/llvm-project?rev=236031&view=rev
> > Log:
> > transform fadd chains to increase parallelism
> >
> > This is a compromise: with this simple patch, we should always handle a chain of exactly 3
> > operations optimally, but we're not generating the optimal balanced binary tree for a longer
> > sequence.
> >
> > In general, this transform will reduce the dependency chain for a sequence of instructions
> > using N operands from a worst case N-1 dependent operations to N/2 dependent operations.
> > The optimal balanced binary tree would reduce the chain to log2(N).
> >
> > The trade-off for not dealing with longer sequences is: (1) we have less complexity in the
> > compiler, (2) we avoid unknown compile-time blowup calculating a balanced tree, and (3) we
> > don't need to worry about the increased register pressure required to parallelize longer
> > sequences. It also seems unlikely that we would ever encounter really long strings of
> > dependent ops like that in the wild, but I'm not sure how to verify that speculation.
> > FWIW, I see no perf difference for test-suite running on btver2 (x86-64) with -ffast-math
> > and this patch.
> >
> > We can extend this patch to cover other associative operations such as fmul, fmax, fmin,
> > integer add, integer mul.
> >
> > This is a partial fix for:
> > https://llvm.org/bugs/show_bug.cgi?id=17305
> >
> > and if extended:
> > https://llvm.org/bugs/show_bug.cgi?id=21768
> > https://llvm.org/bugs/show_bug.cgi?id=23116
> >
> > The issue also came up in:
> > http://reviews.llvm.org/D8941
> >
> > Differential Revision: http://reviews.llvm.org/D9232
> >
> >
> > Modified:
> >    llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
> >    llvm/trunk/test/CodeGen/X86/fp-fast.ll
> >
> > Modified: llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp?rev=236031&r1=236030&r2=236031&view=diff
> > ==============================================================================
> > --- llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (original)
> > +++ llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp Tue Apr 28 16:03:22 2015
> > @@ -7801,6 +7801,24 @@ SDValue DAGCombiner::visitFADD(SDNode *N
> >                            N0.getOperand(0), DAG.getConstantFP(4.0, DL, VT));
> >       }
> >     }
> > +
> > +    // Canonicalize chains of adds to LHS to simplify the following transform.
> > +    if (N0.getOpcode() != ISD::FADD && N1.getOpcode() == ISD::FADD)
> > +      return DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N0);
> > +
> > +    // Convert a chain of 3 dependent operations into 2 independent operations
> > +    // and 1 dependent operation:
> > +    //  (fadd N0: (fadd N00: (fadd z, w), N01: y), N1: x) ->
> > +    //  (fadd N00: (fadd z, w), (fadd N1: x, N01: y))
> > +    if (N0.getOpcode() == ISD::FADD &&  N0.hasOneUse() &&
> > +        N1.getOpcode() != ISD::FADD) {
> > +      SDValue N00 = N0.getOperand(0);
> > +      if (N00.getOpcode() == ISD::FADD) {
> > +        SDValue N01 = N0.getOperand(1);
> > +        SDValue NewAdd = DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N01);
> > +        return DAG.getNode(ISD::FADD, SDLoc(N), VT, N00, NewAdd);
> > +      }
> > +    }
> >   } // enable-unsafe-fp-math
> >
> >   // FADD -> FMA combines:
> >
> > Modified: llvm/trunk/test/CodeGen/X86/fp-fast.ll
> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/fp-fast.ll?rev=236031&r1=236030&r2=236031&view=diff
> > ==============================================================================
> > --- llvm/trunk/test/CodeGen/X86/fp-fast.ll (original)
> > +++ llvm/trunk/test/CodeGen/X86/fp-fast.ll Tue Apr 28 16:03:22 2015
> > @@ -113,3 +113,46 @@ define float @test11(float %a) {
> >   %t2 = fadd float %a, %t1
> >   ret float %t2
> > }
> > +
> > +; Verify that the first two adds are independent; the destination registers
> > +; are used as source registers for the third add.
> > +
> > +define float @reassociate_adds1(float %a, float %b, float %c, float %d) {
> > +; CHECK-LABEL: reassociate_adds1:
> > +; CHECK:       # BB#0:
> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
> > +; CHECK-NEXT:    retq
> > +  %add0 = fadd float %a, %b
> > +  %add1 = fadd float %add0, %c
> > +  %add2 = fadd float %add1, %d
> > +  ret float %add2
> > +}
> > +
> > +define float @reassociate_adds2(float %a, float %b, float %c, float %d) {
> > +; CHECK-LABEL: reassociate_adds2:
> > +; CHECK:       # BB#0:
> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
> > +; CHECK-NEXT:    retq
> > +  %add0 = fadd float %a, %b
> > +  %add1 = fadd float %c, %add0
> > +  %add2 = fadd float %add1, %d
> > +  ret float %add2
> > +}
> > +
> > +define float @reassociate_adds3(float %a, float %b, float %c, float %d) {
> > +; CHECK-LABEL: reassociate_adds3:
> > +; CHECK:       # BB#0:
> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
> > +; CHECK-NEXT:    retq
> > +  %add0 = fadd float %a, %b
> > +  %add1 = fadd float %add0, %c
> > +  %add2 = fadd float %d, %add1
> > +  ret float %add2
> > +}
> > +
> >
> >
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>