[llvm] r236031 - transform fadd chains to increase parallelism

Thu Apr 30 08:19:47 PDT 2015

> Also, isn’t there the exact same register pressure concern in the DAG as
well? Especially considering the fact that this is target-independent?

Now that the patch is reverted due to a spilling catastrophe, this seems
obvious. :)

I also seem to have completely misjudged the potential compile-time hit. I
did look at that using test-suite on x86, so I'd love to know if D9363 made
any difference for your out-of-tree target and/or tests.

> By the way the name of the function is very generic (ReassociateBinops),
do you plan to extend it to do perform transformations?
> Right now it does not carry what it does.

I had a vague feeling that someone, someday would extend it in some
interesting way, but the truth is really just that I don't know what to
name this:

//  (op N0: (op N00: (op z, w), N01: y), N1: x) ->
//  (op N00: (op z, w), (op N1: x, N01: y))

On Wed, Apr 29, 2015 at 10:03 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:

> Hi Sanjay,
>
> Thanks for your detailed answer.
>
>
> On Apr 29, 2015, at 8:03 PM, Sanjay Patel <spatel at rotateright.com> wrote:
>
> Hi Mehdi,
>
> Let me answer the last question first: why not instcombine?
> Because I noticed this pattern was created by another DAGCombine in:
> http://reviews.llvm.org/D8941
>
> So although doing the fold here does not preclude an instcombine, doing an
> instcombine alone would not be sufficient.
>
>
> Good point!
>
>
> I'm also not sure that the transform should be done that early because
> it's not a completely free transform; I think there is a potential register
> pressure concern.
>
>
>
> How does the fact that we do it early would increase the register
> pressure?
> Also, isn’t there the exact same register pressure concern in the DAG as
> well? Especially considering the fact that this is target-independent?
>
>
> I agree that the fold should be applied more generally. Please see the
> follow up patch and post commit thread for r236171. I welcome any advice on
> how to do this better.
>
>
> I gave a try in http://reviews.llvm.org/D9363 ; let me know what you
> think.
> By the way the name of the function is very generic (ReassociateBinops),
> do you plan to extend it to do perform transformations? Right now it does
> not carry what it does.
>
>
>
> Regarding the canonicalization, I wasn't aware of any downside to
> commuting the operands, so yes, I added it to not have to duplicate the
> fold logic for all permutations of the operand order. I was also
> inf-looping in an early draft of the patch, and canonicalization was my
> solution to avoid that without multiplying the if checks. If these are
> not valid reasons for the canonicalization, I can fix it.
>
>
> This is introducing a strong canonicalization which is not very
> constrained. No target would be able to canonicalize on more specific
> conditions.
> I don’t know people use to judge if a canonicalization is justified or
> not, I tend to at least look at a “cost/benefit” analysis:
> - how does this enable/simplify other transformations?
> - how does this constrains other transformations (infinite loop problem…)
>
> The benefits are unclear to me right now, but that does not mean they
> don’t exist.
>
>
> —
> Mehdi
>
>
>
>
>
> On Wednesday, April 29, 2015, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>> Hello Sanjay,
>>
>> In this commit you introduce a new canonicalization in the DAG:
>>
>>  (fadd (otherop, fadd)) -> (fadd (fadd, otherop))
>>
>> It is not clear to me why this canonicalization is desirable in general
>> (other than to ease the implementation of your transformation, which does
>> not seem to justify it alone IMO).
>>
>> Also, if such a canonicalization is needed, my feeling is that it
>> shouldn’t be hidden deeply here, and should probably apply to a bunch of
>> other binop.
>>
>> As a side question: I’m curious why isn’t this been performed in
>> instcombine?
>>
>> Thanks,
>>
>> Mehdi
>>
>>
>>
>> > On Apr 28, 2015, at 2:03 PM, Sanjay Patel <spatel at rotateright.com>
>> wrote:
>> >
>> > Author: spatel
>> > Date: Tue Apr 28 16:03:22 2015
>> > New Revision: 236031
>> >
>> > URL: http://llvm.org/viewvc/llvm-project?rev=236031&view=rev
>> > Log:
>> > transform fadd chains to increase parallelism
>> >
>> > This is a compromise: with this simple patch, we should always handle a
>> chain of exactly 3
>> > operations optimally, but we're not generating the optimal balanced
>> binary tree for a longer
>> > sequence.
>> >
>> > In general, this transform will reduce the dependency chain for a
>> sequence of instructions
>> > using N operands from a worst case N-1 dependent operations to N/2
>> dependent operations.
>> > The optimal balanced binary tree would reduce the chain to log2(N).
>> >
>> > The trade-off for not dealing with longer sequences is: (1) we have
>> less complexity in the
>> > compiler, (2) we avoid unknown compile-time blowup calculating a
>> balanced tree, and (3) we
>> > don't need to worry about the increased register pressure required to
>> parallelize longer
>> > sequences. It also seems unlikely that we would ever encounter really
>> long strings of
>> > dependent ops like that in the wild, but I'm not sure how to verify
>> that speculation.
>> > FWIW, I see no perf difference for test-suite running on btver2
>> (x86-64) with -ffast-math
>> > and this patch.
>> >
>> > We can extend this patch to cover other associative operations such as
>> fmul, fmax, fmin,
>> > integer add, integer mul.
>> >
>> > This is a partial fix for:
>> > https://llvm.org/bugs/show_bug.cgi?id=17305
>> >
>> > and if extended:
>> > https://llvm.org/bugs/show_bug.cgi?id=21768
>> > https://llvm.org/bugs/show_bug.cgi?id=23116
>> >
>> > The issue also came up in:
>> > http://reviews.llvm.org/D8941
>> >
>> > Differential Revision: http://reviews.llvm.org/D9232
>> >
>> >
>> > Modified:
>> >    llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
>> >    llvm/trunk/test/CodeGen/X86/fp-fast.ll
>> >
>> > Modified: llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
>> > URL:
>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp?rev=236031&r1=236030&r2=236031&view=diff
>> >
>> ==============================================================================
>> > --- llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (original)
>> > +++ llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp Tue Apr 28
>> 16:03:22 2015
>> > @@ -7801,6 +7801,24 @@ SDValue DAGCombiner::visitFADD(SDNode *N
>> >                            N0.getOperand(0), DAG.getConstantFP(4.0, DL,
>> VT));
>> >       }
>> >     }
>> > +
>> > +    // Canonicalize chains of adds to LHS to simplify the following
>> transform.
>> > +    if (N0.getOpcode() != ISD::FADD && N1.getOpcode() == ISD::FADD)
>> > +      return DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N0);
>> > +
>> > +    // Convert a chain of 3 dependent operations into 2 independent
>> operations
>> > +    // and 1 dependent operation:
>> > +    //  (fadd N0: (fadd N00: (fadd z, w), N01: y), N1: x) ->
>> > +    //  (fadd N00: (fadd z, w), (fadd N1: x, N01: y))
>> > +    if (N0.getOpcode() == ISD::FADD &&  N0.hasOneUse() &&
>> > +        N1.getOpcode() != ISD::FADD) {
>> > +      SDValue N00 = N0.getOperand(0);
>> > +      if (N00.getOpcode() == ISD::FADD) {
>> > +        SDValue N01 = N0.getOperand(1);
>> > +        SDValue NewAdd = DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N01);
>> > +        return DAG.getNode(ISD::FADD, SDLoc(N), VT, N00, NewAdd);
>> > +      }
>> > +    }
>> >   } // enable-unsafe-fp-math
>> >
>> >   // FADD -> FMA combines:
>> >
>> > Modified: llvm/trunk/test/CodeGen/X86/fp-fast.ll
>> > URL:
>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/fp-fast.ll?rev=236031&r1=236030&r2=236031&view=diff
>> >
>> ==============================================================================
>> > --- llvm/trunk/test/CodeGen/X86/fp-fast.ll (original)
>> > +++ llvm/trunk/test/CodeGen/X86/fp-fast.ll Tue Apr 28 16:03:22 2015
>> > @@ -113,3 +113,46 @@ define float @test11(float %a) {
>> >   %t2 = fadd float %a, %t1
>> >   ret float %t2
>> > }
>> > +
>> > +; Verify that the first two adds are independent; the destination
>> registers
>> > +; are used as source registers for the third add.
>> > +
>> > +define float @reassociate_adds1(float %a, float %b, float %c, float
>> %d) {
>> > +; CHECK-LABEL: reassociate_adds1:
>> > +; CHECK:       # BB#0:
>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>> > +; CHECK-NEXT:    retq
>> > +  %add0 = fadd float %a, %b
>> > +  %add1 = fadd float %add0, %c
>> > +  %add2 = fadd float %add1, %d
>> > +  ret float %add2
>> > +}
>> > +
>> > +define float @reassociate_adds2(float %a, float %b, float %c, float
>> %d) {
>> > +; CHECK-LABEL: reassociate_adds2:
>> > +; CHECK:       # BB#0:
>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>> > +; CHECK-NEXT:    retq
>> > +  %add0 = fadd float %a, %b
>> > +  %add1 = fadd float %c, %add0
>> > +  %add2 = fadd float %add1, %d
>> > +  ret float %add2
>> > +}
>> > +
>> > +define float @reassociate_adds3(float %a, float %b, float %c, float
>> %d) {
>> > +; CHECK-LABEL: reassociate_adds3:
>> > +; CHECK:       # BB#0:
>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>> > +; CHECK-NEXT:    retq
>> > +  %add0 = fadd float %a, %b
>> > +  %add1 = fadd float %add0, %c
>> > +  %add2 = fadd float %d, %add1
>> > +  ret float %add2
>> > +}
>> > +
>> >
>> >
>> > _______________________________________________
>> > llvm-commits mailing list
>> > llvm-commits at cs.uiuc.edu
>> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/1b4a2a76/attachment.html>