[llvm] r236031 - transform fadd chains to increase parallelism

Thu Apr 30 08:29:43 PDT 2015

> On Apr 30, 2015, at 8:19 AM, Sanjay Patel <spatel at rotateright.com> wrote:
> 
> > Also, isn’t there the exact same register pressure concern in the DAG as well? Especially considering the fact that this is target-independent?
> 
> Now that the patch is reverted due to a spilling catastrophe, this seems obvious. :)

I don’t know, you may see other reasons why it can even be worse in InstCombine :)

> 
> I also seem to have completely misjudged the potential compile-time hit. I did look at that using test-suite on x86, so I'd love to know if D9363 made any difference for your out-of-tree target and/or tests.

Well it solves the infinite loop we had locally at least :)

I’m still using bugpoint on the test that goes from 6338 to 18559 instructions locally when applying your change (with D9363).

> 
> 
> > By the way the name of the function is very generic (ReassociateBinops), do you plan to extend it to do perform transformations?
> > Right now it does not carry what it does.
> 
> I had a vague feeling that someone, someday would extend it in some interesting way, but the truth is really just that I don't know what to name this:
> 
> //  (op N0: (op N00: (op z, w), N01: y), N1: x) ->
> //  (op N00: (op z, w), (op N1: x, N01: y))

I don’t have a great suggestion.

— 
Mehdi

> 
> 
> 
> 
> 
> On Wed, Apr 29, 2015 at 10:03 PM, Mehdi Amini <mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>> wrote:
> Hi Sanjay, 
> 
> Thanks for your detailed answer.
> 
> 
>> On Apr 29, 2015, at 8:03 PM, Sanjay Patel <spatel at rotateright.com <mailto:spatel at rotateright.com>> wrote:
>> 
>> Hi Mehdi,
>> 
>> Let me answer the last question first: why not instcombine?
>> Because I noticed this pattern was created by another DAGCombine in:
>> http://reviews.llvm.org/D8941 <http://reviews.llvm.org/D8941>
>> 
>> So although doing the fold here does not preclude an instcombine, doing an instcombine alone would not be sufficient.
> 
> Good point!
> 
> 
>> I'm also not sure that the transform should be done that early because it's not a completely free transform; I think there is a potential register pressure concern.
> 
> 
> How does the fact that we do it early would increase the register pressure? 
> Also, isn’t there the exact same register pressure concern in the DAG as well? Especially considering the fact that this is target-independent?
> 
>> 
>> I agree that the fold should be applied more generally. Please see the follow up patch and post commit thread for r236171. I welcome any advice on how to do this better.
> 
> I gave a try in http://reviews.llvm.org/D9363 <http://reviews.llvm.org/D9363> ; let me know what you think.
> By the way the name of the function is very generic (ReassociateBinops), do you plan to extend it to do perform transformations? Right now it does not carry what it does.
> 
> 
>> 
>> Regarding the canonicalization, I wasn't aware of any downside to commuting the operands, so yes, I added it to not have to duplicate the fold logic for all permutations of the operand order. I was also inf-looping in an early draft of the patch, and canonicalization was my solution to avoid that without multiplying the if checks. If these are not valid reasons for the canonicalization, I can fix it.
> 
> This is introducing a strong canonicalization which is not very constrained. No target would be able to canonicalize on more specific conditions. 
> I don’t know people use to judge if a canonicalization is justified or not, I tend to at least look at a “cost/benefit” analysis: 
> - how does this enable/simplify other transformations?
> - how does this constrains other transformations (infinite loop problem…)
> 
> The benefits are unclear to me right now, but that does not mean they don’t exist.
> 
> 
> — 
> Mehdi
> 
> 
> 
>> 
>> 
>> On Wednesday, April 29, 2015, Mehdi Amini <mehdi.amini at apple.com <>> wrote:
>> Hello Sanjay,
>> 
>> In this commit you introduce a new canonicalization in the DAG:
>> 
>>  (fadd (otherop, fadd)) -> (fadd (fadd, otherop))
>> 
>> It is not clear to me why this canonicalization is desirable in general (other than to ease the implementation of your transformation, which does not seem to justify it alone IMO).
>> 
>> Also, if such a canonicalization is needed, my feeling is that it shouldn’t be hidden deeply here, and should probably apply to a bunch of other binop.
>> 
>> As a side question: I’m curious why isn’t this been performed in instcombine?
>> 
>> Thanks,
>> 
>> Mehdi
>> 
>> 
>> 
>> > On Apr 28, 2015, at 2:03 PM, Sanjay Patel <spatel at rotateright.com <>> wrote:
>> >
>> > Author: spatel
>> > Date: Tue Apr 28 16:03:22 2015
>> > New Revision: 236031
>> >
>> > URL: http://llvm.org/viewvc/llvm-project?rev=236031&view=rev <http://llvm.org/viewvc/llvm-project?rev=236031&view=rev>
>> > Log:
>> > transform fadd chains to increase parallelism
>> >
>> > This is a compromise: with this simple patch, we should always handle a chain of exactly 3
>> > operations optimally, but we're not generating the optimal balanced binary tree for a longer
>> > sequence.
>> >
>> > In general, this transform will reduce the dependency chain for a sequence of instructions
>> > using N operands from a worst case N-1 dependent operations to N/2 dependent operations.
>> > The optimal balanced binary tree would reduce the chain to log2(N).
>> >
>> > The trade-off for not dealing with longer sequences is: (1) we have less complexity in the
>> > compiler, (2) we avoid unknown compile-time blowup calculating a balanced tree, and (3) we
>> > don't need to worry about the increased register pressure required to parallelize longer
>> > sequences. It also seems unlikely that we would ever encounter really long strings of
>> > dependent ops like that in the wild, but I'm not sure how to verify that speculation.
>> > FWIW, I see no perf difference for test-suite running on btver2 (x86-64) with -ffast-math
>> > and this patch.
>> >
>> > We can extend this patch to cover other associative operations such as fmul, fmax, fmin,
>> > integer add, integer mul.
>> >
>> > This is a partial fix for:
>> > https://llvm.org/bugs/show_bug.cgi?id=17305 <https://llvm.org/bugs/show_bug.cgi?id=17305>
>> >
>> > and if extended:
>> > https://llvm.org/bugs/show_bug.cgi?id=21768 <https://llvm.org/bugs/show_bug.cgi?id=21768>
>> > https://llvm.org/bugs/show_bug.cgi?id=23116 <https://llvm.org/bugs/show_bug.cgi?id=23116>
>> >
>> > The issue also came up in:
>> > http://reviews.llvm.org/D8941 <http://reviews.llvm.org/D8941>
>> >
>> > Differential Revision: http://reviews.llvm.org/D9232 <http://reviews.llvm.org/D9232>
>> >
>> >
>> > Modified:
>> >    llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
>> >    llvm/trunk/test/CodeGen/X86/fp-fast.ll
>> >
>> > Modified: llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
>> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp?rev=236031&r1=236030&r2=236031&view=diff <http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp?rev=236031&r1=236030&r2=236031&view=diff>
>> > ==============================================================================
>> > --- llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (original)
>> > +++ llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp Tue Apr 28 16:03:22 2015
>> > @@ -7801,6 +7801,24 @@ SDValue DAGCombiner::visitFADD(SDNode *N
>> >                            N0.getOperand(0), DAG.getConstantFP(4.0, DL, VT));
>> >       }
>> >     }
>> > +
>> > +    // Canonicalize chains of adds to LHS to simplify the following transform.
>> > +    if (N0.getOpcode() != ISD::FADD && N1.getOpcode() == ISD::FADD)
>> > +      return DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N0);
>> > +
>> > +    // Convert a chain of 3 dependent operations into 2 independent operations
>> > +    // and 1 dependent operation:
>> > +    //  (fadd N0: (fadd N00: (fadd z, w), N01: y), N1: x) ->
>> > +    //  (fadd N00: (fadd z, w), (fadd N1: x, N01: y))
>> > +    if (N0.getOpcode() == ISD::FADD &&  N0.hasOneUse() &&
>> > +        N1.getOpcode() != ISD::FADD) {
>> > +      SDValue N00 = N0.getOperand(0);
>> > +      if (N00.getOpcode() == ISD::FADD) {
>> > +        SDValue N01 = N0.getOperand(1);
>> > +        SDValue NewAdd = DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N01);
>> > +        return DAG.getNode(ISD::FADD, SDLoc(N), VT, N00, NewAdd);
>> > +      }
>> > +    }
>> >   } // enable-unsafe-fp-math
>> >
>> >   // FADD -> FMA combines:
>> >
>> > Modified: llvm/trunk/test/CodeGen/X86/fp-fast.ll
>> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/fp-fast.ll?rev=236031&r1=236030&r2=236031&view=diff <http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/fp-fast.ll?rev=236031&r1=236030&r2=236031&view=diff>
>> > ==============================================================================
>> > --- llvm/trunk/test/CodeGen/X86/fp-fast.ll (original)
>> > +++ llvm/trunk/test/CodeGen/X86/fp-fast.ll Tue Apr 28 16:03:22 2015
>> > @@ -113,3 +113,46 @@ define float @test11(float %a) {
>> >   %t2 = fadd float %a, %t1
>> >   ret float %t2
>> > }
>> > +
>> > +; Verify that the first two adds are independent; the destination registers
>> > +; are used as source registers for the third add.
>> > +
>> > +define float @reassociate_adds1(float %a, float %b, float %c, float %d) {
>> > +; CHECK-LABEL: reassociate_adds1:
>> > +; CHECK:       # BB#0:
>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>> > +; CHECK-NEXT:    retq
>> > +  %add0 = fadd float %a, %b
>> > +  %add1 = fadd float %add0, %c
>> > +  %add2 = fadd float %add1, %d
>> > +  ret float %add2
>> > +}
>> > +
>> > +define float @reassociate_adds2(float %a, float %b, float %c, float %d) {
>> > +; CHECK-LABEL: reassociate_adds2:
>> > +; CHECK:       # BB#0:
>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>> > +; CHECK-NEXT:    retq
>> > +  %add0 = fadd float %a, %b
>> > +  %add1 = fadd float %c, %add0
>> > +  %add2 = fadd float %add1, %d
>> > +  ret float %add2
>> > +}
>> > +
>> > +define float @reassociate_adds3(float %a, float %b, float %c, float %d) {
>> > +; CHECK-LABEL: reassociate_adds3:
>> > +; CHECK:       # BB#0:
>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>> > +; CHECK-NEXT:    retq
>> > +  %add0 = fadd float %a, %b
>> > +  %add1 = fadd float %add0, %c
>> > +  %add2 = fadd float %d, %add1
>> > +  ret float %add2
>> > +}
>> > +
>> >
>> >
>> > _______________________________________________
>> > llvm-commits mailing list
>> > llvm-commits at cs.uiuc.edu <>
>> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits <http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits>
>> 
> 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/5a857a10/attachment.html>