[llvm] r236031 - transform fadd chains to increase parallelism

Thu Apr 30 10:49:11 PDT 2015

Hi Sanjay,

If you’re interested in examples, attached are two .ll files that shows a regression with your transformation in term of instructions count.

Best,

Mehdi

> On Apr 30, 2015, at 8:54 AM, Sanjay Patel <spatel at rotateright.com> wrote:
> 
> > I don’t know, you may see other reasons why it can even be worse in InstCombine :)
> 
> I was imagining that -instcombine or -reassociate would go all out and find the perfect balanced tree. This was a suggestion in:
> https://llvm.org/bugs/show_bug.cgi?id=17305#c1 <https://llvm.org/bugs/show_bug.cgi?id=17305#c1>
> 
> Even I saw the potential explosion from that and mentioned it in the review and commit msg. :)
> 
> >  I’m still using bugpoint on the test that goes from 6338 to 18559 instructions locally when applying your change (with D9363).
> 
> Awesome! I'm now thinking that we can't do this until post-DAG...as a machine-level peephole? There's just too much risk of spilling if we do it sooner.
> 
> 
> On Thu, Apr 30, 2015 at 9:29 AM, Mehdi Amini <mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>> wrote:
> 
>> On Apr 30, 2015, at 8:19 AM, Sanjay Patel <spatel at rotateright.com <mailto:spatel at rotateright.com>> wrote:
>> 
>> > Also, isn’t there the exact same register pressure concern in the DAG as well? Especially considering the fact that this is target-independent?
>> 
>> Now that the patch is reverted due to a spilling catastrophe, this seems obvious. :)
> 
> I don’t know, you may see other reasons why it can even be worse in InstCombine :)
> 
>> 
>> I also seem to have completely misjudged the potential compile-time hit. I did look at that using test-suite on x86, so I'd love to know if D9363 made any difference for your out-of-tree target and/or tests.
> 
> Well it solves the infinite loop we had locally at least :)
> 
> I’m still using bugpoint on the test that goes from 6338 to 18559 instructions locally when applying your change (with D9363).
> 
>> 
>> 
>> > By the way the name of the function is very generic (ReassociateBinops), do you plan to extend it to do perform transformations?
>> > Right now it does not carry what it does.
>> 
>> I had a vague feeling that someone, someday would extend it in some interesting way, but the truth is really just that I don't know what to name this:
>> 
>> //  (op N0: (op N00: (op z, w), N01: y), N1: x) ->
>> //  (op N00: (op z, w), (op N1: x, N01: y))
> 
> I don’t have a great suggestion.
> 
> — 
> Mehdi
> 
> 
>> 
>> 
>> 
>> 
>> 
>> On Wed, Apr 29, 2015 at 10:03 PM, Mehdi Amini <mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>> wrote:
>> Hi Sanjay, 
>> 
>> Thanks for your detailed answer.
>> 
>> 
>>> On Apr 29, 2015, at 8:03 PM, Sanjay Patel <spatel at rotateright.com <mailto:spatel at rotateright.com>> wrote:
>>> 
>>> Hi Mehdi,
>>> 
>>> Let me answer the last question first: why not instcombine?
>>> Because I noticed this pattern was created by another DAGCombine in:
>>> http://reviews.llvm.org/D8941 <http://reviews.llvm.org/D8941>
>>> 
>>> So although doing the fold here does not preclude an instcombine, doing an instcombine alone would not be sufficient.
>> 
>> Good point!
>> 
>> 
>>> I'm also not sure that the transform should be done that early because it's not a completely free transform; I think there is a potential register pressure concern.
>> 
>> 
>> How does the fact that we do it early would increase the register pressure? 
>> Also, isn’t there the exact same register pressure concern in the DAG as well? Especially considering the fact that this is target-independent?
>> 
>>> 
>>> I agree that the fold should be applied more generally. Please see the follow up patch and post commit thread for r236171. I welcome any advice on how to do this better.
>> 
>> I gave a try in http://reviews.llvm.org/D9363 <http://reviews.llvm.org/D9363> ; let me know what you think.
>> By the way the name of the function is very generic (ReassociateBinops), do you plan to extend it to do perform transformations? Right now it does not carry what it does.
>> 
>> 
>>> 
>>> Regarding the canonicalization, I wasn't aware of any downside to commuting the operands, so yes, I added it to not have to duplicate the fold logic for all permutations of the operand order. I was also inf-looping in an early draft of the patch, and canonicalization was my solution to avoid that without multiplying the if checks. If these are not valid reasons for the canonicalization, I can fix it.
>> 
>> This is introducing a strong canonicalization which is not very constrained. No target would be able to canonicalize on more specific conditions. 
>> I don’t know people use to judge if a canonicalization is justified or not, I tend to at least look at a “cost/benefit” analysis: 
>> - how does this enable/simplify other transformations?
>> - how does this constrains other transformations (infinite loop problem…)
>> 
>> The benefits are unclear to me right now, but that does not mean they don’t exist.
>> 
>> 
>> — 
>> Mehdi
>> 
>> 
>> 
>>> 
>>> 
>>> On Wednesday, April 29, 2015, Mehdi Amini <mehdi.amini at apple.com <>> wrote:
>>> Hello Sanjay,
>>> 
>>> In this commit you introduce a new canonicalization in the DAG:
>>> 
>>>  (fadd (otherop, fadd)) -> (fadd (fadd, otherop))
>>> 
>>> It is not clear to me why this canonicalization is desirable in general (other than to ease the implementation of your transformation, which does not seem to justify it alone IMO).
>>> 
>>> Also, if such a canonicalization is needed, my feeling is that it shouldn’t be hidden deeply here, and should probably apply to a bunch of other binop.
>>> 
>>> As a side question: I’m curious why isn’t this been performed in instcombine?
>>> 
>>> Thanks,
>>> 
>>> Mehdi
>>> 
>>> 
>>> 
>>> > On Apr 28, 2015, at 2:03 PM, Sanjay Patel <spatel at rotateright.com <>> wrote:
>>> >
>>> > Author: spatel
>>> > Date: Tue Apr 28 16:03:22 2015
>>> > New Revision: 236031
>>> >
>>> > URL: http://llvm.org/viewvc/llvm-project?rev=236031&view=rev <http://llvm.org/viewvc/llvm-project?rev=236031&view=rev>
>>> > Log:
>>> > transform fadd chains to increase parallelism
>>> >
>>> > This is a compromise: with this simple patch, we should always handle a chain of exactly 3
>>> > operations optimally, but we're not generating the optimal balanced binary tree for a longer
>>> > sequence.
>>> >
>>> > In general, this transform will reduce the dependency chain for a sequence of instructions
>>> > using N operands from a worst case N-1 dependent operations to N/2 dependent operations.
>>> > The optimal balanced binary tree would reduce the chain to log2(N).
>>> >
>>> > The trade-off for not dealing with longer sequences is: (1) we have less complexity in the
>>> > compiler, (2) we avoid unknown compile-time blowup calculating a balanced tree, and (3) we
>>> > don't need to worry about the increased register pressure required to parallelize longer
>>> > sequences. It also seems unlikely that we would ever encounter really long strings of
>>> > dependent ops like that in the wild, but I'm not sure how to verify that speculation.
>>> > FWIW, I see no perf difference for test-suite running on btver2 (x86-64) with -ffast-math
>>> > and this patch.
>>> >
>>> > We can extend this patch to cover other associative operations such as fmul, fmax, fmin,
>>> > integer add, integer mul.
>>> >
>>> > This is a partial fix for:
>>> > https://llvm.org/bugs/show_bug.cgi?id=17305 <https://llvm.org/bugs/show_bug.cgi?id=17305>
>>> >
>>> > and if extended:
>>> > https://llvm.org/bugs/show_bug.cgi?id=21768 <https://llvm.org/bugs/show_bug.cgi?id=21768>
>>> > https://llvm.org/bugs/show_bug.cgi?id=23116 <https://llvm.org/bugs/show_bug.cgi?id=23116>
>>> >
>>> > The issue also came up in:
>>> > http://reviews.llvm.org/D8941 <http://reviews.llvm.org/D8941>
>>> >
>>> > Differential Revision: http://reviews.llvm.org/D9232 <http://reviews.llvm.org/D9232>
>>> >
>>> >
>>> > Modified:
>>> >    llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
>>> >    llvm/trunk/test/CodeGen/X86/fp-fast.ll
>>> >
>>> > Modified: llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
>>> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp?rev=236031&r1=236030&r2=236031&view=diff <http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp?rev=236031&r1=236030&r2=236031&view=diff>
>>> > ==============================================================================
>>> > --- llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (original)
>>> > +++ llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp Tue Apr 28 16:03:22 2015
>>> > @@ -7801,6 +7801,24 @@ SDValue DAGCombiner::visitFADD(SDNode *N
>>> >                            N0.getOperand(0), DAG.getConstantFP(4.0, DL, VT));
>>> >       }
>>> >     }
>>> > +
>>> > +    // Canonicalize chains of adds to LHS to simplify the following transform.
>>> > +    if (N0.getOpcode() != ISD::FADD && N1.getOpcode() == ISD::FADD)
>>> > +      return DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N0);
>>> > +
>>> > +    // Convert a chain of 3 dependent operations into 2 independent operations
>>> > +    // and 1 dependent operation:
>>> > +    //  (fadd N0: (fadd N00: (fadd z, w), N01: y), N1: x) ->
>>> > +    //  (fadd N00: (fadd z, w), (fadd N1: x, N01: y))
>>> > +    if (N0.getOpcode() == ISD::FADD &&  N0.hasOneUse() &&
>>> > +        N1.getOpcode() != ISD::FADD) {
>>> > +      SDValue N00 = N0.getOperand(0);
>>> > +      if (N00.getOpcode() == ISD::FADD) {
>>> > +        SDValue N01 = N0.getOperand(1);
>>> > +        SDValue NewAdd = DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N01);
>>> > +        return DAG.getNode(ISD::FADD, SDLoc(N), VT, N00, NewAdd);
>>> > +      }
>>> > +    }
>>> >   } // enable-unsafe-fp-math
>>> >
>>> >   // FADD -> FMA combines:
>>> >
>>> > Modified: llvm/trunk/test/CodeGen/X86/fp-fast.ll
>>> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/fp-fast.ll?rev=236031&r1=236030&r2=236031&view=diff <http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/fp-fast.ll?rev=236031&r1=236030&r2=236031&view=diff>
>>> > ==============================================================================
>>> > --- llvm/trunk/test/CodeGen/X86/fp-fast.ll (original)
>>> > +++ llvm/trunk/test/CodeGen/X86/fp-fast.ll Tue Apr 28 16:03:22 2015
>>> > @@ -113,3 +113,46 @@ define float @test11(float %a) {
>>> >   %t2 = fadd float %a, %t1
>>> >   ret float %t2
>>> > }
>>> > +
>>> > +; Verify that the first two adds are independent; the destination registers
>>> > +; are used as source registers for the third add.
>>> > +
>>> > +define float @reassociate_adds1(float %a, float %b, float %c, float %d) {
>>> > +; CHECK-LABEL: reassociate_adds1:
>>> > +; CHECK:       # BB#0:
>>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
>>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT:    retq
>>> > +  %add0 = fadd float %a, %b
>>> > +  %add1 = fadd float %add0, %c
>>> > +  %add2 = fadd float %add1, %d
>>> > +  ret float %add2
>>> > +}
>>> > +
>>> > +define float @reassociate_adds2(float %a, float %b, float %c, float %d) {
>>> > +; CHECK-LABEL: reassociate_adds2:
>>> > +; CHECK:       # BB#0:
>>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
>>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT:    retq
>>> > +  %add0 = fadd float %a, %b
>>> > +  %add1 = fadd float %c, %add0
>>> > +  %add2 = fadd float %add1, %d
>>> > +  ret float %add2
>>> > +}
>>> > +
>>> > +define float @reassociate_adds3(float %a, float %b, float %c, float %d) {
>>> > +; CHECK-LABEL: reassociate_adds3:
>>> > +; CHECK:       # BB#0:
>>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
>>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT:    retq
>>> > +  %add0 = fadd float %a, %b
>>> > +  %add1 = fadd float %add0, %c
>>> > +  %add2 = fadd float %d, %add1
>>> > +  ret float %add2
>>> > +}
>>> > +
>>> >
>>> >
>>> > _______________________________________________
>>> > llvm-commits mailing list
>>> > llvm-commits at cs.uiuc.edu <>
>>> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits <http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits>
>>> 
>> 
>> 
> 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/98d7ab17/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bugpointed-x86.ll
Type: application/octet-stream
Size: 3004 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/98d7ab17/attachment.obj>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/98d7ab17/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spill.ll
Type: application/octet-stream
Size: 648873 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/98d7ab17/attachment-0001.obj>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/98d7ab17/attachment-0002.html>