[llvm] r236031 - transform fadd chains to increase parallelism

Sanjay Patel spatel at rotateright.com
Thu Apr 30 08:54:33 PDT 2015


> I don’t know, you may see other reasons why it can even be worse in
InstCombine :)

I was imagining that -instcombine or -reassociate would go all out and find
the perfect balanced tree. This was a suggestion in:
https://llvm.org/bugs/show_bug.cgi?id=17305#c1

Even I saw the potential explosion from that and mentioned it in the review
and commit msg. :)

>  I’m still using bugpoint on the test that goes from 6338 to 18559
instructions locally when applying your change (with D9363).

Awesome! I'm now thinking that we can't do this until post-DAG...as a
machine-level peephole? There's just too much risk of spilling if we do it
sooner.


On Thu, Apr 30, 2015 at 9:29 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:

>
> On Apr 30, 2015, at 8:19 AM, Sanjay Patel <spatel at rotateright.com> wrote:
>
> > Also, isn’t there the exact same register pressure concern in the DAG as
> well? Especially considering the fact that this is target-independent?
>
> Now that the patch is reverted due to a spilling catastrophe, this seems
> obvious. :)
>
>
> I don’t know, you may see other reasons why it can even be worse in
> InstCombine :)
>
>
> I also seem to have completely misjudged the potential compile-time hit. I
> did look at that using test-suite on x86, so I'd love to know if D9363 made
> any difference for your out-of-tree target and/or tests.
>
>
> Well it solves the infinite loop we had locally at least :)
>
> I’m still using bugpoint on the test that goes from 6338 to 18559
> instructions locally when applying your change (with D9363).
>
>
>
> > By the way the name of the function is very generic (ReassociateBinops),
> do you plan to extend it to do perform transformations?
> > Right now it does not carry what it does.
>
> I had a vague feeling that someone, someday would extend it in some
> interesting way, but the truth is really just that I don't know what to
> name this:
>
> //  (op N0: (op N00: (op z, w), N01: y), N1: x) ->
> //  (op N00: (op z, w), (op N1: x, N01: y))
>
>
> I don’t have a great suggestion.
>
>> Mehdi
>
>
>
>
>
>
>
> On Wed, Apr 29, 2015 at 10:03 PM, Mehdi Amini <mehdi.amini at apple.com>
> wrote:
>
>> Hi Sanjay,
>>
>> Thanks for your detailed answer.
>>
>>
>> On Apr 29, 2015, at 8:03 PM, Sanjay Patel <spatel at rotateright.com> wrote:
>>
>> Hi Mehdi,
>>
>> Let me answer the last question first: why not instcombine?
>> Because I noticed this pattern was created by another DAGCombine in:
>> http://reviews.llvm.org/D8941
>>
>> So although doing the fold here does not preclude an instcombine, doing
>> an instcombine alone would not be sufficient.
>>
>>
>> Good point!
>>
>>
>> I'm also not sure that the transform should be done that early because
>> it's not a completely free transform; I think there is a potential register
>> pressure concern.
>>
>>
>>
>> How does the fact that we do it early would increase the register
>> pressure?
>> Also, isn’t there the exact same register pressure concern in the DAG as
>> well? Especially considering the fact that this is target-independent?
>>
>>
>> I agree that the fold should be applied more generally. Please see the
>> follow up patch and post commit thread for r236171. I welcome any advice on
>> how to do this better.
>>
>>
>> I gave a try in http://reviews.llvm.org/D9363 ; let me know what you
>> think.
>> By the way the name of the function is very generic (ReassociateBinops),
>> do you plan to extend it to do perform transformations? Right now it does
>> not carry what it does.
>>
>>
>>
>> Regarding the canonicalization, I wasn't aware of any downside to
>> commuting the operands, so yes, I added it to not have to duplicate the
>> fold logic for all permutations of the operand order. I was also
>> inf-looping in an early draft of the patch, and canonicalization was my
>> solution to avoid that without multiplying the if checks. If these are
>> not valid reasons for the canonicalization, I can fix it.
>>
>>
>> This is introducing a strong canonicalization which is not very
>> constrained. No target would be able to canonicalize on more specific
>> conditions.
>> I don’t know people use to judge if a canonicalization is justified or
>> not, I tend to at least look at a “cost/benefit” analysis:
>> - how does this enable/simplify other transformations?
>> - how does this constrains other transformations (infinite loop problem…)
>>
>> The benefits are unclear to me right now, but that does not mean they
>> don’t exist.
>>
>>
>>>> Mehdi
>>
>>
>>
>>
>>
>> On Wednesday, April 29, 2015, Mehdi Amini <mehdi.amini at apple.com> wrote:
>>
>>> Hello Sanjay,
>>>
>>> In this commit you introduce a new canonicalization in the DAG:
>>>
>>>  (fadd (otherop, fadd)) -> (fadd (fadd, otherop))
>>>
>>> It is not clear to me why this canonicalization is desirable in general
>>> (other than to ease the implementation of your transformation, which does
>>> not seem to justify it alone IMO).
>>>
>>> Also, if such a canonicalization is needed, my feeling is that it
>>> shouldn’t be hidden deeply here, and should probably apply to a bunch of
>>> other binop.
>>>
>>> As a side question: I’m curious why isn’t this been performed in
>>> instcombine?
>>>
>>> Thanks,
>>>
>>> Mehdi
>>>
>>>
>>>
>>> > On Apr 28, 2015, at 2:03 PM, Sanjay Patel <spatel at rotateright.com>
>>> wrote:
>>> >
>>> > Author: spatel
>>> > Date: Tue Apr 28 16:03:22 2015
>>> > New Revision: 236031
>>> >
>>> > URL: http://llvm.org/viewvc/llvm-project?rev=236031&view=rev
>>> > Log:
>>> > transform fadd chains to increase parallelism
>>> >
>>> > This is a compromise: with this simple patch, we should always handle
>>> a chain of exactly 3
>>> > operations optimally, but we're not generating the optimal balanced
>>> binary tree for a longer
>>> > sequence.
>>> >
>>> > In general, this transform will reduce the dependency chain for a
>>> sequence of instructions
>>> > using N operands from a worst case N-1 dependent operations to N/2
>>> dependent operations.
>>> > The optimal balanced binary tree would reduce the chain to log2(N).
>>> >
>>> > The trade-off for not dealing with longer sequences is: (1) we have
>>> less complexity in the
>>> > compiler, (2) we avoid unknown compile-time blowup calculating a
>>> balanced tree, and (3) we
>>> > don't need to worry about the increased register pressure required to
>>> parallelize longer
>>> > sequences. It also seems unlikely that we would ever encounter really
>>> long strings of
>>> > dependent ops like that in the wild, but I'm not sure how to verify
>>> that speculation.
>>> > FWIW, I see no perf difference for test-suite running on btver2
>>> (x86-64) with -ffast-math
>>> > and this patch.
>>> >
>>> > We can extend this patch to cover other associative operations such as
>>> fmul, fmax, fmin,
>>> > integer add, integer mul.
>>> >
>>> > This is a partial fix for:
>>> > https://llvm.org/bugs/show_bug.cgi?id=17305
>>> >
>>> > and if extended:
>>> > https://llvm.org/bugs/show_bug.cgi?id=21768
>>> > https://llvm.org/bugs/show_bug.cgi?id=23116
>>> >
>>> > The issue also came up in:
>>> > http://reviews.llvm.org/D8941
>>> >
>>> > Differential Revision: http://reviews.llvm.org/D9232
>>> >
>>> >
>>> > Modified:
>>> >    llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
>>> >    llvm/trunk/test/CodeGen/X86/fp-fast.ll
>>> >
>>> > Modified: llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
>>> > URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp?rev=236031&r1=236030&r2=236031&view=diff
>>> >
>>> ==============================================================================
>>> > --- llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (original)
>>> > +++ llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp Tue Apr 28
>>> 16:03:22 2015
>>> > @@ -7801,6 +7801,24 @@ SDValue DAGCombiner::visitFADD(SDNode *N
>>> >                            N0.getOperand(0), DAG.getConstantFP(4.0,
>>> DL, VT));
>>> >       }
>>> >     }
>>> > +
>>> > +    // Canonicalize chains of adds to LHS to simplify the following
>>> transform.
>>> > +    if (N0.getOpcode() != ISD::FADD && N1.getOpcode() == ISD::FADD)
>>> > +      return DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N0);
>>> > +
>>> > +    // Convert a chain of 3 dependent operations into 2 independent
>>> operations
>>> > +    // and 1 dependent operation:
>>> > +    //  (fadd N0: (fadd N00: (fadd z, w), N01: y), N1: x) ->
>>> > +    //  (fadd N00: (fadd z, w), (fadd N1: x, N01: y))
>>> > +    if (N0.getOpcode() == ISD::FADD &&  N0.hasOneUse() &&
>>> > +        N1.getOpcode() != ISD::FADD) {
>>> > +      SDValue N00 = N0.getOperand(0);
>>> > +      if (N00.getOpcode() == ISD::FADD) {
>>> > +        SDValue N01 = N0.getOperand(1);
>>> > +        SDValue NewAdd = DAG.getNode(ISD::FADD, SDLoc(N), VT, N1,
>>> N01);
>>> > +        return DAG.getNode(ISD::FADD, SDLoc(N), VT, N00, NewAdd);
>>> > +      }
>>> > +    }
>>> >   } // enable-unsafe-fp-math
>>> >
>>> >   // FADD -> FMA combines:
>>> >
>>> > Modified: llvm/trunk/test/CodeGen/X86/fp-fast.ll
>>> > URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/fp-fast.ll?rev=236031&r1=236030&r2=236031&view=diff
>>> >
>>> ==============================================================================
>>> > --- llvm/trunk/test/CodeGen/X86/fp-fast.ll (original)
>>> > +++ llvm/trunk/test/CodeGen/X86/fp-fast.ll Tue Apr 28 16:03:22 2015
>>> > @@ -113,3 +113,46 @@ define float @test11(float %a) {
>>> >   %t2 = fadd float %a, %t1
>>> >   ret float %t2
>>> > }
>>> > +
>>> > +; Verify that the first two adds are independent; the destination
>>> registers
>>> > +; are used as source registers for the third add.
>>> > +
>>> > +define float @reassociate_adds1(float %a, float %b, float %c, float
>>> %d) {
>>> > +; CHECK-LABEL: reassociate_adds1:
>>> > +; CHECK:       # BB#0:
>>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
>>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT:    retq
>>> > +  %add0 = fadd float %a, %b
>>> > +  %add1 = fadd float %add0, %c
>>> > +  %add2 = fadd float %add1, %d
>>> > +  ret float %add2
>>> > +}
>>> > +
>>> > +define float @reassociate_adds2(float %a, float %b, float %c, float
>>> %d) {
>>> > +; CHECK-LABEL: reassociate_adds2:
>>> > +; CHECK:       # BB#0:
>>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
>>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT:    retq
>>> > +  %add0 = fadd float %a, %b
>>> > +  %add1 = fadd float %c, %add0
>>> > +  %add2 = fadd float %add1, %d
>>> > +  ret float %add2
>>> > +}
>>> > +
>>> > +define float @reassociate_adds3(float %a, float %b, float %c, float
>>> %d) {
>>> > +; CHECK-LABEL: reassociate_adds3:
>>> > +; CHECK:       # BB#0:
>>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT:    vaddss %xmm2, %xmm3, %xmm1
>>> > +; CHECK-NEXT:    vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT:    retq
>>> > +  %add0 = fadd float %a, %b
>>> > +  %add1 = fadd float %add0, %c
>>> > +  %add2 = fadd float %d, %add1
>>> > +  ret float %add2
>>> > +}
>>> > +
>>> >
>>> >
>>> > _______________________________________________
>>> > llvm-commits mailing list
>>> > llvm-commits at cs.uiuc.edu
>>> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>
>>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/31d10401/attachment.html>


More information about the llvm-commits mailing list