[llvm] r236031 - transform fadd chains to increase parallelism
Mehdi Amini
mehdi.amini at apple.com
Thu Apr 30 10:49:11 PDT 2015
Hi Sanjay,
If you’re interested in examples, attached are two .ll files that shows a regression with your transformation in term of instructions count.
Best,
Mehdi
> On Apr 30, 2015, at 8:54 AM, Sanjay Patel <spatel at rotateright.com> wrote:
>
> > I don’t know, you may see other reasons why it can even be worse in InstCombine :)
>
> I was imagining that -instcombine or -reassociate would go all out and find the perfect balanced tree. This was a suggestion in:
> https://llvm.org/bugs/show_bug.cgi?id=17305#c1 <https://llvm.org/bugs/show_bug.cgi?id=17305#c1>
>
> Even I saw the potential explosion from that and mentioned it in the review and commit msg. :)
>
> > I’m still using bugpoint on the test that goes from 6338 to 18559 instructions locally when applying your change (with D9363).
>
> Awesome! I'm now thinking that we can't do this until post-DAG...as a machine-level peephole? There's just too much risk of spilling if we do it sooner.
>
>
> On Thu, Apr 30, 2015 at 9:29 AM, Mehdi Amini <mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>> wrote:
>
>> On Apr 30, 2015, at 8:19 AM, Sanjay Patel <spatel at rotateright.com <mailto:spatel at rotateright.com>> wrote:
>>
>> > Also, isn’t there the exact same register pressure concern in the DAG as well? Especially considering the fact that this is target-independent?
>>
>> Now that the patch is reverted due to a spilling catastrophe, this seems obvious. :)
>
> I don’t know, you may see other reasons why it can even be worse in InstCombine :)
>
>>
>> I also seem to have completely misjudged the potential compile-time hit. I did look at that using test-suite on x86, so I'd love to know if D9363 made any difference for your out-of-tree target and/or tests.
>
> Well it solves the infinite loop we had locally at least :)
>
> I’m still using bugpoint on the test that goes from 6338 to 18559 instructions locally when applying your change (with D9363).
>
>>
>>
>> > By the way the name of the function is very generic (ReassociateBinops), do you plan to extend it to do perform transformations?
>> > Right now it does not carry what it does.
>>
>> I had a vague feeling that someone, someday would extend it in some interesting way, but the truth is really just that I don't know what to name this:
>>
>> // (op N0: (op N00: (op z, w), N01: y), N1: x) ->
>> // (op N00: (op z, w), (op N1: x, N01: y))
>
> I don’t have a great suggestion.
>
> —
> Mehdi
>
>
>>
>>
>>
>>
>>
>> On Wed, Apr 29, 2015 at 10:03 PM, Mehdi Amini <mehdi.amini at apple.com <mailto:mehdi.amini at apple.com>> wrote:
>> Hi Sanjay,
>>
>> Thanks for your detailed answer.
>>
>>
>>> On Apr 29, 2015, at 8:03 PM, Sanjay Patel <spatel at rotateright.com <mailto:spatel at rotateright.com>> wrote:
>>>
>>> Hi Mehdi,
>>>
>>> Let me answer the last question first: why not instcombine?
>>> Because I noticed this pattern was created by another DAGCombine in:
>>> http://reviews.llvm.org/D8941 <http://reviews.llvm.org/D8941>
>>>
>>> So although doing the fold here does not preclude an instcombine, doing an instcombine alone would not be sufficient.
>>
>> Good point!
>>
>>
>>> I'm also not sure that the transform should be done that early because it's not a completely free transform; I think there is a potential register pressure concern.
>>
>>
>> How does the fact that we do it early would increase the register pressure?
>> Also, isn’t there the exact same register pressure concern in the DAG as well? Especially considering the fact that this is target-independent?
>>
>>>
>>> I agree that the fold should be applied more generally. Please see the follow up patch and post commit thread for r236171. I welcome any advice on how to do this better.
>>
>> I gave a try in http://reviews.llvm.org/D9363 <http://reviews.llvm.org/D9363> ; let me know what you think.
>> By the way the name of the function is very generic (ReassociateBinops), do you plan to extend it to do perform transformations? Right now it does not carry what it does.
>>
>>
>>>
>>> Regarding the canonicalization, I wasn't aware of any downside to commuting the operands, so yes, I added it to not have to duplicate the fold logic for all permutations of the operand order. I was also inf-looping in an early draft of the patch, and canonicalization was my solution to avoid that without multiplying the if checks. If these are not valid reasons for the canonicalization, I can fix it.
>>
>> This is introducing a strong canonicalization which is not very constrained. No target would be able to canonicalize on more specific conditions.
>> I don’t know people use to judge if a canonicalization is justified or not, I tend to at least look at a “cost/benefit” analysis:
>> - how does this enable/simplify other transformations?
>> - how does this constrains other transformations (infinite loop problem…)
>>
>> The benefits are unclear to me right now, but that does not mean they don’t exist.
>>
>>
>> —
>> Mehdi
>>
>>
>>
>>>
>>>
>>> On Wednesday, April 29, 2015, Mehdi Amini <mehdi.amini at apple.com <>> wrote:
>>> Hello Sanjay,
>>>
>>> In this commit you introduce a new canonicalization in the DAG:
>>>
>>> (fadd (otherop, fadd)) -> (fadd (fadd, otherop))
>>>
>>> It is not clear to me why this canonicalization is desirable in general (other than to ease the implementation of your transformation, which does not seem to justify it alone IMO).
>>>
>>> Also, if such a canonicalization is needed, my feeling is that it shouldn’t be hidden deeply here, and should probably apply to a bunch of other binop.
>>>
>>> As a side question: I’m curious why isn’t this been performed in instcombine?
>>>
>>> Thanks,
>>>
>>> Mehdi
>>>
>>>
>>>
>>> > On Apr 28, 2015, at 2:03 PM, Sanjay Patel <spatel at rotateright.com <>> wrote:
>>> >
>>> > Author: spatel
>>> > Date: Tue Apr 28 16:03:22 2015
>>> > New Revision: 236031
>>> >
>>> > URL: http://llvm.org/viewvc/llvm-project?rev=236031&view=rev <http://llvm.org/viewvc/llvm-project?rev=236031&view=rev>
>>> > Log:
>>> > transform fadd chains to increase parallelism
>>> >
>>> > This is a compromise: with this simple patch, we should always handle a chain of exactly 3
>>> > operations optimally, but we're not generating the optimal balanced binary tree for a longer
>>> > sequence.
>>> >
>>> > In general, this transform will reduce the dependency chain for a sequence of instructions
>>> > using N operands from a worst case N-1 dependent operations to N/2 dependent operations.
>>> > The optimal balanced binary tree would reduce the chain to log2(N).
>>> >
>>> > The trade-off for not dealing with longer sequences is: (1) we have less complexity in the
>>> > compiler, (2) we avoid unknown compile-time blowup calculating a balanced tree, and (3) we
>>> > don't need to worry about the increased register pressure required to parallelize longer
>>> > sequences. It also seems unlikely that we would ever encounter really long strings of
>>> > dependent ops like that in the wild, but I'm not sure how to verify that speculation.
>>> > FWIW, I see no perf difference for test-suite running on btver2 (x86-64) with -ffast-math
>>> > and this patch.
>>> >
>>> > We can extend this patch to cover other associative operations such as fmul, fmax, fmin,
>>> > integer add, integer mul.
>>> >
>>> > This is a partial fix for:
>>> > https://llvm.org/bugs/show_bug.cgi?id=17305 <https://llvm.org/bugs/show_bug.cgi?id=17305>
>>> >
>>> > and if extended:
>>> > https://llvm.org/bugs/show_bug.cgi?id=21768 <https://llvm.org/bugs/show_bug.cgi?id=21768>
>>> > https://llvm.org/bugs/show_bug.cgi?id=23116 <https://llvm.org/bugs/show_bug.cgi?id=23116>
>>> >
>>> > The issue also came up in:
>>> > http://reviews.llvm.org/D8941 <http://reviews.llvm.org/D8941>
>>> >
>>> > Differential Revision: http://reviews.llvm.org/D9232 <http://reviews.llvm.org/D9232>
>>> >
>>> >
>>> > Modified:
>>> > llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
>>> > llvm/trunk/test/CodeGen/X86/fp-fast.ll
>>> >
>>> > Modified: llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
>>> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp?rev=236031&r1=236030&r2=236031&view=diff <http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp?rev=236031&r1=236030&r2=236031&view=diff>
>>> > ==============================================================================
>>> > --- llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (original)
>>> > +++ llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp Tue Apr 28 16:03:22 2015
>>> > @@ -7801,6 +7801,24 @@ SDValue DAGCombiner::visitFADD(SDNode *N
>>> > N0.getOperand(0), DAG.getConstantFP(4.0, DL, VT));
>>> > }
>>> > }
>>> > +
>>> > + // Canonicalize chains of adds to LHS to simplify the following transform.
>>> > + if (N0.getOpcode() != ISD::FADD && N1.getOpcode() == ISD::FADD)
>>> > + return DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N0);
>>> > +
>>> > + // Convert a chain of 3 dependent operations into 2 independent operations
>>> > + // and 1 dependent operation:
>>> > + // (fadd N0: (fadd N00: (fadd z, w), N01: y), N1: x) ->
>>> > + // (fadd N00: (fadd z, w), (fadd N1: x, N01: y))
>>> > + if (N0.getOpcode() == ISD::FADD && N0.hasOneUse() &&
>>> > + N1.getOpcode() != ISD::FADD) {
>>> > + SDValue N00 = N0.getOperand(0);
>>> > + if (N00.getOpcode() == ISD::FADD) {
>>> > + SDValue N01 = N0.getOperand(1);
>>> > + SDValue NewAdd = DAG.getNode(ISD::FADD, SDLoc(N), VT, N1, N01);
>>> > + return DAG.getNode(ISD::FADD, SDLoc(N), VT, N00, NewAdd);
>>> > + }
>>> > + }
>>> > } // enable-unsafe-fp-math
>>> >
>>> > // FADD -> FMA combines:
>>> >
>>> > Modified: llvm/trunk/test/CodeGen/X86/fp-fast.ll
>>> > URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/fp-fast.ll?rev=236031&r1=236030&r2=236031&view=diff <http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/fp-fast.ll?rev=236031&r1=236030&r2=236031&view=diff>
>>> > ==============================================================================
>>> > --- llvm/trunk/test/CodeGen/X86/fp-fast.ll (original)
>>> > +++ llvm/trunk/test/CodeGen/X86/fp-fast.ll Tue Apr 28 16:03:22 2015
>>> > @@ -113,3 +113,46 @@ define float @test11(float %a) {
>>> > %t2 = fadd float %a, %t1
>>> > ret float %t2
>>> > }
>>> > +
>>> > +; Verify that the first two adds are independent; the destination registers
>>> > +; are used as source registers for the third add.
>>> > +
>>> > +define float @reassociate_adds1(float %a, float %b, float %c, float %d) {
>>> > +; CHECK-LABEL: reassociate_adds1:
>>> > +; CHECK: # BB#0:
>>> > +; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT: vaddss %xmm2, %xmm3, %xmm1
>>> > +; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT: retq
>>> > + %add0 = fadd float %a, %b
>>> > + %add1 = fadd float %add0, %c
>>> > + %add2 = fadd float %add1, %d
>>> > + ret float %add2
>>> > +}
>>> > +
>>> > +define float @reassociate_adds2(float %a, float %b, float %c, float %d) {
>>> > +; CHECK-LABEL: reassociate_adds2:
>>> > +; CHECK: # BB#0:
>>> > +; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT: vaddss %xmm2, %xmm3, %xmm1
>>> > +; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT: retq
>>> > + %add0 = fadd float %a, %b
>>> > + %add1 = fadd float %c, %add0
>>> > + %add2 = fadd float %add1, %d
>>> > + ret float %add2
>>> > +}
>>> > +
>>> > +define float @reassociate_adds3(float %a, float %b, float %c, float %d) {
>>> > +; CHECK-LABEL: reassociate_adds3:
>>> > +; CHECK: # BB#0:
>>> > +; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT: vaddss %xmm2, %xmm3, %xmm1
>>> > +; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
>>> > +; CHECK-NEXT: retq
>>> > + %add0 = fadd float %a, %b
>>> > + %add1 = fadd float %add0, %c
>>> > + %add2 = fadd float %d, %add1
>>> > + ret float %add2
>>> > +}
>>> > +
>>> >
>>> >
>>> > _______________________________________________
>>> > llvm-commits mailing list
>>> > llvm-commits at cs.uiuc.edu <>
>>> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits <http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits>
>>>
>>
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/98d7ab17/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bugpointed-x86.ll
Type: application/octet-stream
Size: 3004 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/98d7ab17/attachment.obj>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/98d7ab17/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spill.ll
Type: application/octet-stream
Size: 648873 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/98d7ab17/attachment-0001.obj>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150430/98d7ab17/attachment-0002.html>
More information about the llvm-commits
mailing list