[llvm] r211771 - [X86] Improve the selection of SSE3/AVX addsub instructions.

Andrea Di Biagio andrea.dibiagio at gmail.com
Wed Jul 23 03:52:33 PDT 2014


Hi Chandler,

On Wed, Jul 23, 2014 at 7:18 AM, Chandler Carruth <chandlerc at google.com> wrote:
> Sorry for arriving late to the party here, but I've been hacking on the DAG
> combiner and it has uncovered what is (I think) a fundamental flaw in the
> current approach here.
>
> You are creating a target DAG combine that will actually reverse (and be
> reversed by) the target-independent canonicalizing DAG combine. This
> essentially creates the potential for a combine-cycle if the newly adjusted
> nodes ever get reliably added to the combiner's worklist. Today, this
> doesn't happen because we don't do a good job of rigorously getting newly
> formed nodes onto the worklist. My work is changing that and exposing this
> issue, but it was bound to happen eventually. Essentially, I don't think you
> can do this with combining and forming a target-specific canonical form just
> to trigger an instruction selector pattern.
>
> I also don't think that's the right basic approach. Instead, this should be
> implemented during the lowering of these patterns. I could imagine doing
> this as part of custom lowering of addsub nodes, or as part of lowering
> shuffles which consume addsub nodes to bypass actual shuffle instructions by
> reversing the math operation. I like the latter best I think.

I will revert this change for now so that you get un-blocked.
Your idea of moving the addsub-matching logic into the lowering of
shuffles makes sense to me, and I will try to work in that
direction.

>
> Can you switch to such an approach promptly? I'm testing my DAG combine
> improvements locally without this commit, but want to get un-blocked...
>

If that is ok, I will revert this change now. As soon as I have what
I think is a reasonable fix for this, I will send a patch for review.

>
> Also (and much more long term), I wonder if there is a more fundamental
> problem with how you're approaching addsub matching: it will *only* work
> when the user "unshuffles" the result somehow. I think this is extremely
> limiting. Really, we should *form* shuffles of add and sub operands such
> that we can use the addsub instructions. Then, when the user has the
> shuffles in their code directly (or when one of our vectorizers specifically
> forms them) the shuffles will combine away as redundant. Does that make
> sense?

I think it makes sense. Ideally we should mostly rely on the
vectorizers when it comes to forming new shuffles.
Only recently has the SLP vectorizer become able to identify some
horizontal add/sub patterns and form new shuffles for them (so-called
"alternate shuffles").
My current plan is to work out how to improve/extend that logic so
that we can identify most of the add/sub patterns before we reach the
backend. Once that is done, most of the logic that matches addsub in
X86ISelLowering.cpp can be safely reverted.
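
As a hypothetical illustration (not a test case from the tree), the
"alternate" add/sub pattern has roughly the scalar shape below; if the
SLP vectorizer recognizes it, the four scalar operations would ideally
become one fsub, one fadd and a shufflevector, which the backend can
then select as a single ADDSUBPS:

  // Even lanes are subtracted, odd lanes are added (hypothetical example).
  void addsub4(float *r, const float *a, const float *b) {
    r[0] = a[0] - b[0];  // even lane: subtract
    r[1] = a[1] + b[1];  // odd lane:  add
    r[2] = a[2] - b[2];
    r[3] = a[3] + b[3];
  }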

Cheers,
Andrea

>
>
> On Thu, Jun 26, 2014 at 3:45 AM, Andrea Di Biagio
> <Andrea_DiBiagio at sn.scee.net> wrote:
>>
>> Author: adibiagio
>> Date: Thu Jun 26 05:45:21 2014
>> New Revision: 211771
>>
>> URL: http://llvm.org/viewvc/llvm-project?rev=211771&view=rev
>> Log:
>> [X86] Improve the selection of SSE3/AVX addsub instructions.
>>
>> This patch teaches the backend how to canonicalize a vector shuffle
>> according to the rule:
>>
>>  - (shuffle (FADD A, B), (FSUB A, B), Mask) ->
>>        (shuffle (FSUB A, -B), (FADD A, -B), Mask)
>>
>> Where 'Mask' is:
>>   <0,5,2,7>            ;; for v4f32 and v4f64 shuffles.
>>   <0,3>                ;; for v2f64 shuffles.
>>   <0,9,2,11,4,13,6,15> ;; for v8f32 shuffles.
>>
>> In general, ISel only knows how to pattern-match a canonical
>> 'fadd + fsub + blendi' dag node sequence into an ADDSUB instruction.
>>
>> This new rule allows us to convert a non-canonical dag sequence into a
>> canonical one that will be matched as a single ADDSUB at the ISel stage.
>>
>> The idea of converting a non-canonical ADDSUB into a canonical one by
>> swapping the first two operands of the shuffle, and then negating the
>> second operand of the FADD and FSUB, was originally proposed by Hal
>> Finkel.
>>
>>
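To see why this rewrite is value-preserving: with the masks above,
even lanes are taken from the first shuffle operand and odd lanes
from the second, FSUB A, -B computes A + B, and FADD A, -B computes
A - B. A small scalar check of the two forms (illustrative values
only, not part of the patch):

  #include <cassert>

  int main() {
    // Model one even lane and one odd lane with scalars.
    double A0 = 1.5, B0 = 2.25;   // even lane, taken from the first operand
    double A1 = -3.0, B1 = 0.75;  // odd lane, taken from the second operand

    // Original form: (shuffle (FADD A, B), (FSUB A, B), Mask)
    double even_before = A0 + B0; // lane 0 comes from the FADD
    double odd_before  = A1 - B1; // lane 1 comes from the FSUB

    // Canonical form: (shuffle (FSUB A, -B), (FADD A, -B), Mask)
    double even_after = A0 - (-B0); // lane 0 now comes from the FSUB
    double odd_after  = A1 + (-B1); // lane 1 now comes from the FADD

    assert(even_before == even_after && odd_before == odd_after);
    return 0;
  }
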
>> Modified:
>>     llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
>>     llvm/trunk/test/CodeGen/X86/sse3-avx-addsub.ll
>>
>> Modified: llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
>> URL:
>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Target/X86/X86ISelLowering.cpp?rev=211771&r1=211770&r2=211771&view=diff
>>
>> ==============================================================================
>> --- llvm/trunk/lib/Target/X86/X86ISelLowering.cpp (original)
>> +++ llvm/trunk/lib/Target/X86/X86ISelLowering.cpp Thu Jun 26 05:45:21 2014
>> @@ -18014,6 +18014,49 @@ static SDValue PerformShuffleCombine(SDN
>>    SDValue N1 = N->getOperand(1);
>>    EVT VT = N->getValueType(0);
>>
>> +  // Canonicalize shuffles that perform 'addsub' on packed float vectors
>> +  // according to the rule:
>> +  //  (shuffle (FADD A, B), (FSUB A, B), Mask) ->
>> +  //  (shuffle (FSUB A, -B), (FADD A, -B), Mask)
>> +  //
>> +  // Where 'Mask' is:
>> +  //  <0,5,2,7>             -- for v4f32 and v4f64 shuffles;
>> +  //  <0,3>                 -- for v2f64 shuffles;
>> +  //  <0,9,2,11,4,13,6,15>  -- for v8f32 shuffles.
>> +  //
>> +  // This helps pattern-matching more SSE3/AVX ADDSUB instructions
>> +  // during ISel stage.
>> +  if (N->getOpcode() == ISD::VECTOR_SHUFFLE &&
>> +      ((Subtarget->hasSSE3() && (VT == MVT::v4f32 || VT == MVT::v2f64)) ||
>> +       (Subtarget->hasAVX() && (VT == MVT::v8f32 || VT == MVT::v4f64))) &&
>> +      N0->getOpcode() == ISD::FADD && N1->getOpcode() == ISD::FSUB &&
>> +      // Operands to the FADD and FSUB must be the same.
>> +      ((N0->getOperand(0) == N1->getOperand(0) &&
>> +        N0->getOperand(1) == N1->getOperand(1)) ||
>> +       // FADD is commutable. See if by commuting the operands of the FADD
>> +       // we would still be able to match the operands of the FSUB dag node.
>> +       (N0->getOperand(1) == N1->getOperand(0) &&
>> +        N0->getOperand(0) == N1->getOperand(1))) &&
>> +      N0->getOperand(0)->getOpcode() != ISD::UNDEF &&
>> +      N0->getOperand(1)->getOpcode() != ISD::UNDEF) {
>> +
>> +    ShuffleVectorSDNode *SV = cast<ShuffleVectorSDNode>(N);
>> +    unsigned NumElts = VT.getVectorNumElements();
>> +    ArrayRef<int> Mask = SV->getMask();
>> +    bool CanFold = true;
>> +
>> +    for (unsigned i = 0, e = NumElts; i != e && CanFold; ++i)
>> +      CanFold = Mask[i] == (i & 1) ? i + NumElts : i;
>> +
>> +    if (CanFold) {
>> +      SDValue Op0 = N1->getOperand(0);
>> +      SDValue Op1 = DAG.getNode(ISD::FNEG, dl, VT, N1->getOperand(1));
>> +      SDValue Sub = DAG.getNode(ISD::FSUB, dl, VT, Op0, Op1);
>> +      SDValue Add = DAG.getNode(ISD::FADD, dl, VT, Op0, Op1);
>> +      return DAG.getVectorShuffle(VT, dl, Sub, Add, Mask);
>> +    }
>> +  }
>> +
>>    // Don't create instructions with illegal types after legalize types has run.
>>    const TargetLowering &TLI = DAG.getTargetLoweringInfo();
>>    if (!DCI.isBeforeLegalize() && !TLI.isTypeLegal(VT.getVectorElementType()))
>>
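For reference, the mask condition the combine is meant to enforce is
that even elements come from the first shuffle operand and odd
elements from the second, i.e. Mask[i] == ((i & 1) ? i + NumElts : i)
for every i. A small standalone check of that predicate against the
masks listed in the log message (illustrative only, not part of the
patch):

  #include <cassert>
  #include <vector>

  // True when even lanes select from operand 0 (Mask[i] == i) and odd
  // lanes select from operand 1 (Mask[i] == i + NumElts).
  static bool isAddSubMask(const std::vector<int> &Mask) {
    int NumElts = static_cast<int>(Mask.size());
    for (int i = 0; i != NumElts; ++i)
      if (Mask[i] != ((i & 1) ? i + NumElts : i))
        return false;
    return true;
  }

  int main() {
    assert(isAddSubMask({0, 3}));                      // v2f64
    assert(isAddSubMask({0, 5, 2, 7}));                // v4f32 and v4f64
    assert(isAddSubMask({0, 9, 2, 11, 4, 13, 6, 15})); // v8f32
    assert(!isAddSubMask({0, 1, 2, 3}));               // identity mask: rejected
    return 0;
  }
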
>> Modified: llvm/trunk/test/CodeGen/X86/sse3-avx-addsub.ll
>> URL:
>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/sse3-avx-addsub.ll?rev=211771&r1=211770&r2=211771&view=diff
>>
>> ==============================================================================
>> --- llvm/trunk/test/CodeGen/X86/sse3-avx-addsub.ll (original)
>> +++ llvm/trunk/test/CodeGen/X86/sse3-avx-addsub.ll Thu Jun 26 05:45:21 2014
>> @@ -3,7 +3,7 @@
>>
>>  ; Test ADDSUB ISel patterns.
>>
>> -; All the functions below are obtained from the following source:
>> +; Functions below are obtained from the following source:
>>  ;
>>  ; typedef double double2 __attribute__((ext_vector_type(2)));
>>  ; typedef double double4 __attribute__((ext_vector_type(4)));
>> @@ -19,7 +19,7 @@
>>  ; float8 test2(float8 A, float8 B) {
>>  ;   float8 X = A - B;
>>  ;   float8 Y = A + B;
>> -;   return (float8){X[0], Y[1], X[2], Y[3], X[4], Y[5], X[6], [7]};
>> +;   return (float8){X[0], Y[1], X[2], Y[3], X[4], Y[5], X[6], Y[7]};
>>  ; }
>>  ;
>>  ; double4 test3(double4 A, double4 B) {
>> @@ -141,3 +141,156 @@ define <2 x double> @test4b(<2 x double>
>>  ; AVX: vaddsubpd
>>  ; CHECK-NEXT: ret
>>
>> +; Functions below are obtained from the following source:
>> +;
>> +; float4 test1(float4 A, float4 B) {
>> +;   float4 X = A + B;
>> +;   float4 Y = A - B;
>> +;   return (float4){X[0], Y[1], X[2], Y[3]};
>> +; }
>> +;
>> +; float8 test2(float8 A, float8 B) {
>> +;   float8 X = A + B;
>> +;   float8 Y = A - B;
>> +;   return (float8){X[0], Y[1], X[2], Y[3], X[4], Y[5], X[6], Y[7]};
>> +; }
>> +;
>> +; double4 test3(double4 A, double4 B) {
>> +;   double4 X = A + B;
>> +;   double4 Y = A - B;
>> +;   return (double4){X[0], Y[1], X[2], Y[3]};
>> +; }
>> +;
>> +; double2 test4(double2 A, double2 B) {
>> +;   double2 X = A + B;
>> +;   double2 Y = A - B;
>> +;   return (double2){X[0], Y[1]};
>> +; }
>> +
>> +define <4 x float> @test5(<4 x float> %A, <4 x float> %B) {
>> +  %sub = fsub <4 x float> %A, %B
>> +  %add = fadd <4 x float> %A, %B
>> +  %vecinit6 = shufflevector <4 x float> %add, <4 x float> %sub, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>> +  ret <4 x float> %vecinit6
>> +}
>> +; CHECK-LABEL: test5
>> +; SSE: xorps
>> +; SSE-NEXT: addsubps
>> +; AVX: vxorps
>> +; AVX-NEXT: vaddsubps
>> +; CHECK: ret
>> +
>> +
>> +define <8 x float> @test6(<8 x float> %A, <8 x float> %B) {
>> +  %sub = fsub <8 x float> %A, %B
>> +  %add = fadd <8 x float> %A, %B
>> +  %vecinit14 = shufflevector <8 x float> %add, <8 x float> %sub, <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
>> +  ret <8 x float> %vecinit14
>> +}
>> +; CHECK-LABEL: test6
>> +; SSE: xorps
>> +; SSE-NEXT: addsubps
>> +; SSE: xorps
>> +; SSE-NEXT: addsubps
>> +; AVX: vxorps
>> +; AVX-NEXT: vaddsubps
>> +; AVX-NOT: vxorps
>> +; AVX-NOT: vaddsubps
>> +; CHECK: ret
>> +
>> +
>> +define <4 x double> @test7(<4 x double> %A, <4 x double> %B) {
>> +  %sub = fsub <4 x double> %A, %B
>> +  %add = fadd <4 x double> %A, %B
>> +  %vecinit6 = shufflevector <4 x double> %add, <4 x double> %sub, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>> +  ret <4 x double> %vecinit6
>> +}
>> +; CHECK-LABEL: test7
>> +; SSE: xorpd
>> +; SSE-NEXT: addsubpd
>> +; SSE: xorpd
>> +; SSE-NEXT: addsubpd
>> +; AVX: vxorpd
>> +; AVX-NEXT: vaddsubpd
>> +; AVX-NOT: vxorpd
>> +; AVX-NOT: vaddsubpd
>> +; CHECK: ret
>> +
>> +
>> +define <2 x double> @test8(<2 x double> %A, <2 x double> %B) #0 {
>> +  %add = fadd <2 x double> %A, %B
>> +  %sub = fsub <2 x double> %A, %B
>> +  %vecinit2 = shufflevector <2 x double> %add, <2 x double> %sub, <2 x i32> <i32 0, i32 3>
>> +  ret <2 x double> %vecinit2
>> +}
>> +; CHECK-LABEL: test8
>> +; SSE: xorpd
>> +; SSE-NEXT: addsubpd
>> +; AVX: vxorpd
>> +; AVX-NEXT: vaddsubpd
>> +; CHECK: ret
>> +
>> +
>> +define <4 x float> @test5b(<4 x float> %A, <4 x float> %B) {
>> +  %sub = fsub <4 x float> %A, %B
>> +  %add = fadd <4 x float> %B, %A
>> +  %vecinit6 = shufflevector <4 x float> %add, <4 x float> %sub, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>> +  ret <4 x float> %vecinit6
>> +}
>> +; CHECK-LABEL: test5
>> +; SSE: xorps
>> +; SSE-NEXT: addsubps
>> +; AVX: vxorps
>> +; AVX-NEXT: vaddsubps
>> +; CHECK: ret
>> +
>> +
>> +define <8 x float> @test6b(<8 x float> %A, <8 x float> %B) {
>> +  %sub = fsub <8 x float> %A, %B
>> +  %add = fadd <8 x float> %B, %A
>> +  %vecinit14 = shufflevector <8 x float> %add, <8 x float> %sub, <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
>> +  ret <8 x float> %vecinit14
>> +}
>> +; CHECK-LABEL: test6
>> +; SSE: xorps
>> +; SSE-NEXT: addsubps
>> +; SSE: xorps
>> +; SSE-NEXT: addsubps
>> +; AVX: vxorps
>> +; AVX-NEXT: vaddsubps
>> +; AVX-NOT: vxorps
>> +; AVX-NOT: vaddsubps
>> +; CHECK: ret
>> +
>> +
>> +define <4 x double> @test7b(<4 x double> %A, <4 x double> %B) {
>> +  %sub = fsub <4 x double> %A, %B
>> +  %add = fadd <4 x double> %B, %A
>> +  %vecinit6 = shufflevector <4 x double> %add, <4 x double> %sub, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>> +  ret <4 x double> %vecinit6
>> +}
>> +; CHECK-LABEL: test7
>> +; SSE: xorpd
>> +; SSE-NEXT: addsubpd
>> +; SSE: xorpd
>> +; SSE-NEXT: addsubpd
>> +; AVX: vxorpd
>> +; AVX-NEXT: vaddsubpd
>> +; AVX-NOT: vxorpd
>> +; AVX-NOT: vaddsubpd
>> +; CHECK: ret
>> +
>> +
>> +define <2 x double> @test8b(<2 x double> %A, <2 x double> %B) #0 {
>> +  %add = fadd <2 x double> %B, %A
>> +  %sub = fsub <2 x double> %A, %B
>> +  %vecinit2 = shufflevector <2 x double> %add, <2 x double> %sub, <2 x i32> <i32 0, i32 3>
>> +  ret <2 x double> %vecinit2
>> +}
>> +; CHECK-LABEL: test8
>> +; SSE: xorpd
>> +; SSE-NEXT: addsubpd
>> +; AVX: vxorpd
>> +; AVX-NEXT: vaddsubpd
>> +; CHECK: ret
>> +
>>
>>
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>
>
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>


