[PATCH][X86] Improve the lowering of packed shifts by constant build_vector on non-AVX2 machines.

Andrea Di Biagio andrea.dibiagio at gmail.com
Tue Apr 15 11:05:00 PDT 2014


Hi Jim,

Here is a new version of the patch.
I added two more RUN lines to specifically test 'core2' (SSE) and
'core-avx2' (AVX2).

Please let me know if it is OK to submit.

Thanks!
Andrea

On Tue, Apr 15, 2014 at 6:12 PM, Andrea Di Biagio
<andrea.dibiagio at gmail.com> wrote:
> Hi Jim,
> thanks for the review!
>
> On Tue, Apr 15, 2014 at 6:06 PM, Jim Grosbach <grosbach at apple.com> wrote:
>> Hi Andrea,
>>
>> This looks good to me. In fact, it significantly improves code for several of those test cases for both older (core2) and newer (Haswell) processors as well. This is all around goodness. Would you mind adding CHECK lines for SSE and AVX2 codegen to the test cases you’ve added? I’d hate to see us regress on these improvements because we get misled by the tests into thinking they’re only relevant to AVX1.
>
> Sure no problem, I am going to add more RUN lines to check SSE and AVX2 as well.
> I will upload a new patch soon.
>
> Cheers,
> Andrea
>
>>
>> -Jim
>>
>> On Apr 15, 2014, at 8:03 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>
>>> ping x2.
>>>
>>> Thanks,
>>> Andrea Di Biagio
>>>
>>> On Tue, Apr 8, 2014 at 3:38 PM, Andrea Di Biagio
>>> <andrea.dibiagio at gmail.com> wrote:
>>>> ping.
>>>>
>>>>
>>>> On Tue, Apr 1, 2014 at 2:49 AM, Andrea Di Biagio
>>>> <andrea.dibiagio at gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> This patch teaches the backend how to efficiently lower
>>>>> logical/arithmetic packed shift nodes by constant build_vector on
>>>>> non-AVX2 machines.
>>>>>
>>>>> The x86 backend already knows how to efficiently lower a packed shift
>>>>> left by a constant build_vector into a vector multiply (instead of
>>>>> lowering it into a long sequence of scalar shifts).
>>>>> However, nothing is currently done in the case of other shifts by
>>>>> constant build_vector.
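>>>>>
>>>>> For example, a shift left like the following (the function name is just
>>>>> for illustration):
>>>>> ///--
>>>>> define <4 x i32> @shl_by_constants(<4 x i32> %a) {
>>>>>   %shl = shl <4 x i32> %a, <i32 1, i32 2, i32 3, i32 4>
>>>>>   ret <4 x i32> %shl
>>>>> }
>>>>> ///--
>>>>> is rewritten into a multiply by the constant vector <2, 4, 8, 16>, which
>>>>> on AVX should map to a single vpmulld.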
>>>>>
>>>>> This patch teaches the backend how to lower v4i32 and v8i16 shifts
>>>>> according to the following rules:
>>>>>
>>>>> 1.  VSHIFT (v4i32 A), (build_vector <X, Y, Y, Y>) -->
>>>>>       MOVSS (VSHIFT A, (build_vector <Y,Y,Y,Y>)),
>>>>>             (VSHIFT A, (build_vector <X,X,X,X>))
>>>>> 2.  VSHIFT (v4i32 A), (build_vector <X, X, Y, Y>) -->
>>>>>       (bitcast (MOVSD (bitcast (VSHIFT A, (build_vector <Y,Y,Y,Y>)), v2i64),
>>>>>                       (bitcast (VSHIFT A, (build_vector <X,X,X,X>)), v2i64)),
>>>>>        v4i32)
>>>>> 3.  VSHIFT (v8i16 A), (build_vector <X, X, Y, Y, Y, Y, Y, Y>) -->
>>>>>       (bitcast (MOVSS (bitcast (VSHIFT A, (build_vector <Y,Y,Y,Y,Y,Y,Y,Y>)), v4i32),
>>>>>                       (bitcast (VSHIFT A, (build_vector <X,X,X,X,X,X,X,X>)), v4i32)),
>>>>>        v8i16)
>>>>> 4.  VSHIFT (v8i16 A), (build_vector <X, X, X, X, Y, Y, Y, Y>) -->
>>>>>       (bitcast (MOVSD (bitcast (VSHIFT A, (build_vector <Y,Y,Y,Y,Y,Y,Y,Y>)), v2i64),
>>>>>                       (bitcast (VSHIFT A, (build_vector <X,X,X,X,X,X,X,X>)), v2i64)),
>>>>>        v8i16)
>>>>>
>>>>> Basically,
>>>>> instead of scalarizing a vector shift, we try to expand it into a
>>>>> sequence of two shifts by constant splat followed by a MOVSS/MOVSD.
>>>>>
>>>>> The following example:
>>>>> ///--
>>>>> define <8 x i16> @foo(<8 x i16> %a) {
>>>>>  %lshr = lshr <8 x i16> %a, <i16 3, i16 3, i16 2, i16 2, i16 2, i16
>>>>> 2, i16 2, i16 2>
>>>>>  ret <8 x i16> %lshr
>>>>> }
>>>>> ///--
>>>>>
>>>>> Before this patch, for targets with no AVX2 support, the backend
>>>>> always scalarized the logical packed shift right in function @foo into
>>>>> a very long sequence of instructions (8 scalar shifts + 8 inserts + 8
>>>>> extracts).
>>>>> With this patch, the backend produces only three instructions (here is
>>>>> the output with -mcpu=corei7-avx):
>>>>>   vpsrlw $2, %xmm0, %xmm1
>>>>>   vpsrlw $3, %xmm0, %xmm0
>>>>>   vmovss %xmm0, %xmm1, %xmm0
>>>>>   retq
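>>>>>
>>>>> For a v4i32 example that hits the MOVSD pattern (rule 2 above), take the
>>>>> following function (this is the same shift as test4 in the new test; the
>>>>> name is just for illustration):
>>>>> ///--
>>>>> define <4 x i32> @bar(<4 x i32> %a) {
>>>>>   %lshr = lshr <4 x i32> %a, <i32 3, i32 3, i32 2, i32 2>
>>>>>   ret <4 x i32> %lshr
>>>>> }
>>>>> ///--
>>>>> On corei7-avx I would expect this to become two vpsrld (by 2 and by 3)
>>>>> followed by a single vmovsd that blends the two results.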
>>>>>
>>>>> Other examples can be found in the new test test/CodeGen/X86/lower-vec-shift.ll.
>>>>>
>>>>> For now, I decided to cover only the above-mentioned cases.
>>>>> However, future patches could address even more patterns taking
>>>>> advantage of other legal ways to do a blend of two vector shifts.
>>>>>
>>>>> Please let me know if it is OK to submit.
>>>>>
>>>>>
>>>>> Thanks!
>>>>> Andrea Di Biagio
>>>>> SN Systems - Sony Computer Entertainment Group
>>> <patch-lower-vector-shifts.diff>
>>
-------------- next part --------------
Index: lib/Target/X86/X86ISelLowering.cpp
===================================================================
--- lib/Target/X86/X86ISelLowering.cpp	(revision 206305)
+++ lib/Target/X86/X86ISelLowering.cpp	(working copy)
@@ -13359,6 +13359,79 @@
     return DAG.getNode(ISD::MUL, dl, VT, Op, R);
   }
 
+  // If possible, lower this shift as a sequence of two shifts by
+  // constant plus a MOVSS/MOVSD instead of scalarizing it.
+  // Example:
+  //   (v4i32 (srl A, (build_vector < X, Y, Y, Y>)))
+  //
+  // Could be rewritten as:
+  //   (v4i32 (MOVSS (srl A, <Y,Y,Y,Y>), (srl A, <X,X,X,X>)))
+  //
+  // The advantage is that the two shifts from the example would be
+  // lowered as X86ISD::VSRLI nodes. This would be cheaper than scalarizing
+  // the vector shift into four scalar shifts plus four pairs of vector
+  // insert/extract.
+  if ((VT == MVT::v8i16 || VT == MVT::v4i32) &&
+      ISD::isBuildVectorOfConstantSDNodes(Amt.getNode())) {
+    unsigned TargetOpcode = X86ISD::MOVSS;
+    bool CanBeSimplified;
+    // The splat value for the first packed shift (the 'X' from the example).
+    SDValue Amt1 = Amt->getOperand(0);
+    // The splat value for the second packed shift (the 'Y' from the example).
+    SDValue Amt2 = (VT == MVT::v4i32) ? Amt->getOperand(1) :
+                                        Amt->getOperand(2);
+
+    // See if it is possible to replace this node with a sequence of
+    // two shifts followed by a MOVSS/MOVSD.
+    if (VT == MVT::v4i32) {
+      // Check if it is legal to use a MOVSS.
+      CanBeSimplified = Amt2 == Amt->getOperand(2) &&
+                        Amt2 == Amt->getOperand(3);
+      if (!CanBeSimplified) {
+        // Otherwise, check if we can still simplify this node using a MOVSD.
+        CanBeSimplified = Amt1 == Amt->getOperand(1) &&
+                          Amt->getOperand(2) == Amt->getOperand(3);
+        TargetOpcode = X86ISD::MOVSD;
+        Amt2 = Amt->getOperand(2);
+      }
+    } else {
+      // Do similar checks for the case where the machine value type
+      // is MVT::v8i16.
+      CanBeSimplified = Amt1 == Amt->getOperand(1);
+      for (unsigned i=3; i != 8 && CanBeSimplified; ++i)
+        CanBeSimplified = Amt2 == Amt->getOperand(i);
+
+      if (!CanBeSimplified) {
+        TargetOpcode = X86ISD::MOVSD;
+        CanBeSimplified = true;
+        Amt2 = Amt->getOperand(4);
+        for (unsigned i=0; i != 4 && CanBeSimplified; ++i)
+          CanBeSimplified = Amt1 == Amt->getOperand(i);
+        for (unsigned j=4; j != 8 && CanBeSimplified; ++j)
+          CanBeSimplified = Amt2 == Amt->getOperand(j);
+      }
+    }
+
+    if (CanBeSimplified && isa<ConstantSDNode>(Amt1) &&
+        isa<ConstantSDNode>(Amt2)) {
+      // Replace this node with two shifts followed by a MOVSS/MOVSD.
+      EVT CastVT = MVT::v4i32;
+      SDValue Splat1 = 
+        DAG.getConstant(cast<ConstantSDNode>(Amt1)->getAPIntValue(), VT);
+      SDValue Shift1 = DAG.getNode(Op->getOpcode(), dl, VT, R, Splat1);
+      SDValue Splat2 = 
+        DAG.getConstant(cast<ConstantSDNode>(Amt2)->getAPIntValue(), VT);
+      SDValue Shift2 = DAG.getNode(Op->getOpcode(), dl, VT, R, Splat2);
+      if (TargetOpcode == X86ISD::MOVSD)
+        CastVT = MVT::v2i64;
+      SDValue BitCast1 = DAG.getNode(ISD::BITCAST, dl, CastVT, Shift1);
+      SDValue BitCast2 = DAG.getNode(ISD::BITCAST, dl, CastVT, Shift2);
+      SDValue Result = getTargetShuffleNode(TargetOpcode, dl, CastVT, BitCast2,
+                                            BitCast1, DAG);
+      return DAG.getNode(ISD::BITCAST, dl, VT, Result);
+    }
+  }
+
   if (VT == MVT::v16i8 && Op->getOpcode() == ISD::SHL) {
     assert(Subtarget->hasSSE2() && "Need SSE2 for pslli/pcmpeq.");
 
Index: test/CodeGen/X86/lower-vec-shift.ll
===================================================================
--- test/CodeGen/X86/lower-vec-shift.ll	(revision 0)
+++ test/CodeGen/X86/lower-vec-shift.ll	(working copy)
@@ -0,0 +1,137 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7-avx | FileCheck %s --check-prefix=CHECK --check-prefix=AVX
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=core2 | FileCheck %s --check-prefix=CHECK --check-prefix=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=core-avx2 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX2
+
+
+; Verify that the following shifts are lowered into a sequence of two shifts plus
+; a blend. On pre-AVX2 targets, instead of scalarizing a logical/arithmetic packed
+; shift right by a constant build_vector, the backend should always try to emit a
+; simpler sequence of two shifts plus a blend when possible.
+
+define <8 x i16> @test1(<8 x i16> %a) {
+  %lshr = lshr <8 x i16> %a, <i16 3, i16 3, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>
+  ret <8 x i16> %lshr
+}
+; CHECK-LABEL: test1
+; SSE: psrlw
+; SSE-NEXT: psrlw
+; SSE-NEXT: movss
+; AVX: vpsrlw
+; AVX-NEXT: vpsrlw
+; AVX-NEXT: vmovss
+; AVX2: vpsrlw
+; AVX2-NEXT: vpsrlw
+; AVX2-NEXT: vmovss
+; CHECK: ret
+
+
+define <8 x i16> @test2(<8 x i16> %a) {
+  %lshr = lshr <8 x i16> %a, <i16 3, i16 3, i16 3, i16 3, i16 2, i16 2, i16 2, i16 2>
+  ret <8 x i16> %lshr
+}
+; CHECK-LABEL: test2
+; SSE: psrlw
+; SSE-NEXT: psrlw
+; SSE-NEXT: movsd
+; AVX: vpsrlw
+; AVX-NEXT: vpsrlw
+; AVX-NEXT: vmovsd
+; AVX2: vpsrlw
+; AVX2-NEXT: vpsrlw
+; AVX2-NEXT: vmovsd
+; CHECK: ret
+
+
+define <4 x i32> @test3(<4 x i32> %a) {
+  %lshr = lshr <4 x i32> %a, <i32 3, i32 2, i32 2, i32 2>
+  ret <4 x i32> %lshr
+}
+; CHECK-LABEL: test3
+; SSE: psrld
+; SSE-NEXT: psrld
+; SSE-NEXT: movss
+; AVX: vpsrld
+; AVX-NEXT: vpsrld
+; AVX-NEXT: vmovss
+; AVX2: vpsrlvd
+; CHECK: ret
+
+
+define <4 x i32> @test4(<4 x i32> %a) {
+  %lshr = lshr <4 x i32> %a, <i32 3, i32 3, i32 2, i32 2>
+  ret <4 x i32> %lshr
+}
+; CHECK-LABEL: test4
+; SSE: psrld
+; SSE-NEXT: psrld
+; SSE-NEXT: movsd
+; AVX: vpsrld
+; AVX-NEXT: vpsrld
+; AVX-NEXT: vmovsd
+; AVX2: vpsrlvd
+; CHECK: ret
+
+
+define <8 x i16> @test5(<8 x i16> %a) {
+  %lshr = ashr <8 x i16> %a, <i16 3, i16 3, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>
+  ret <8 x i16> %lshr
+}
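+; The checks below assume test5 is lowered like test1, i.e. two psraw plus a movss.
+; CHECK-LABEL: test5
+; SSE: psraw
+; SSE-NEXT: psraw
+; SSE-NEXT: movss
+; AVX: vpsraw
+; AVX-NEXT: vpsraw
+; AVX-NEXT: vmovss
+; AVX2: vpsraw
+; AVX2-NEXT: vpsraw
+; AVX2-NEXT: vmovss
+; CHECK: ret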
+
+define <8 x i16> @test6(<8 x i16> %a) {
+  %lshr = ashr <8 x i16> %a, <i16 3, i16 3, i16 3, i16 3, i16 2, i16 2, i16 2, i16 2>
+  ret <8 x i16> %lshr
+}
+; CHECK-LABEL: test6
+; SSE: psraw
+; SSE-NEXT: psraw
+; SSE-NEXT: movsd
+; AVX: vpsraw
+; AVX-NEXT: vpsraw
+; AVX-NEXT: vmovsd
+; AVX2: vpsraw
+; AVX2-NEXT: vpsraw
+; AVX2-NEXT: vmovsd
+; CHECK: ret
+
+
+define <4 x i32> @test7(<4 x i32> %a) {
+  %lshr = ashr <4 x i32> %a, <i32 3, i32 2, i32 2, i32 2>
+  ret <4 x i32> %lshr
+}
+; CHECK-LABEL: test7
+; SSE: psrad
+; SSE-NEXT: psrad
+; SSE-NEXT: movss
+; AVX: vpsrad
+; AVX-NEXT: vpsrad
+; AVX-NEXT: vmovss
+; AVX2: vpsravd
+; CHECK: ret
+
+
+define <4 x i32> @test8(<4 x i32> %a) {
+  %lshr = ashr <4 x i32> %a, <i32 3, i32 3, i32 2, i32 2>
+  ret <4 x i32> %lshr
+}
+; CHECK-LABEL: test8
+; SSE: psrad
+; SSE-NEXT: psrad
+; SSE-NEXT: movsd
+; AVX: vpsrad
+; AVX-NEXT: vpsrad
+; AVX-NEXT: vmovsd
+; AVX2: vpsravd
+; CHECK: ret
+

