[PATCH][X86] Teach the backend how to lower vector shift left into multiply rather than scalarizing it.

Andrea Di Biagio andrea.dibiagio at gmail.com
Fri Feb 7 12:51:40 PST 2014


Hi Jim and Nadav,

here is a new version of the patch.

The cost model already handles most of the cases where the second
operand of a SHL is of kind OK_UniformConstantValue.
As far as I can see, the only missing rule was for the case of a
v16i16 SHL on AVX2.
I added a rule in X86TTI::getArithmeticInstrCost to update the cost
for that specific case.
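
For illustration only (this is a usage sketch, not part of the patch),
here is how a client such as the vectorizers would observe the new cost,
assuming the second operand is reported as OK_UniformConstantValue:

  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  static unsigned getV16I16ShlCost(const TargetTransformInfo &TTI,
                                   LLVMContext &Ctx) {
    Type *VecTy = VectorType::get(Type::getInt16Ty(Ctx), 16);
    // With the rule added in X86TTI::getArithmeticInstrCost, an AVX2 target
    // returns LT.first (1 for a legal v16i16), matching the single vpmullw
    // that we now emit instead of a scalarized sequence.
    return TTI.getArithmeticInstrCost(
        Instruction::Shl, VecTy,
        TargetTransformInfo::OK_AnyValue,               // first operand
        TargetTransformInfo::OK_UniformConstantValue);  // second operand
  }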

One of the problems I encountered is that the function 'getOperandInfo'
in CostModel.cpp only ever returns OK_AnyValue or OK_UniformConstantValue
(I didn't find any case where it returns OK_UniformValue).
All the shifts optimized by my patch have a second operand that is a
ConstantDataVector but not a splat; therefore, getOperandInfo will
always return OK_AnyValue for them.
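
For reference, getOperandInfo currently looks roughly like this
(paraphrased from CostModel.cpp, so treat it as a sketch rather than an
exact copy); a constant build_vector that is not a splat falls through
and is reported as OK_AnyValue:

  static TargetTransformInfo::OperandValueKind getOperandInfo(Value *V) {
    TargetTransformInfo::OperandValueKind OpInfo =
      TargetTransformInfo::OK_AnyValue;

    // Only a splat of a constant is recognized; a ConstantDataVector with
    // different lane values never becomes OK_UniformConstantValue.
    if (ConstantDataVector *CDV = dyn_cast<ConstantDataVector>(V))
      if (CDV->getSplatValue() != 0)
        OpInfo = TargetTransformInfo::OK_UniformConstantValue;

    return OpInfo;
  }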

In conclusion, I am not sure whether I have modified that part
correctly. Please let me know if you had something else in mind and,
if so, what the correct way to improve that part would be.

I also investigated opportunities for applying this transformation to
other vector types.
This new patch improves on my previous one: we now know how to
lower a packed v16i16 shift left into a single AVX2 vpmullw when the
shift amount is a constant build_vector.
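
As a minimal illustration of the underlying rewrite (plain C++, not part
of the patch): shifting a lane left by a known constant c is the same as
multiplying it by (1 << c), so the build_vector of shift amounts simply
becomes a build_vector of power-of-two multipliers:

  #include <cassert>
  #include <cstdint>

  int main() {
    const uint16_t ShiftAmts[8] = {1, 1, 2, 3, 7, 0, 9, 11}; // as in test1 below
    uint16_t Lanes[8] = {3, 5, 7, 9, 11, 13, 15, 17};        // arbitrary inputs

    for (int i = 0; i != 8; ++i) {
      uint16_t Multiplier = uint16_t(1u << ShiftAmts[i]);
      // Both sides wrap modulo 2^16, matching the semantics of a v8i16 lane.
      // (Amounts >= 16 are not valid here; the patch emits undef for them.)
      assert(uint16_t(Lanes[i] << ShiftAmts[i]) ==
             uint16_t(Lanes[i] * Multiplier));
    }
    return 0;
  }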

Without AVX2 support, a packed v16i16 shift left is usually decomposed
into two 128-bit shifts. Each new shift is then properly expanded
into a multiply instruction (pmullw) thanks to the new transformation
introduced by this patch.

The backend already knows how to efficiently lower v8i32 shifts with
AVX2: v8i32 shifts are expanded into VPSLLV instructions.
Also, the backend already knows how to emit the AVX512f versions of
VPSLLVD and VPSLLVQ for v16i32 and v8i64 vectors.
AVX512f seems to benefit a lot from this new transformation (you can
see it if you compare the output of test vec_shift6.ll when run for
AVX512 before and after applying my patch).

A v32i16 vector shift left by a constant build_vector is initially
split into two v16i16 shifts during type legalization. Eventually the
new algorithm converts the resulting shifts into multiply instructions.

See the new test cases in vec_shift6.ll for more details.

Please let me know what you think.

Thanks,
Andrea

On Thu, Feb 6, 2014 at 11:11 PM, Andrea Di Biagio
<andrea.dibiagio at gmail.com> wrote:
> Hi Jim and Nadav,
>
> thanks for the feedback!
>
> On Thu, Feb 6, 2014 at 10:33 PM, Nadav Rotem <nrotem at apple.com> wrote:
>> Please remember to update the x86 cost model in  X86TTI::getArithmeticInstrCost.  In this function you should be able to check if the LHS is a constant.
>
> Ok, I will update the cost model.
>
>> On Feb 6, 2014, at 2:26 PM, Jim Grosbach <grosbach at apple.com> wrote:
>>
>>> Hi Andrea,
>>>
>>> This is a very nice improvement, but should do a bit more, I believe.
>>>
>>> AVX2 adds 256-bit wide vector versions of these instructions, so if AVX2 is available, the same transformation should be applied to v16i16 and v8i32 shifts. Worth looking to see if AVX512 extends
>>>
>>> The test cases should check that when compiling for AVX, the VEX prefixed form of the instructions are generated instead of the SSE versions.
>>>
>
> Sure, I will send a new version of the patch that also improves AVX2
> code generation in the case of v16i16 and v8i32. I'll also have a look
> at AVX512 to identify cases that can also be improved.
>
> On my previous email I forgot to mention that this patch would also
> fix bug 18478 (http://llvm.org/bugs/show_bug.cgi?id=18478)
>
> Thanks again for your time.
> Andrea
>
>>
>>>> Hi,
>>>>
>>>> This patch teaches the backend how to efficiently lower a packed
>>>> vector shift left into a packed vector multiply if the vector of shift
>>>> counts is known to be constant (i.e. a constant build_vector).
>>>>
>>>> Instead of expanding a packed shift into a sequence of scalar shifts,
>>>> the backend should try (when possible) to convert the vector shift
>>>> into a vector multiply.
>>>>
>>>> Before this patch, a shift of a MVT::v8i16 vector by a build_vector of
>>>> constants was always scalarized into a long sequence of "vector
>>>> extracts + scalar shifts + vector insert".
>>>> With this patch, if there is SSE2 support, we emit a single vector multiply.
>>>>
>>>> The new x86 test 'vec_shift6.ll' contains some examples of code that
>>>> are affected by this patch.
>>>>
>>>> Please let me know if ok to submit.
>>>>
>>>> Thanks,
>>>> Andrea Di Biagio
>>>> SN Systems - Sony Computer Entertainment Group
>>>> <patch.diff>_______________________________________________
>>>> llvm-commits mailing list
>>>> llvm-commits at cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>
>>
-------------- next part --------------
Index: test/CodeGen/X86/avx-shift.ll
===================================================================
--- test/CodeGen/X86/avx-shift.ll	(revision 200990)
+++ test/CodeGen/X86/avx-shift.ll	(working copy)
@@ -115,8 +115,8 @@
 ; PR15141
 ; CHECK: _vshift13:
 ; CHECK-NOT: vpsll
-; CHECK: vcvttps2dq
-; CHECK-NEXT: vpmulld
+; CHECK-NOT: vcvttps2dq
+; CHECK: vpmulld
 define <4 x i32> @vshift13(<4 x i32> %in) {
   %T = shl <4 x i32> %in, <i32 0, i32 1, i32 2, i32 4>
   ret <4 x i32> %T
Index: test/CodeGen/X86/vec_shift6.ll
===================================================================
--- test/CodeGen/X86/vec_shift6.ll	(revision 0)
+++ test/CodeGen/X86/vec_shift6.ll	(revision 0)
@@ -0,0 +1,132 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7 | FileCheck %s -check-prefix=CHECK -check-prefix=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7 -mattr=+avx2 | FileCheck %s -check-prefix=CHECK -check-prefix=AVX2 -check-prefix=AVX2ONLY
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7 -mattr=+avx2,+avx512f | FileCheck %s -check-prefix=CHECK -check-prefix=AVX2 -check-prefix=AVX512
+
+
+; Verify that we don't scalarize a packed vector shift left of 16-bit
+; signed integers if the amount is a constant build_vector.
+; Check that we produce an SSE2 packed integer multiply (pmullw) instead.
+
+define <8 x i16> @test1(<8 x i16> %a) {
+  %shl = shl <8 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+  ret <8 x i16> %shl
+}
+; CHECK-LABEL: test1
+; CHECK: pmullw
+; CHECK-NEXT: ret
+
+
+define <8 x i16> @test2(<8 x i16> %a) {
+  %shl = shl <8 x i16> %a, <i16 0, i16 undef, i16 0, i16 0, i16 1, i16 undef, i16 -1, i16 1>
+  ret <8 x i16> %shl
+}
+; CHECK-LABEL: test2
+; CHECK: pmullw
+; CHECK-NEXT: ret
+
+
+; Verify that a vector shift left of 32-bit signed integers is simply expanded
+; into an SSE4.1 pmulld (instead of cvttps2dq + pmulld) if the vector of shift
+; counts is a constant build_vector.
+
+define <4 x i32> @test3(<4 x i32> %a) {
+  %shl = shl <4 x i32> %a, <i32 1, i32 -1, i32 2, i32 -3>
+  ret <4 x i32> %shl
+}
+; CHECK-LABEL: test3
+; CHECK-NOT: cvttps2dq
+; SSE: pmulld
+; AVX2: vpsllvd
+; CHECK-NEXT: ret
+
+
+define <4 x i32> @test4(<4 x i32> %a) {
+  %shl = shl <4 x i32> %a, <i32 0, i32 0, i32 1, i32 1>
+  ret <4 x i32> %shl
+}
+; CHECK-LABEL: test4
+; CHECK-NOT: cvttps2dq
+; SSE: pmulld
+; AVX2: vpsllvd
+; CHECK-NEXT: ret
+
+
+; If we have AVX/SSE2 but not AVX2, verify that the following shift is split
+; into two pmullw instructions. With AVX2, the test case below would produce
+; a single vpmullw.
+
+define <16 x i16> @test5(<16 x i16> %a) {
+  %shl = shl <16 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+  ret <16 x i16> %shl
+}
+; CHECK-LABEL: test5
+; SSE: pmullw
+; SSE-NEXT: pmullw
+; AVX2: vpmullw
+; AVX2-NOT: vpmullw
+; CHECK: ret
+
+
+; If we have AVX/SSE4.1 but not AVX2, verify that the following shift is split
+; into two pmulld instructions. With AVX2, the test case below would produce
+; a single vpsllvd instead.
+
+define <8 x i32> @test6(<8 x i32> %a) {
+  %shl = shl <8 x i32> %a, <i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3>
+  ret <8 x i32> %shl
+}
+; CHECK-LABEL: test6
+; SSE: pmulld
+; SSE-NEXT: pmulld
+; AVX2: vpsllvd
+; CHECK: ret
+
+
+; With AVX2 and AVX512, the test case below should produce a sequence of
+; two vpmullw instructions. With SSE2, the shift is instead split into four
+; parts, and each part is then converted into a pmullw.
+
+define <32 x i16> @test7(<32 x i16> %a) {
+  %shl = shl <32 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+  ret <32 x i16> %shl
+}
+; CHECK-LABEL: test7
+; SSE: pmullw
+; SSE-NEXT: pmullw
+; SSE-NEXT: pmullw
+; SSE-NEXT: pmullw
+; AVX2: vpmullw
+; AVX2-NEXT: vpmullw
+; CHECK: ret
+
+
+; Similar to test7; the difference is that with AVX512 support
+; we only produce a single vpsllvd/vpsllvq instead of a pair of vpsllvd/vpsllvq.
+
+define <16 x i32> @test8(<16 x i32> %a) {
+  %shl = shl <16 x i32> %a, <i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3>
+  ret <16 x i32> %shl
+}
+; CHECK-LABEL: test8
+; SSE: pmulld
+; SSE-NEXT: pmulld
+; SSE-NEXT: pmulld
+; SSE-NEXT: pmulld
+; AVX2ONLY: vpsllvd
+; AVX2ONLY-NEXT: vpsllvd
+; AVX512: vpsllvd
+; AVX512-NOT: vpsllvd
+; CHECK: ret
+
+
+define <8 x i64> @test9(<8 x i64> %a) {
+  %shl = shl <8 x i64> %a, <i64 1, i64 1, i64 2, i64 3, i64 1, i64 1, i64 2, i64 3>
+  ret <8 x i64> %shl
+}
+; CHECK-LABEL: test9
+; AVX2ONLY: vpsllvq
+; AVX2ONLY-NEXT: vpsllvq
+; AVX512: vpsllvq
+; AVX512-NOT: vpsllvq
+; CHECK: ret
+
Index: lib/Target/X86/X86TargetTransformInfo.cpp
===================================================================
--- lib/Target/X86/X86TargetTransformInfo.cpp	(revision 200990)
+++ lib/Target/X86/X86TargetTransformInfo.cpp	(working copy)
@@ -225,6 +225,14 @@
 
   // Look for AVX2 lowering tricks.
   if (ST->hasAVX2()) {
+    if (ISD == ISD::SHL &&
+        Op2Info == TargetTransformInfo::OK_UniformConstantValue) {
+      // On AVX2, a packed v16i16 shift left by a constant build_vector
+      // is lowered into a vector multiply (vpmullw).
+      if (LT.second == MVT::v16i16)
+        return LT.first;
+    }
+
     int Idx = CostTableLookup(AVX2CostTable, ISD, LT.second);
     if (Idx != -1)
       return LT.first * AVX2CostTable[Idx].Cost;
Index: lib/Target/X86/X86ISelLowering.cpp
===================================================================
--- lib/Target/X86/X86ISelLowering.cpp	(revision 200990)
+++ lib/Target/X86/X86ISelLowering.cpp	(working copy)
@@ -13152,6 +13152,39 @@
       return Op;
   }
 
+  // If possible, lower this packed shift into a vector multiply instead of
+  // expanding it into a sequence of scalar shifts.
+  // Do this only if the vector shift count is a constant build_vector.
+  if (Op.getOpcode() == ISD::SHL && 
+      (VT == MVT::v8i16 || VT == MVT::v4i32 ||
+       (Subtarget->hasInt256() && VT == MVT::v16i16)) &&
+      ISD::isBuildVectorOfConstantSDNodes(Amt.getNode())) {
+    SmallVector<SDValue, 8> Elts;
+    EVT SVT = VT.getScalarType();
+    unsigned SVTBits = SVT.getSizeInBits();
+    APInt One(SVTBits, 1);
+    unsigned NumElems = VT.getVectorNumElements();
+
+    for (unsigned i = 0; i != NumElems; ++i) {
+      SDValue Op = Amt->getOperand(i);
+      if (Op->getOpcode() == ISD::UNDEF) {
+        Elts.push_back(Op);
+        continue;
+      }
+
+      ConstantSDNode *ND = cast<ConstantSDNode>(Op);
+      APInt C(SVTBits, ND->getAPIntValue().getZExtValue());
+      uint64_t ShAmt = C.getZExtValue();
+      if (ShAmt >= SVTBits) {
+        Elts.push_back(DAG.getUNDEF(SVT));
+        continue;
+      }
+      Elts.push_back(DAG.getConstant(One.shl(ShAmt), SVT));
+    }
+    SDValue BV = DAG.getNode(ISD::BUILD_VECTOR, dl, VT, &Elts[0], NumElems);
+    return DAG.getNode(ISD::MUL, dl, VT, R, BV);
+  }
+ 
   // Lower SHL with variable shift amount.
   if (VT == MVT::v4i32 && Op->getOpcode() == ISD::SHL) {
     Op = DAG.getNode(ISD::SHL, dl, VT, Amt, DAG.getConstant(23, VT));

