[PATCH][X86] Teach the backend how to lower vector shift left into multiply rather than scalarizing it.

Andrea Di Biagio andrea.dibiagio at gmail.com
Tue Feb 11 12:29:16 PST 2014


Hi Nadav and Jim,

Sorry for sending another version of the patch, but I think I have found how
to properly fix/improve the cost model.
For simplicity I have split the patch into two: a patch that
introduces the new lowering rule for vector shifts and a patch that
improves the x86 cost model.

Test vec_shift6.ll should now cover all the cases where vector shifts
are expanded into multiply instructions.
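
To make the transformation concrete, here is a hand-written IR sketch (the
function name is made up for illustration and is not one of the test cases):
a shift left by a constant build_vector is equivalent to a multiply by the
corresponding powers of two, which on SSE2 selects to a single pmullw.

  define <8 x i16> @shl_by_constants(<8 x i16> %a) {
    ; Each lane is shifted by a different constant amount...
    %shl = shl <8 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
    ; ...which is the same as:
    ;   mul <8 x i16> %a, <i16 2, i16 2, i16 4, i16 8, i16 128, i16 1, i16 512, i16 2048>
    ret <8 x i16> %shl
  }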

I introduced a new 'OperandValueKind' into the cost model to identify
operands which are constants but not constant splats.
I called it 'OK_NonUniformConstantValue' to distinguish it from
'OK_UniformConstantValue', which is used for splat values.

I modified CostModel.cpp so that it can return 'OK_NonUniformConstantValue'
for non-splat operands that are instances of ConstantVector or
ConstantDataVector.
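
As a small hand-written example of the distinction (again, not one of the
test cases): a constant splat shift amount is still classified as
'OK_UniformConstantValue', while a vector of distinct constants is now
reported as 'OK_NonUniformConstantValue'.

  define <4 x i32> @operand_kinds(<4 x i32> %a) {
    ; Second operand is a constant splat -> OK_UniformConstantValue.
    %s0 = shl <4 x i32> %a, <i32 2, i32 2, i32 2, i32 2>
    ; Second operand is constant but not a splat -> OK_NonUniformConstantValue.
    %s1 = shl <4 x i32> %s0, <i32 0, i32 1, i32 2, i32 3>
    ret <4 x i32> %s1
  }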

I verified that the cost model still produces the expected results on
other non-x86 targets (ARM and PPC).
On the X86 backend, I modified X86TargetTransformInfo to produce the
'expected' cost for vector shift left instructions that can be lowered
as a vector multiply.
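
For example, for a v4i32 shift by a non-uniform constant vector like the one
below (this mirrors 'test3' of the new cost-model test; the function is
renamed here for clarity), the reported cost is now 6 with plain SSE2, where
the resulting multiply is custom lowered into shuffles and two pmuludq, and 1
once SSE4.1's pmulld is available.

  define <4 x i32> @shl_v4i32(<4 x i32> %a) {
    %shl = shl <4 x i32> %a, <i32 1, i32 -1, i32 2, i32 -3>
    ret <4 x i32> %shl
  }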

Finally, I added a new test to verify that the output of opt
'-cost-model -analyze' is valid in the following configurations: SSE2,
SSE4.1, AVX, AVX2.
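
For reference, the RUN lines use invocations of the following form; the
SSE2-only configuration is obtained by explicitly disabling SSE4.1 on corei7:

  opt < vshift-cost.ll -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7 \
      -mattr=+sse2,-sse4.1 -cost-model -analyze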

I didn't add a cost model for AVX512f, but I think (if it is ok for
you) that this can be done as a separate patch.

Please let me know what you think.

Thanks,
Andrea

On Fri, Feb 7, 2014 at 8:51 PM, Andrea Di Biagio
<andrea.dibiagio at gmail.com> wrote:
> Hi Jim and Nadav,
>
> here is a new version of the patch.
>
> The cost model is able to deal with most of the cases where the second
> operand of a SHL is of kind OK_UniformConstantValue.
> As far as I can see, the only missing rule is for the case where we
> have a v16i16 SHL in AVX2.
> I added a rule in X86TTI::getArithmeticInstrCost to update the cost
> for that specific case.
>
> One of the problems I have encountered is that method 'getOperandInfo'
> in CostModel either returns OK_AnyValue or OK_UniformConstantValue (I
> didn't find any case where we return OK_UniformValue).
> All the shifts optimized by my patch are shifts where the second
> operand is a ConstantDataVector but not a splat. Therefore, function
> getOperandInfo in CostModel.cpp will always return OK_AnyValue for
> them.
>
> In conclusion, I am not sure if I have modified that part correctly.
> Please let me know if you had something else in mind and in case what
> should be the correct way to improve that part.
>
> I investigated opportunities for applying this transformation to
> other vector types.
> This new patch improves my previous patch since we now know how to
> expand a packed v16i16 shift left into a single AVX2 vpmullw when the
> shift amount is a constant build_vector.
>
> Without AVX2 support, a packed v16i16 shift left is usually decomposed
> into two 128-bit shifts. Each new shift is then properly expanded
> into a multiply instruction (pmullw) thanks to the new transformation
> introduced by this patch.
>
> The backend already knows how to efficiently lower v8i32 shifts with
> AVX2: v8i32 shifts are expanded into VPSLLV instructions.
> Also, the backend already knows how to emit the AVX512f versions of
> VPSLLVD and VPSLLVQ in the case of v16i32 and v8i64 vectors.
> AVX512f seems to benefit a lot from this new transformation (you can
> see it if you compare the output of test avx_shift6.ll when run for
> AVX512 before and after applying my patch).
>
> A v32i16 vector shift left by a constant build_vector is initially split
> into two v16i16 shifts during type legalization. Eventually the new
> algorithm converts the resulting shifts into multiply instructions.
>
> See new test-cases avx_shift6.ll for more details.
>
> Please let me know what you think.
>
> Thanks,
> Andrea
>
> On Thu, Feb 6, 2014 at 11:11 PM, Andrea Di Biagio
> <andrea.dibiagio at gmail.com> wrote:
>> Hi Jim and Nadav,
>>
>> thanks for the feedback!
>>
>> On Thu, Feb 6, 2014 at 10:33 PM, Nadav Rotem <nrotem at apple.com> wrote:
>>> Please remember to update the x86 cost model in  X86TTI::getArithmeticInstrCost.  In this function you should be able to check if the LHS is a constant.
>>
>> Ok, I will update the cost model.
>>
>>> On Feb 6, 2014, at 2:26 PM, Jim Grosbach <grosbach at apple.com> wrote:
>>>
>>>> Hi Andrea,
>>>>
>>>> This is a very nice improvement, but it should do a bit more, I believe.
>>>>
>>>> AVX2 adds 256-bit wide vector versions of these instructions, so if AVX2 is available, the same transformation should be applied to v16i16 and v8i32 shifts. Worth looking to see if AVX512 extends these as well.
>>>>
>>>> The test cases should check that when compiling for AVX, the VEX-prefixed forms of the instructions are generated instead of the SSE versions.
>>>>
>>
>> Sure, I will send a new version of the patch that also improves AVX2
>> code generation in the case of v16i16 and v8i32. I'll also have a look
>> at AVX512 to identify cases that can also be improved.
>>
>> In my previous email I forgot to mention that this patch would also
>> fix bug 18478 (http://llvm.org/bugs/show_bug.cgi?id=18478).
>>
>> Thanks again for your time.
>> Andrea
>>
>>>
>>>>> Hi,
>>>>>
>>>>> This patch teaches the backend how to efficiently lower a packed
>>>>> vector shift left into a packed vector multiply if the vector of shift
>>>>> counts is known to be constant (i.e. a constant build_vector).
>>>>>
>>>>> Instead of expanding a packed shift into a sequence of scalar shifts,
>>>>> the backend should try (when possible) to convert the vector shift
>>>>> into a vector multiply.
>>>>>
>>>>> Before this patch, a shift of a MVT::v8i16 vector by a build_vector of
>>>>> constants was always scalarized into a long sequence of "vector
>>>>> extracts + scalar shifts + vector insert".
>>>>> With this patch, if there is SSE2 support, we emit a single vector multiply.
>>>>>
>>>>> The new x86 test 'vec_shift6.ll' contains some examples of code that
>>>>> are affected by this patch.
>>>>>
>>>>> Please let me know if it is ok to submit.
>>>>>
>>>>> Thanks,
>>>>> Andrea Di Biagio
>>>>> SN Systems - Sony Computer Entertainment Group
>>>>> <patch.diff>_______________________________________________
>>>>> llvm-commits mailing list
>>>>> llvm-commits at cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>
>>>
-------------- next part --------------
Index: include/llvm/Analysis/TargetTransformInfo.h
===================================================================
--- include/llvm/Analysis/TargetTransformInfo.h	(revision 201171)
+++ include/llvm/Analysis/TargetTransformInfo.h	(working copy)
@@ -321,9 +321,10 @@
 
   /// \brief Additional information about an operand's possible values.
   enum OperandValueKind {
-    OK_AnyValue,            // Operand can have any value.
-    OK_UniformValue,        // Operand is uniform (splat of a value).
-    OK_UniformConstantValue // Operand is uniform constant.
+    OK_AnyValue,                 // Operand can have any value.
+    OK_UniformValue,             // Operand is uniform (splat of a value).
+    OK_UniformConstantValue,     // Operand is uniform constant.
+    OK_NonUniformConstantValue   // Operand is a non uniform constant value.
   };
 
   /// \return The number of scalar or vector registers that the target has.
Index: lib/Analysis/CostModel.cpp
===================================================================
--- lib/Analysis/CostModel.cpp	(revision 201171)
+++ lib/Analysis/CostModel.cpp	(working copy)
@@ -98,15 +98,20 @@
   TargetTransformInfo::OperandValueKind OpInfo =
     TargetTransformInfo::OK_AnyValue;
 
-  // Check for a splat of a constant.
+  // Check for a splat of a constant or for a non uniform vector of constants.
   ConstantDataVector *CDV = 0;
-  if ((CDV = dyn_cast<ConstantDataVector>(V)))
+  if ((CDV = dyn_cast<ConstantDataVector>(V))) {
+    OpInfo = TargetTransformInfo::OK_NonUniformConstantValue;
     if (CDV->getSplatValue() != NULL)
       OpInfo = TargetTransformInfo::OK_UniformConstantValue;
+  }
+
   ConstantVector *CV = 0;
-  if ((CV = dyn_cast<ConstantVector>(V)))
+  if ((CV = dyn_cast<ConstantVector>(V))) {
+    OpInfo = TargetTransformInfo::OK_NonUniformConstantValue;
     if (CV->getSplatValue() != NULL)
       OpInfo = TargetTransformInfo::OK_UniformConstantValue;
+  }
 
   return OpInfo;
 }
Index: lib/Target/X86/X86TargetTransformInfo.cpp
===================================================================
--- lib/Target/X86/X86TargetTransformInfo.cpp	(revision 201171)
+++ lib/Target/X86/X86TargetTransformInfo.cpp	(working copy)
@@ -225,6 +225,13 @@
 
   // Look for AVX2 lowering tricks.
   if (ST->hasAVX2()) {
+    if (ISD == ISD::SHL && LT.second == MVT::v16i16 &&
+        (Op2Info == TargetTransformInfo::OK_UniformConstantValue ||
+         Op2Info == TargetTransformInfo::OK_NonUniformConstantValue))
+      // On AVX2, a packed v16i16 shift left by a constant build_vector
+      // is lowered into a vector multiply (vpmullw).
+      return LT.first;
+
     int Idx = CostTableLookup(AVX2CostTable, ISD, LT.second);
     if (Idx != -1)
       return LT.first * AVX2CostTable[Idx].Cost;
@@ -257,6 +264,20 @@
       return LT.first * SSE2UniformConstCostTable[Idx].Cost;
   }
 
+  if (ISD == ISD::SHL &&
+      Op2Info == TargetTransformInfo::OK_NonUniformConstantValue) {
+    EVT VT = LT.second;
+    if ((VT == MVT::v8i16 && ST->hasSSE2()) ||
+        (VT == MVT::v4i32 && ST->hasSSE41()))
+      // Vector shift left by non uniform constant can be lowered
+      // into vector multiply (pmullw/pmulld).
+      return LT.first;
+    if (VT == MVT::v4i32 && ST->hasSSE2())
+      // A vector shift left by non uniform constant is converted
+      // into a vector multiply; the new multiply is eventually
+      // lowered into a sequence of shuffles and 2 x pmuludq.
+      ISD = ISD::MUL;
+  }
 
   static const CostTblEntry<MVT::SimpleValueType> SSE2CostTable[] = {
     // We don't correctly identify costs of casts because they are marked as
@@ -271,6 +292,7 @@
     { ISD::SHL,  MVT::v8i16,  8*10 }, // Scalarized.
     { ISD::SHL,  MVT::v4i32,  2*5 }, // We optimized this using mul.
     { ISD::SHL,  MVT::v2i64,  2*10 }, // Scalarized.
+    { ISD::SHL,  MVT::v4i64,  4*10 }, // Scalarized.
 
     { ISD::SRL,  MVT::v16i8,  16*10 }, // Scalarized.
     { ISD::SRL,  MVT::v8i16,  8*10 }, // Scalarized.
@@ -308,6 +330,7 @@
     // We don't have to scalarize unsupported ops. We can issue two half-sized
     // operations and we only need to extract the upper YMM half.
     // Two ops + 1 extract + 1 insert = 4.
+    { ISD::MUL,     MVT::v16i16,   4 },
     { ISD::MUL,     MVT::v8i32,    4 },
     { ISD::SUB,     MVT::v8i32,    4 },
     { ISD::ADD,     MVT::v8i32,    4 },
@@ -323,6 +346,14 @@
 
   // Look for AVX1 lowering tricks.
   if (ST->hasAVX() && !ST->hasAVX2()) {
+    EVT VT = LT.second;
+
+    // v16i16 and v8i32 shifts by non-uniform constants are lowered into a
+    // sequence of extract + two vector multiply + insert.
+    if (ISD == ISD::SHL && (VT == MVT::v8i32 || VT == MVT::v16i16) &&
+        Op2Info == TargetTransformInfo::OK_NonUniformConstantValue)
+      ISD = ISD::MUL;
+
     int Idx = CostTableLookup(AVX1CostTable, ISD, LT.second);
     if (Idx != -1)
       return LT.first * AVX1CostTable[Idx].Cost;
@@ -343,7 +374,7 @@
   // 2x pmuludq, 2x shuffle.
   if (ISD == ISD::MUL && LT.second == MVT::v4i32 && ST->hasSSE2() &&
       !ST->hasSSE41())
-    return 6;
+    return LT.first * 6;
 
   // Fallback to the default implementation.
   return TargetTransformInfo::getArithmeticInstrCost(Opcode, Ty, Op1Info,
Index: test/Analysis/CostModel/X86/vshift-cost.ll
===================================================================
--- test/Analysis/CostModel/X86/vshift-cost.ll	(revision 0)
+++ test/Analysis/CostModel/X86/vshift-cost.ll	(revision 0)
@@ -0,0 +1,167 @@
+; RUN: opt < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7 -mattr=+sse2,-sse4.1 -cost-model -analyze | FileCheck %s -check-prefix=CHECK -check-prefix=SSE2
+; RUN: opt < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7 -cost-model -analyze | FileCheck %s -check-prefix=CHECK -check-prefix=SSE41
+; RUN: opt < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7-avx -cost-model -analyze | FileCheck %s -check-prefix=CHECK -check-prefix=AVX
+; RUN: opt < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=core-avx2 -cost-model -analyze | FileCheck %s -check-prefix=CHECK -check-prefix=AVX2
+
+
+; Verify the cost of vector shift left instructions.
+
+; We always emit a single pmullw in the case of v8i16 vector shifts by
+; non-uniform constant.
+
+define <8 x i16> @test1(<8 x i16> %a) {
+  %shl = shl <8 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+  ret <8 x i16> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test1':
+; CHECK: Found an estimated cost of 1 for instruction:   %shl
+
+
+define <8 x i16> @test2(<8 x i16> %a) {
+  %shl = shl <8 x i16> %a, <i16 0, i16 undef, i16 0, i16 0, i16 1, i16 undef, i16 -1, i16 1>
+  ret <8 x i16> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test2':
+; CHECK: Found an estimated cost of 1 for instruction:   %shl
+
+
+; With SSE4.1, v4i32 shifts can be lowered into a single pmulld instruction.
+; Make sure that the estimated cost is always 1 except for the case where
+; we only have SSE2 support. With SSE2, we are forced to special lower the
+; v4i32 mul as a 2x shuffle, 2x pmuludq, 2x shuffle.
+
+define <4 x i32> @test3(<4 x i32> %a) {
+  %shl = shl <4 x i32> %a, <i32 1, i32 -1, i32 2, i32 -3>
+  ret <4 x i32> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test3':
+; SSE2: Found an estimated cost of 6 for instruction:   %shl
+; SSE41: Found an estimated cost of 1 for instruction:   %shl
+; AVX: Found an estimated cost of 1 for instruction:   %shl
+; AVX2: Found an estimated cost of 1 for instruction:   %shl
+
+
+define <4 x i32> @test4(<4 x i32> %a) {
+  %shl = shl <4 x i32> %a, <i32 0, i32 0, i32 1, i32 1>
+  ret <4 x i32> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test4':
+; SSE2: Found an estimated cost of 6 for instruction:   %shl
+; SSE41: Found an estimated cost of 1 for instruction:   %shl
+; AVX: Found an estimated cost of 1 for instruction:   %shl
+; AVX2: Found an estimated cost of 1 for instruction:   %shl
+
+
+; On AVX2 we are able to lower the following shift into a single
+; vpsllvq. Therefore, the expected cost is only 1.
+; In all other cases, this shift is scalarized as the target does not support
+; vpsllv instructions.
+
+define <2 x i64> @test5(<2 x i64> %a) {
+  %shl = shl <2 x i64> %a, <i64 2, i64 3>
+  ret <2 x i64> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test5':
+; SSE2: Found an estimated cost of 20 for instruction:   %shl
+; SSE41: Found an estimated cost of 20 for instruction:   %shl
+; AVX: Found an estimated cost of 20 for instruction:   %shl
+; AVX2: Found an estimated cost of 1 for instruction:   %shl
+
+
+; v16i16 and v8i32 shift left by non-uniform constant are lowered into
+; vector multiply instructions.  With AVX (but not AVX2), the vector multiply
+; is lowered into a sequence of: 1 extract + 2 vpmullw + 1 insert.
+;
+; With AVX2, instruction vpmullw works with 256-bit quantities and
+; therefore there is no need to split the resulting vector multiply into
+; a sequence of two multiplies.
+;
+; With SSE2 and SSE4.1, the vector shift cost for 'test6' is twice
+; the cost computed in the case of 'test1'. That is because the backend
+; simply emits 2 pmullw with no extract/insert.
+
+
+define <16 x i16> @test6(<16 x i16> %a) {
+  %shl = shl <16 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+  ret <16 x i16> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test6':
+; SSE2: Found an estimated cost of 2 for instruction:   %shl
+; SSE41: Found an estimated cost of 2 for instruction:   %shl
+; AVX: Found an estimated cost of 4 for instruction:   %shl
+; AVX2: Found an estimated cost of 1 for instruction:   %shl
+
+
+; With SSE2 and SSE4.1, the vector shift cost for 'test7' is twice
+; the cost computed in the case of 'test3'. That is because the multiply
+; is type-legalized into two v4i32 vector multiplies.
+
+define <8 x i32> @test7(<8 x i32> %a) {
+  %shl = shl <8 x i32> %a, <i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3>
+  ret <8 x i32> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test7':
+; SSE2: Found an estimated cost of 12 for instruction:   %shl
+; SSE41: Found an estimated cost of 2 for instruction:   %shl
+; AVX: Found an estimated cost of 4 for instruction:   %shl
+; AVX2: Found an estimated cost of 1 for instruction:   %shl
+
+
+; On AVX2 we are able to lower the following shift into a single
+; vpsllvq. Therefore, the expected cost is only 1.
+; In all other cases, this shift is scalarized as the target does not support
+; vpsllv instructions.
+
+define <4 x i64> @test8(<4 x i64> %a) {
+  %shl = shl <4 x i64> %a, <i64 1, i64 2, i64 3, i64 4>
+  ret <4 x i64> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test8':
+; SSE2: Found an estimated cost of 40 for instruction:   %shl
+; SSE41: Found an estimated cost of 40 for instruction:   %shl
+; AVX: Found an estimated cost of 40 for instruction:   %shl
+; AVX2: Found an estimated cost of 1 for instruction:   %shl
+
+
+; Same as 'test6', with the difference that the cost is double.
+
+define <32 x i16> @test9(<32 x i16> %a) {
+  %shl = shl <32 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+  ret <32 x i16> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test9':
+; SSE2: Found an estimated cost of 4 for instruction:   %shl
+; SSE41: Found an estimated cost of 4 for instruction:   %shl
+; AVX: Found an estimated cost of 8 for instruction:   %shl
+; AVX2: Found an estimated cost of 2 for instruction:   %shl
+
+
+; Same as 'test7', except that now the cost is double.
+
+define <16 x i32> @test10(<16 x i32> %a) {
+  %shl = shl <16 x i32> %a, <i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3>
+  ret <16 x i32> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test10':
+; SSE2: Found an estimated cost of 24 for instruction:   %shl
+; SSE41: Found an estimated cost of 4 for instruction:   %shl
+; AVX: Found an estimated cost of 8 for instruction:   %shl
+; AVX2: Found an estimated cost of 2 for instruction:   %shl
+
+
+; On AVX2 we are able to lower the following shift into a sequence of
+; two vpsllvq instructions. Therefore, the expected cost is only 2.
+; In all other cases, this shift is scalarized as we don't have vpsllv
+; instructions.
+
+define <8 x i64> @test11(<8 x i64> %a) {
+  %shl = shl <8 x i64> %a, <i64 1, i64 1, i64 2, i64 3, i64 1, i64 1, i64 2, i64 3>
+  ret <8 x i64> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test11':
+; SSE2: Found an estimated cost of 80 for instruction:   %shl
+; SSE41: Found an estimated cost of 80 for instruction:   %shl
+; AVX: Found an estimated cost of 80 for instruction:   %shl
+; AVX2: Found an estimated cost of 2 for instruction:   %shl
+
+
-------------- next part --------------
Index: lib/Target/X86/X86ISelLowering.cpp
===================================================================
--- lib/Target/X86/X86ISelLowering.cpp	(revision 201171)
+++ lib/Target/X86/X86ISelLowering.cpp	(working copy)
@@ -13156,6 +13156,39 @@
       return Op;
   }
 
+  // If possible, lower this packed shift into a vector multiply instead of
+  // expanding it into a sequence of scalar shifts.
+  // Do this only if the vector shift count is a constant build_vector.
+  if (Op.getOpcode() == ISD::SHL && 
+      (VT == MVT::v8i16 || VT == MVT::v4i32 ||
+       (Subtarget->hasInt256() && VT == MVT::v16i16)) &&
+      ISD::isBuildVectorOfConstantSDNodes(Amt.getNode())) {
+    SmallVector<SDValue, 8> Elts;
+    EVT SVT = VT.getScalarType();
+    unsigned SVTBits = SVT.getSizeInBits();
+    const APInt &One = APInt(SVTBits, 1);
+    unsigned NumElems = VT.getVectorNumElements();
+
+    for (unsigned i=0; i !=NumElems; ++i) {
+      SDValue Op = Amt->getOperand(i);
+      if (Op->getOpcode() == ISD::UNDEF) {
+        Elts.push_back(Op);
+        continue;
+      }
+
+      ConstantSDNode *ND = cast<ConstantSDNode>(Op);
+      const APInt &C = APInt(SVTBits, ND->getAPIntValue().getZExtValue());
+      uint64_t ShAmt = C.getZExtValue();
+      if (ShAmt >= SVTBits) {
+        Elts.push_back(DAG.getUNDEF(SVT));
+        continue;
+      }
+      Elts.push_back(DAG.getConstant(One.shl(ShAmt), SVT));
+    }
+    SDValue BV = DAG.getNode(ISD::BUILD_VECTOR, dl, VT, &Elts[0], NumElems);
+    return DAG.getNode(ISD::MUL, dl, VT, R, BV);
+  }
+ 
   // Lower SHL with variable shift amount.
   if (VT == MVT::v4i32 && Op->getOpcode() == ISD::SHL) {
     Op = DAG.getNode(ISD::SHL, dl, VT, Amt, DAG.getConstant(23, VT));
Index: test/CodeGen/X86/vec_shift6.ll
===================================================================
--- test/CodeGen/X86/vec_shift6.ll	(revision 0)
+++ test/CodeGen/X86/vec_shift6.ll	(revision 0)
@@ -0,0 +1,134 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7 | FileCheck %s -check-prefix=CHECK -check-prefix=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=core-avx2 | FileCheck %s -check-prefix=CHECK -check-prefix=AVX2 -check-prefix=AVX2ONLY
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=knl | FileCheck %s -check-prefix=CHECK -check-prefix=AVX2 -check-prefix=AVX512
+
+
+; Verify that we don't scalarize a packed vector shift left of 16-bit
+; signed integers if the amount is a constant build_vector.
+; Check that we produce a SSE2 packed integer multiply (pmullw) instead.
+
+define <8 x i16> @test1(<8 x i16> %a) {
+  %shl = shl <8 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+  ret <8 x i16> %shl
+}
+; CHECK-LABEL: test1
+; CHECK: pmullw
+; CHECK-NEXT: ret
+
+
+define <8 x i16> @test2(<8 x i16> %a) {
+  %shl = shl <8 x i16> %a, <i16 0, i16 undef, i16 0, i16 0, i16 1, i16 undef, i16 -1, i16 1>
+  ret <8 x i16> %shl
+}
+; CHECK-LABEL: test2
+; CHECK: pmullw
+; CHECK-NEXT: ret
+
+
+; Verify that a vector shift left of 32-bit signed integers is simply expanded
+; into a SSE4.1 pmulld (instead of cvttps2dq + pmulld) if the vector of shift
+; counts is a constant build_vector.
+
+define <4 x i32> @test3(<4 x i32> %a) {
+  %shl = shl <4 x i32> %a, <i32 1, i32 -1, i32 2, i32 -3>
+  ret <4 x i32> %shl
+}
+; CHECK-LABEL: test3
+; CHECK-NOT: cvttps2dq
+; SSE: pmulld
+; AVX2: vpsllvd
+; CHECK-NEXT: ret
+
+
+define <4 x i32> @test4(<4 x i32> %a) {
+  %shl = shl <4 x i32> %a, <i32 0, i32 0, i32 1, i32 1>
+  ret <4 x i32> %shl
+}
+; CHECK-LABEL: test4
+; CHECK-NOT: cvttps2dq
+; SSE: pmulld
+; AVX2: vpsllvd
+; CHECK-NEXT: ret
+
+
+; If we have AVX/SSE2 but not AVX2, verify that the following shift is split
+; into two pmullw instructions. With AVX2, the test case below would produce
+; a single vpmullw.
+
+define <16 x i16> @test5(<16 x i16> %a) {
+  %shl = shl <16 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+  ret <16 x i16> %shl
+}
+; CHECK-LABEL: test5
+; SSE: pmullw
+; SSE-NEXT: pmullw
+; AVX2: vpmullw
+; AVX2-NOT: vpmullw
+; CHECK: ret
+
+
+; If we have AVX/SSE4.1 but not AVX2, verify that the following shift is split
+; into two pmulld instructions. With AVX2, the test case below would produce
+; a single vpsllvd instead.
+
+define <8 x i32> @test6(<8 x i32> %a) {
+  %shl = shl <8 x i32> %a, <i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3>
+  ret <8 x i32> %shl
+}
+; CHECK-LABEL: test6
+; SSE: pmulld
+; SSE-NEXT: pmulld
+; AVX2: vpsllvd
+; CHECK: ret
+
+
+; With AVX2 and AVX512, the test case below should produce a sequence of
+; two vpmullw instructions. On SSE2 instead, we split the shift into four
+; parts and then convert each part into a pmullw.
+
+define <32 x i16> @test7(<32 x i16> %a) {
+  %shl = shl <32 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+  ret <32 x i16> %shl
+}
+; CHECK-LABEL: test7
+; SSE: pmullw
+; SSE-NEXT: pmullw
+; SSE-NEXT: pmullw
+; SSE-NEXT: pmullw
+; AVX2: vpmullw
+; AVX2-NEXT: vpmullw
+; CHECK: ret
+
+
+; Similar to test7; the difference is that with AVX512 support
+; we only produce a single vpsllvd/vpsllvq instead of a pair of vpsllvd/vpsllvq.
+
+define <16 x i32> @test8(<16 x i32> %a) {
+  %shl = shl <16 x i32> %a, <i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3>
+  ret <16 x i32> %shl
+}
+; CHECK-LABEL: test8
+; SSE: pmulld
+; SSE-NEXT: pmulld
+; SSE-NEXT: pmulld
+; SSE-NEXT: pmulld
+; AVX2ONLY: vpsllvd
+; AVX2ONLY-NEXT: vpsllvd
+; AVX512: vpsllvd
+; AVX512-NOT: vpsllvd
+; CHECK: ret
+
+
+; The shift from 'test9' gets scalarized if we don't have AVX2/AVX512f support.
+
+define <8 x i64> @test9(<8 x i64> %a) {
+  %shl = shl <8 x i64> %a, <i64 1, i64 1, i64 2, i64 3, i64 1, i64 1, i64 2, i64 3>
+  ret <8 x i64> %shl
+}
+; CHECK-LABEL: test9
+; AVX2ONLY: vpsllvq
+; AVX2ONLY-NEXT: vpsllvq
+; AVX512: vpsllvq
+; AVX512-NOT: vpsllvq
+; CHECK: ret
+
Index: test/CodeGen/X86/avx-shift.ll
===================================================================
--- test/CodeGen/X86/avx-shift.ll	(revision 201171)
+++ test/CodeGen/X86/avx-shift.ll	(working copy)
@@ -115,8 +115,8 @@
 ; PR15141
 ; CHECK: _vshift13:
 ; CHECK-NOT: vpsll
-; CHECK: vcvttps2dq
-; CHECK-NEXT: vpmulld
+; CHECK-NOT: vcvttps2dq
+; CHECK: vpmulld
 define <4 x i32> @vshift13(<4 x i32> %in) {
   %T = shl <4 x i32> %in, <i32 0, i32 1, i32 2, i32 4>
   ret <4 x i32> %T


More information about the llvm-commits mailing list