[PATCH][X86] Teach the backend how to lower vector shift left into multiply rather than scalarizing it.
Andrea Di Biagio
andrea.dibiagio at gmail.com
Wed Feb 12 12:04:46 PST 2014
Hi Nadav,
This new patch should fix the Vectorizer as well.
The differences from the previous version are:
- Method 'getEntryCost' in the SLPVectorizer now correctly
identifies uniform and non-uniform constant values.
- Method 'getInstrCost' in BBVectorize now accepts two extra
parameters of type OperandValueKind.
- Method 'areInstsCompatible' in BBVectorize now correctly computes
the cost of a merged shift instruction based on the knowledge of the
operand value kinds.
Example 1.
Given the following pair:
  %shl1 = shl A, 1
  %shl2 = shl B, 1
the operand value kind for the second operand of the merged shift
will be OK_UniformConstantValue.
Example 2.
Given the following pair:
  %shl1 = shl A, 1
  %shl2 = shl B, 2
the operand value kind for the second operand of the merged shift
will be OK_NonUniformConstantValue.
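To make the decision concrete, the logic added to 'areInstsCompatible'
boils down to roughly the following (a simplified sketch of the attached
BBVectorize change, not the complete code):

  // Decide the operand value kind of the merged shift from the two
  // scalar shift amounts of the paired instructions I and J.
  TargetTransformInfo::OperandValueKind Op2VK =
      TargetTransformInfo::OK_AnyValue;
  if (ConstantInt *CII = dyn_cast<ConstantInt>(I->getOperand(1)))
    if (ConstantInt *CIJ = dyn_cast<ConstantInt>(J->getOperand(1)))
      Op2VK = (CII == CIJ)
                  ? TargetTransformInfo::OK_UniformConstantValue
                  : TargetTransformInfo::OK_NonUniformConstantValue;
  // Op2VK is then passed to TTI::getArithmeticInstrCost (through
  // getInstrCost) when computing the cost of the merged shift.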
Please let me know if this is OK to submit.
Thanks :-)
Andrea
P.S.: On a slightly different topic.
While testing the SLP vectorizer, I noticed that method
'vectorizeStoreChain' conservatively limits the length of a
vectorizable chain of instructions based on the value of the global
variable 'MinVecRegSize' (which is always equal to 128).
My question is: when looking for profitable vectorizable trees,
wouldn't it make more sense to set the variable VF (also) based on the
value returned by TTI::getRegisterBitWidth()? I think this could have
a positive effect if the target supports, for example, 256-bit vectors
(so, AVX/AVX2 and AVX512f). That said, I am not an expert on the
SLPVectorizer, so I may have misunderstood the behavior of that
method :-)
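For reference, what I had in mind is roughly the following (an
illustrative sketch only, not part of the attached patches; the names
mirror the existing code in 'vectorizeStoreChain'):

  // Today VF for a store chain is derived from the fixed 128-bit
  // MinVecRegSize. The idea is to also consult the widest vector
  // register reported by the target.
  unsigned Sz = DL->getTypeSizeInBits(StoreTy);
  unsigned RegWidth = TTI->getRegisterBitWidth(/*Vector=*/true);
  unsigned VecRegSize = RegWidth ? RegWidth : MinVecRegSize;
  unsigned VF = VecRegSize / Sz;

On AVX/AVX2 targets getRegisterBitWidth(true) should report 256, which
would allow longer store chains to be considered.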
On Wed, Feb 12, 2014 at 3:34 AM, Nadav Rotem <nrotem at apple.com> wrote:
> Hi Andrea,
>
> You added a new kind of OperandValueKind and you connected the cost model computation, but you forgot one big piece: the vectorizer! The vectorizer needs to use the new OperandValueKind in order to efficiently estimate the cost of the shifts. The patches you sent LGTM, but I would also like the vectorizer to use the new operand value kind. :)
>
> Thanks,
> Nadav
>
>
>
> On Feb 11, 2014, at 12:29 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>
>> Hi Nadav and Jim,
>>
>> Sorry for sending another version of the patch but I think I found how
>> to properly fix/improve the cost model.
>> For simplicity I have split the patch into two: a patch that
>> introduces the new lowering rule for vector shifts and a patch that
>> improves the x86 cost model.
>>
>> Test vec_shift6.ll should now cover all the cases where vector shifts
>> are expanded into multiply instructions.
>>
>> I introduced into the cost model a new 'OperandValueKind' to identify
>> operands which are constants but not constant splats.
>> I called it 'OK_NonUniformConstantValue' to distinguish it from
>> 'OK_UniformConstantValue' which is used for splat values.
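>> For instance (illustrative IR only), the two shift amounts below would
>> be classified as follows:
>>   %s1 = shl <4 x i32> %x, <i32 2, i32 2, i32 2, i32 2>  ; splat: OK_UniformConstantValue
>>   %s2 = shl <4 x i32> %x, <i32 1, i32 2, i32 3, i32 4>  ; non-splat: OK_NonUniformConstantValue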
>>
>> I modified CostModel.cpp to allow returning 'OK_NonUniformConstantValue'
>> for non-splat operands that are instances of ConstantVector or
>> ConstantDataVector.
>>
>> I verified that the cost model still produces the expected results on
>> other non-x86 targets (ARM and PPC).
>> On the X86 backend, I modified X86TargetTransformInfo to produce the
>> 'expected' cost for vector shift left instructions that can be lowered
>> as a vector multiply.
>>
>> Finally I added a new test to verify that the output of opt
>> '-cost-model -analyze' is valid in the following configurations: SSE2,
>> SSE4.1, AVX, AVX2.
>>
>> I didn't add a cost model for AVX512f, but I think (if it is ok for
>> you) that this can be done as a separate patch.
>>
>> Please let me know what you think.
>>
>> Thanks,
>> Andrea
>>
>> On Fri, Feb 7, 2014 at 8:51 PM, Andrea Di Biagio
>> <andrea.dibiagio at gmail.com> wrote:
>>> Hi Jim and Nadav,
>>>
>>> here is a new version of the patch.
>>>
>>> The cost model is able to deal with most of the cases where the second
>>> operand of a SHL is of kind OK_UniformConstantValue.
>>> As far as I can see, the only missing rule is for the case where we
>>> have a v16i16 SHL in AVX2.
>>> I added a rule in X86TTI::getArithmeticInstrCost to update the cost
>>> for that specific case.
>>>
>>> One of the problems I have encountered is that method 'getOperandInfo'
>>> in CostModel either returns OK_AnyValue or OK_UniformConstantValue (I
>>> didn't find any case where we return OK_UniformValue..).
>>> All the shifts optimized by my patch are shifts where the second
>>> operand is a ConstantDataVector but not a splat. Therefore, function
>>> getOperandInfo in CostModel.cpp will always return OK_AnyValue for
>>> them.
>>>
>>> In conclusion, I am not sure if I have modified that part correctly.
>>> Please let me know if you had something else in mind and in case what
>>> should be the correct way to improve that part.
>>>
>>> I investigated other opportunities for applying this transformation to
>>> other vector types.
>>> This new patch improves my previous patch since we now know how to
>>> expand a packed v16i16 shift left into a single AVX2 vpmullw when the
>>> shift amount is a constant build_vector.
>>>
>>> Without AVX2 support, a packed v16i16 shift left is usually decomposed
>>> into two 128-bit shifts. Each new shift is then properly expanded
>>> into a multiply instruction (pmullw) thanks to the new transformation
>>> introduced by this patch.
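>>> Conceptually (illustrative IR only; %a.lo and %a.hi are just names for
>>> the two legalized halves), the v16i16 case without AVX2 becomes:
>>>   %lo = shl <8 x i16> %a.lo, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
>>>   %hi = shl <8 x i16> %a.hi, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
>>> and each half is then turned into a single pmullw by the new rule.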
>>>
>>> The backend already knows how to efficiently lower v8i32 shifts with
>>> AVX2: v8i32 shifts are expanded into VPSLLV instructions.
>>> Also, the backend already knows how to emit the AVX512f versions of
>>> VPSLLVD and VPSLLVQ in the case of v16i32 and v8i64 vectors.
>>> AVX512f seems to benefit a lot from this new transformation (you can
>>> see it if you compare the output of test vec_shift6.ll when run for
>>> AVX512 before and after applying my patch).
>>>
>>> A v32i16 vector shift left by a constant build_vector is initially
>>> split into two v16i16 shifts during type-legalization. Eventually the new
>>> algorithm converts the resulting shifts into multiply instructions.
>>>
>>> See the new test cases in vec_shift6.ll for more details.
>>>
>>> Please let me know what you think.
>>>
>>> Thanks,
>>> Andrea
>>>
>>> On Thu, Feb 6, 2014 at 11:11 PM, Andrea Di Biagio
>>> <andrea.dibiagio at gmail.com> wrote:
>>>> Hi Jim and Nadav,
>>>>
>>>> thanks for the feedback!
>>>>
>>>> On Thu, Feb 6, 2014 at 10:33 PM, Nadav Rotem <nrotem at apple.com> wrote:
>>>>> Please remember to update the x86 cost model in X86TTI::getArithmeticInstrCost. In this function you should be able to check if the LHS is a constant.
>>>>
>>>> Ok, I will update the cost model.
>>>>
>>>>> On Feb 6, 2014, at 2:26 PM, Jim Grosbach <grosbach at apple.com> wrote:
>>>>>
>>>>>> Hi Andrea,
>>>>>>
>>>>>> This is a very nice improvement, but should do a bit more, I believe.
>>>>>>
>>>>>> AVX2 adds 256-bit wide vector versions of these instructions, so if AVX2 is available, the same transformation should be applied to v16i16 and v8i32 shifts. Worth looking to see if AVX512 extends these as well.
>>>>>>
>>>>>> The test cases should check that when compiling for AVX, the VEX-prefixed forms of the instructions are generated instead of the SSE versions.
>>>>>>
>>>>
>>>> Sure, I will send a new version of the patch that also improves AVX2
>>>> code generation in the case of v16i16 and v8i32. I'll also have a look
>>>> at AVX512 to identify cases that can also be improved.
>>>>
>>>> In my previous email I forgot to mention that this patch would also
>>>> fix bug 18478 (http://llvm.org/bugs/show_bug.cgi?id=18478)
>>>>
>>>> Thanks again for your time.
>>>> Andrea
>>>>
>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> This patch teaches the backend how to efficiently lower a packed
>>>>>>> vector shift left into a packed vector multiply if the vector of shift
>>>>>>> counts is known to be constant (i.e. a constant build_vector).
>>>>>>>
>>>>>>> Instead of expanding a packed shift into a sequence of scalar shifts,
>>>>>>> the backend should try (when possible) to convert the vector shift
>>>>>>> into a vector multiply.
>>>>>>>
>>>>>>> Before this patch, a shift of a MVT::v8i16 vector by a build_vector of
>>>>>>> constants was always scalarized into a long sequence of "vector
>>>>>>> extracts + scalar shifts + vector insert".
>>>>>>> With this patch, if there is SSE2 support, we emit a single vector multiply.
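>>>>>>> For example (this is the same pattern as 'test1' in the new test), a
>>>>>>> shift by a constant build_vector becomes a multiply by the
>>>>>>> corresponding powers of two:
>>>>>>>   %shl = shl <8 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
>>>>>>> is lowered as if it were
>>>>>>>   %shl = mul <8 x i16> %a, <i16 2, i16 2, i16 4, i16 8, i16 128, i16 1, i16 512, i16 2048>
>>>>>>> and the multiply is then selected to a single pmullw.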
>>>>>>>
>>>>>>> The new x86 test 'vec_shift6.ll' contains some examples of code that
>>>>>>> are affected by this patch.
>>>>>>>
>>>>>>> Please let me know if ok to submit.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Andrea Di Biagio
>>>>>>> SN Systems - Sony Computer Entertainment Group
>>>>>>> <patch.diff>
>>>>>>
>>>>>
>> <patch-cost-model.diff><patch-v2.diff>
>
-------------- next part --------------
Index: include/llvm/Analysis/TargetTransformInfo.h
===================================================================
--- include/llvm/Analysis/TargetTransformInfo.h (revision 201248)
+++ include/llvm/Analysis/TargetTransformInfo.h (working copy)
@@ -321,9 +321,10 @@
/// \brief Additional information about an operand's possible values.
enum OperandValueKind {
- OK_AnyValue, // Operand can have any value.
- OK_UniformValue, // Operand is uniform (splat of a value).
- OK_UniformConstantValue // Operand is uniform constant.
+ OK_AnyValue, // Operand can have any value.
+ OK_UniformValue, // Operand is uniform (splat of a value).
+ OK_UniformConstantValue, // Operand is uniform constant.
+ OK_NonUniformConstantValue // Operand is a non uniform constant value.
};
/// \return The number of scalar or vector registers that the target has.
Index: lib/Analysis/CostModel.cpp
===================================================================
--- lib/Analysis/CostModel.cpp (revision 201248)
+++ lib/Analysis/CostModel.cpp (working copy)
@@ -98,15 +98,20 @@
TargetTransformInfo::OperandValueKind OpInfo =
TargetTransformInfo::OK_AnyValue;
- // Check for a splat of a constant.
+ // Check for a splat of a constant or for a non uniform vector of constants.
ConstantDataVector *CDV = 0;
- if ((CDV = dyn_cast<ConstantDataVector>(V)))
+ if ((CDV = dyn_cast<ConstantDataVector>(V))) {
+ OpInfo = TargetTransformInfo::OK_NonUniformConstantValue;
if (CDV->getSplatValue() != NULL)
OpInfo = TargetTransformInfo::OK_UniformConstantValue;
+ }
+
ConstantVector *CV = 0;
- if ((CV = dyn_cast<ConstantVector>(V)))
+ if ((CV = dyn_cast<ConstantVector>(V))) {
+ OpInfo = TargetTransformInfo::OK_NonUniformConstantValue;
if (CV->getSplatValue() != NULL)
OpInfo = TargetTransformInfo::OK_UniformConstantValue;
+ }
return OpInfo;
}
Index: lib/Target/X86/X86TargetTransformInfo.cpp
===================================================================
--- lib/Target/X86/X86TargetTransformInfo.cpp (revision 201248)
+++ lib/Target/X86/X86TargetTransformInfo.cpp (working copy)
@@ -225,6 +225,13 @@
// Look for AVX2 lowering tricks.
if (ST->hasAVX2()) {
+ if (ISD == ISD::SHL && LT.second == MVT::v16i16 &&
+ (Op2Info == TargetTransformInfo::OK_UniformConstantValue ||
+ Op2Info == TargetTransformInfo::OK_NonUniformConstantValue))
+ // On AVX2, a packed v16i16 shift left by a constant build_vector
+ // is lowered into a vector multiply (vpmullw).
+ return LT.first;
+
int Idx = CostTableLookup(AVX2CostTable, ISD, LT.second);
if (Idx != -1)
return LT.first * AVX2CostTable[Idx].Cost;
@@ -257,6 +264,20 @@
return LT.first * SSE2UniformConstCostTable[Idx].Cost;
}
+ if (ISD == ISD::SHL &&
+ Op2Info == TargetTransformInfo::OK_NonUniformConstantValue) {
+ EVT VT = LT.second;
+ if ((VT == MVT::v8i16 && ST->hasSSE2()) ||
+ (VT == MVT::v4i32 && ST->hasSSE41()))
+ // Vector shift left by non uniform constant can be lowered
+ // into vector multiply (pmullw/pmulld).
+ return LT.first;
+ if (VT == MVT::v4i32 && ST->hasSSE2())
+ // A vector shift left by non uniform constant is converted
+ // into a vector multiply; the new multiply is eventually
+ // lowered into a sequence of shuffles and 2 x pmuludq.
+ ISD = ISD::MUL;
+ }
static const CostTblEntry<MVT::SimpleValueType> SSE2CostTable[] = {
// We don't correctly identify costs of casts because they are marked as
@@ -271,6 +292,7 @@
{ ISD::SHL, MVT::v8i16, 8*10 }, // Scalarized.
{ ISD::SHL, MVT::v4i32, 2*5 }, // We optimized this using mul.
{ ISD::SHL, MVT::v2i64, 2*10 }, // Scalarized.
+ { ISD::SHL, MVT::v4i64, 4*10 }, // Scalarized.
{ ISD::SRL, MVT::v16i8, 16*10 }, // Scalarized.
{ ISD::SRL, MVT::v8i16, 8*10 }, // Scalarized.
@@ -308,6 +330,7 @@
// We don't have to scalarize unsupported ops. We can issue two half-sized
// operations and we only need to extract the upper YMM half.
// Two ops + 1 extract + 1 insert = 4.
+ { ISD::MUL, MVT::v16i16, 4 },
{ ISD::MUL, MVT::v8i32, 4 },
{ ISD::SUB, MVT::v8i32, 4 },
{ ISD::ADD, MVT::v8i32, 4 },
@@ -323,6 +346,14 @@
// Look for AVX1 lowering tricks.
if (ST->hasAVX() && !ST->hasAVX2()) {
+ EVT VT = LT.second;
+
+ // v16i16 and v8i32 shifts by non-uniform constants are lowered into a
+ // sequence of extract + two vector multiply + insert.
+ if (ISD == ISD::SHL && (VT == MVT::v8i32 || VT == MVT::v16i16) &&
+ Op2Info == TargetTransformInfo::OK_NonUniformConstantValue)
+ ISD = ISD::MUL;
+
int Idx = CostTableLookup(AVX1CostTable, ISD, LT.second);
if (Idx != -1)
return LT.first * AVX1CostTable[Idx].Cost;
@@ -343,7 +374,7 @@
// 2x pmuludq, 2x shuffle.
if (ISD == ISD::MUL && LT.second == MVT::v4i32 && ST->hasSSE2() &&
!ST->hasSSE41())
- return 6;
+ return LT.first * 6;
// Fallback to the default implementation.
return TargetTransformInfo::getArithmeticInstrCost(Opcode, Ty, Op1Info,
Index: test/Analysis/CostModel/X86/vshift-cost.ll
===================================================================
--- test/Analysis/CostModel/X86/vshift-cost.ll (revision 0)
+++ test/Analysis/CostModel/X86/vshift-cost.ll (revision 0)
@@ -0,0 +1,167 @@
+; RUN: opt < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7 -mattr=+sse2,-sse4.1 -cost-model -analyze | FileCheck %s -check-prefix=CHECK -check-prefix=SSE2
+; RUN: opt < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7 -cost-model -analyze | FileCheck %s -check-prefix=CHECK -check-prefix=SSE41
+; RUN: opt < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7-avx -cost-model -analyze | FileCheck %s -check-prefix=CHECK -check-prefix=AVX
+; RUN: opt < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=core-avx2 -cost-model -analyze | FileCheck %s -check-prefix=CHECK -check-prefix=AVX2
+
+
+; Verify the cost of vector shift left instructions.
+
+; We always emit a single pmullw in the case of v8i16 vector shifts by
+; non-uniform constant.
+
+define <8 x i16> @test1(<8 x i16> %a) {
+ %shl = shl <8 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+ ret <8 x i16> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test1':
+; CHECK: Found an estimated cost of 1 for instruction: %shl
+
+
+define <8 x i16> @test2(<8 x i16> %a) {
+ %shl = shl <8 x i16> %a, <i16 0, i16 undef, i16 0, i16 0, i16 1, i16 undef, i16 -1, i16 1>
+ ret <8 x i16> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test2':
+; CHECK: Found an estimated cost of 1 for instruction: %shl
+
+
+; With SSE4.1, v4i32 shifts can be lowered into a single pmulld instruction.
+; Make sure that the estimated cost is always 1 except for the case where
+; we only have SSE2 support. With SSE2, we are forced to special lower the
+; v4i32 mul as a 2x shuffle, 2x pmuludq, 2x shuffle.
+
+define <4 x i32> @test3(<4 x i32> %a) {
+ %shl = shl <4 x i32> %a, <i32 1, i32 -1, i32 2, i32 -3>
+ ret <4 x i32> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test3':
+; SSE2: Found an estimated cost of 6 for instruction: %shl
+; SSE41: Found an estimated cost of 1 for instruction: %shl
+; AVX: Found an estimated cost of 1 for instruction: %shl
+; AVX2: Found an estimated cost of 1 for instruction: %shl
+
+
+define <4 x i32> @test4(<4 x i32> %a) {
+ %shl = shl <4 x i32> %a, <i32 0, i32 0, i32 1, i32 1>
+ ret <4 x i32> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test4':
+; SSE2: Found an estimated cost of 6 for instruction: %shl
+; SSE41: Found an estimated cost of 1 for instruction: %shl
+; AVX: Found an estimated cost of 1 for instruction: %shl
+; AVX2: Found an estimated cost of 1 for instruction: %shl
+
+
+; On AVX2 we are able to lower the following shift into a single
+; vpsllvq. Therefore, the expected cost is only 1.
+; In all other cases, this shift is scalarized as the target does not support
+; vpsllv instructions.
+
+define <2 x i64> @test5(<2 x i64> %a) {
+ %shl = shl <2 x i64> %a, <i64 2, i64 3>
+ ret <2 x i64> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test5':
+; SSE2: Found an estimated cost of 20 for instruction: %shl
+; SSE41: Found an estimated cost of 20 for instruction: %shl
+; AVX: Found an estimated cost of 20 for instruction: %shl
+; AVX2: Found an estimated cost of 1 for instruction: %shl
+
+
+; v16i16 and v8i32 shift left by non-uniform constant are lowered into
+; vector multiply instructions. With AVX (but not AVX2), the vector multiply
+; is lowered into a sequence of: 1 extract + 2 vpmullw + 1 insert.
+;
+; With AVX2, instruction vpmullw works with 256bit quantities and
+; therefore there is no need to split the resulting vector multiply into
+; a sequence of two multiply.
+;
+; With SSE2 and SSE4.1, the vector shift cost for 'test6' is twice
+; the cost computed in the case of 'test1'. That is because the backend
+; simply emits 2 pmullw with no extract/insert.
+
+
+define <16 x i16> @test6(<16 x i16> %a) {
+ %shl = shl <16 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+ ret <16 x i16> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test6':
+; SSE2: Found an estimated cost of 2 for instruction: %shl
+; SSE41: Found an estimated cost of 2 for instruction: %shl
+; AVX: Found an estimated cost of 4 for instruction: %shl
+; AVX2: Found an estimated cost of 1 for instruction: %shl
+
+
+; With SSE2 and SSE4.1, the vector shift cost for 'test7' is twice
+; the cost computed in the case of 'test3'. That is because the multiply
+; is type-legalized into two 4i32 vector multiply.
+
+define <8 x i32> @test7(<8 x i32> %a) {
+ %shl = shl <8 x i32> %a, <i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3>
+ ret <8 x i32> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test7':
+; SSE2: Found an estimated cost of 12 for instruction: %shl
+; SSE41: Found an estimated cost of 2 for instruction: %shl
+; AVX: Found an estimated cost of 4 for instruction: %shl
+; AVX2: Found an estimated cost of 1 for instruction: %shl
+
+
+; On AVX2 we are able to lower the following shift into a single
+; vpsllvq. Therefore, the expected cost is only 1.
+; In all other cases, this shift is scalarized as the target does not support
+; vpsllv instructions.
+
+define <4 x i64> @test8(<4 x i64> %a) {
+ %shl = shl <4 x i64> %a, <i64 1, i64 2, i64 3, i64 4>
+ ret <4 x i64> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test8':
+; SSE2: Found an estimated cost of 40 for instruction: %shl
+; SSE41: Found an estimated cost of 40 for instruction: %shl
+; AVX: Found an estimated cost of 40 for instruction: %shl
+; AVX2: Found an estimated cost of 1 for instruction: %shl
+
+
+; Same as 'test6', with the difference that the cost is double.
+
+define <32 x i16> @test9(<32 x i16> %a) {
+ %shl = shl <32 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+ ret <32 x i16> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test9':
+; SSE2: Found an estimated cost of 4 for instruction: %shl
+; SSE41: Found an estimated cost of 4 for instruction: %shl
+; AVX: Found an estimated cost of 8 for instruction: %shl
+; AVX2: Found an estimated cost of 2 for instruction: %shl
+
+
+; Same as 'test7', except that now the cost is double.
+
+define <16 x i32> @test10(<16 x i32> %a) {
+ %shl = shl <16 x i32> %a, <i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3>
+ ret <16 x i32> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test10':
+; SSE2: Found an estimated cost of 24 for instruction: %shl
+; SSE41: Found an estimated cost of 4 for instruction: %shl
+; AVX: Found an estimated cost of 8 for instruction: %shl
+; AVX2: Found an estimated cost of 2 for instruction: %shl
+
+
+; On AVX2 we are able to lower the following shift into a sequence of
+; two vpsllvq instructions. Therefore, the expected cost is only 2.
+; In all other cases, this shift is scalarized as we don't have vpsllv
+; instructions.
+
+define <8 x i64> @test11(<8 x i64> %a) {
+ %shl = shl <8 x i64> %a, <i64 1, i64 1, i64 2, i64 3, i64 1, i64 1, i64 2, i64 3>
+ ret <8 x i64> %shl
+}
+; CHECK: 'Cost Model Analysis' for function 'test11':
+; SSE2: Found an estimated cost of 80 for instruction: %shl
+; SSE41: Found an estimated cost of 80 for instruction: %shl
+; AVX: Found an estimated cost of 80 for instruction: %shl
+; AVX2: Found an estimated cost of 2 for instruction: %shl
+
+
Index: lib/Transforms/Vectorize/SLPVectorizer.cpp
===================================================================
--- lib/Transforms/Vectorize/SLPVectorizer.cpp (revision 201248)
+++ lib/Transforms/Vectorize/SLPVectorizer.cpp (working copy)
@@ -1044,12 +1044,26 @@
TargetTransformInfo::OperandValueKind Op2VK =
TargetTransformInfo::OK_UniformConstantValue;
- // Check whether all second operands are constant.
- for (unsigned i = 0; i < VL.size(); ++i)
- if (!isa<ConstantInt>(cast<Instruction>(VL[i])->getOperand(1))) {
+ // If all operands are exactly the same ConstantInt then set the
+ // operand kind to OK_UniformConstantValue.
+ // If instead not all operands are constants, then set the operand kind
+ // to OK_AnyValue. If all operands are constants but not the same,
+ // then set the operand kind to OK_NonUniformConstantValue.
+ ConstantInt *CInt = NULL;
+ for (unsigned i = 0; i < VL.size(); ++i) {
+ const Instruction *I = cast<Instruction>(VL[i]);
+ if (!isa<ConstantInt>(I->getOperand(1))) {
Op2VK = TargetTransformInfo::OK_AnyValue;
break;
}
+ if (i == 0) {
+ CInt = cast<ConstantInt>(I->getOperand(1));
+ continue;
+ }
+ if (Op2VK == TargetTransformInfo::OK_UniformConstantValue &&
+ CInt != cast<ConstantInt>(I->getOperand(1)))
+ Op2VK = TargetTransformInfo::OK_NonUniformConstantValue;
+ }
ScalarCost =
VecTy->getNumElements() *
Index: lib/Transforms/Vectorize/BBVectorize.cpp
===================================================================
--- lib/Transforms/Vectorize/BBVectorize.cpp (revision 201248)
+++ lib/Transforms/Vectorize/BBVectorize.cpp (working copy)
@@ -532,7 +532,11 @@
// Returns the cost of the provided instruction using TTI.
// This does not handle loads and stores.
- unsigned getInstrCost(unsigned Opcode, Type *T1, Type *T2) {
+ unsigned getInstrCost(unsigned Opcode, Type *T1, Type *T2,
+ TargetTransformInfo::OperandValueKind Op1VK =
+ TargetTransformInfo::OK_AnyValue,
+ TargetTransformInfo::OperandValueKind Op2VK =
+ TargetTransformInfo::OK_AnyValue) {
switch (Opcode) {
default: break;
case Instruction::GetElementPtr:
@@ -562,7 +566,7 @@
case Instruction::And:
case Instruction::Or:
case Instruction::Xor:
- return TTI->getArithmeticInstrCost(Opcode, T1);
+ return TTI->getArithmeticInstrCost(Opcode, T1, Op1VK, Op2VK);
case Instruction::Select:
case Instruction::ICmp:
case Instruction::FCmp:
@@ -1013,13 +1017,58 @@
unsigned JCost = getInstrCost(J->getOpcode(), JT1, JT2);
Type *VT1 = getVecTypeForPair(IT1, JT1),
*VT2 = getVecTypeForPair(IT2, JT2);
+ TargetTransformInfo::OperandValueKind Op1VK =
+ TargetTransformInfo::OK_AnyValue;
+ TargetTransformInfo::OperandValueKind Op2VK =
+ TargetTransformInfo::OK_AnyValue;
+ // On some targets (example X86) the cost of a vector shift may vary
+ // depending on whether the second operand is a Uniform or
+ // NonUniform Constant.
+ switch (I->getOpcode()) {
+ default : break;
+ case Instruction::Shl:
+ case Instruction::LShr:
+ case Instruction::AShr:
+
+ // If both I and J are scalar shifts by constant, then the
+ // merged vector shift count would be either a constant splat value
+ // or a non-uniform vector of constants.
+ if (ConstantInt *CII = dyn_cast<ConstantInt>(I->getOperand(1))) {
+ if (ConstantInt *CIJ = dyn_cast<ConstantInt>(J->getOperand(1)))
+ Op2VK = CII == CIJ ? TargetTransformInfo::OK_UniformConstantValue :
+ TargetTransformInfo::OK_NonUniformConstantValue;
+ } else {
+ // Check for a splat of a constant or for a non uniform vector
+ // of constants.
+ Value *IOp = I->getOperand(1);
+ Value *JOp = J->getOperand(1);
+ if (ConstantDataVector *CDVI = dyn_cast<ConstantDataVector>(IOp)) {
+ if (ConstantDataVector *CDVJ = dyn_cast<ConstantDataVector>(JOp)) {
+ Op2VK = TargetTransformInfo::OK_NonUniformConstantValue;
+ Constant *SplatValue = CDVI->getSplatValue();
+ if (SplatValue != NULL && SplatValue == CDVJ->getSplatValue())
+ Op2VK = TargetTransformInfo::OK_UniformConstantValue;
+ }
+ }
+
+ if (ConstantVector *CVI = dyn_cast<ConstantVector>(IOp)) {
+ if (ConstantVector *CVJ = dyn_cast<ConstantVector>(JOp)) {
+ Op2VK = TargetTransformInfo::OK_NonUniformConstantValue;
+ Constant *SplatValue = CVI->getSplatValue();
+ if (SplatValue != NULL && SplatValue == CVJ->getSplatValue())
+ Op2VK = TargetTransformInfo::OK_UniformConstantValue;
+ }
+ }
+ }
+ }
+
// Note that this procedure is incorrect for insert and extract element
// instructions (because combining these often results in a shuffle),
// but this cost is ignored (because insert and extract element
// instructions are assigned a zero depth factor and are not really
// fused in general).
- unsigned VCost = getInstrCost(I->getOpcode(), VT1, VT2);
+ unsigned VCost = getInstrCost(I->getOpcode(), VT1, VT2, Op1VK, Op2VK);
if (VCost > ICost + JCost)
return false;
-------------- next part --------------
Index: lib/Target/X86/X86ISelLowering.cpp
===================================================================
--- lib/Target/X86/X86ISelLowering.cpp (revision 201248)
+++ lib/Target/X86/X86ISelLowering.cpp (working copy)
@@ -13156,6 +13156,39 @@
return Op;
}
+ // If possible, lower this packed shift into a vector multiply instead of
+ // expanding it into a sequence of scalar shifts.
+ // Do this only if the vector shift count is a constant build_vector.
+ if (Op.getOpcode() == ISD::SHL &&
+ (VT == MVT::v8i16 || VT == MVT::v4i32 ||
+ (Subtarget->hasInt256() && VT == MVT::v16i16)) &&
+ ISD::isBuildVectorOfConstantSDNodes(Amt.getNode())) {
+ SmallVector<SDValue, 8> Elts;
+ EVT SVT = VT.getScalarType();
+ unsigned SVTBits = SVT.getSizeInBits();
+ const APInt &One = APInt(SVTBits, 1);
+ unsigned NumElems = VT.getVectorNumElements();
+
+ for (unsigned i=0; i !=NumElems; ++i) {
+ SDValue Op = Amt->getOperand(i);
+ if (Op->getOpcode() == ISD::UNDEF) {
+ Elts.push_back(Op);
+ continue;
+ }
+
+ ConstantSDNode *ND = cast<ConstantSDNode>(Op);
+ const APInt &C = APInt(SVTBits, ND->getAPIntValue().getZExtValue());
+ uint64_t ShAmt = C.getZExtValue();
+ if (ShAmt >= SVTBits) {
+ Elts.push_back(DAG.getUNDEF(SVT));
+ continue;
+ }
+ Elts.push_back(DAG.getConstant(One.shl(ShAmt), SVT));
+ }
+ SDValue BV = DAG.getNode(ISD::BUILD_VECTOR, dl, VT, &Elts[0], NumElems);
+ return DAG.getNode(ISD::MUL, dl, VT, R, BV);
+ }
+
// Lower SHL with variable shift amount.
if (VT == MVT::v4i32 && Op->getOpcode() == ISD::SHL) {
Op = DAG.getNode(ISD::SHL, dl, VT, Amt, DAG.getConstant(23, VT));
Index: test/CodeGen/X86/vec_shift6.ll
===================================================================
--- test/CodeGen/X86/vec_shift6.ll (revision 0)
+++ test/CodeGen/X86/vec_shift6.ll (revision 0)
@@ -0,0 +1,134 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=corei7 | FileCheck %s -check-prefix=CHECK -check-prefix=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=core-avx2 | FileCheck %s -check-prefix=CHECK -check-prefix=AVX2 -check-prefix=AVX2ONLY
+; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mcpu=knl | FileCheck %s -check-prefix=CHECK -check-prefix=AVX2 -check-prefix=AVX512
+
+
+; Verify that we don't scalarize a packed vector shift left of 16-bit
+; signed integers if the amount is a constant build_vector.
+; Check that we produce a SSE2 packed integer multiply (pmullw) instead.
+
+define <8 x i16> @test1(<8 x i16> %a) {
+ %shl = shl <8 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+ ret <8 x i16> %shl
+}
+; CHECK-LABEL: test1
+; CHECK: pmullw
+; CHECK-NEXT: ret
+
+
+define <8 x i16> @test2(<8 x i16> %a) {
+ %shl = shl <8 x i16> %a, <i16 0, i16 undef, i16 0, i16 0, i16 1, i16 undef, i16 -1, i16 1>
+ ret <8 x i16> %shl
+}
+; CHECK-LABEL: test2
+; CHECK: pmullw
+; CHECK-NEXT: ret
+
+
+; Verify that a vector shift left of 32-bit signed integers is simply expanded
+; into a SSE4.1 pmulld (instead of cvttps2dq + pmulld) if the vector of shift
+; counts is a constant build_vector.
+
+define <4 x i32> @test3(<4 x i32> %a) {
+ %shl = shl <4 x i32> %a, <i32 1, i32 -1, i32 2, i32 -3>
+ ret <4 x i32> %shl
+}
+; CHECK-LABEL: test3
+; CHECK-NOT: cvttps2dq
+; SSE: pmulld
+; AVX2: vpsllvd
+; CHECK-NEXT: ret
+
+
+define <4 x i32> @test4(<4 x i32> %a) {
+ %shl = shl <4 x i32> %a, <i32 0, i32 0, i32 1, i32 1>
+ ret <4 x i32> %shl
+}
+; CHECK-LABEL: test4
+; CHECK-NOT: cvttps2dq
+; SSE: pmulld
+; AVX2: vpsllvd
+; CHECK-NEXT: ret
+
+
+; If we have AVX/SSE2 but not AVX2, verify that the following shift is split
+; into two pmullw instructions. With AVX2, the test case below would produce
+; a single vpmullw.
+
+define <16 x i16> @test5(<16 x i16> %a) {
+ %shl = shl <16 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+ ret <16 x i16> %shl
+}
+; CHECK-LABEL: test5
+; SSE: pmullw
+; SSE-NEXT: pmullw
+; AVX2: vpmullw
+; AVX2-NOT: vpmullw
+; CHECK: ret
+
+
+; If we have AVX/SSE4.1 but not AVX2, verify that the following shift is split
+; into two pmulld instructions. With AVX2, the test case below would produce
+; a single vpsllvd instead.
+
+define <8 x i32> @test6(<8 x i32> %a) {
+ %shl = shl <8 x i32> %a, <i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3>
+ ret <8 x i32> %shl
+}
+; CHECK-LABEL: test6
+; SSE: pmulld
+; SSE-NEXT: pmulld
+; AVX2: vpsllvd
+; CHECK: ret
+
+
+; With AVX2 and AVX512, the test case below should produce a sequence of
+; two vpmullw instructions. On SSE2 instead, we split the shift in four
+; parts and then we convert each part into a pmullw.
+
+define <32 x i16> @test7(<32 x i16> %a) {
+ %shl = shl <32 x i16> %a, <i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11, i16 1, i16 1, i16 2, i16 3, i16 7, i16 0, i16 9, i16 11>
+ ret <32 x i16> %shl
+}
+; CHECK-LABEL: test7
+; SSE: pmullw
+; SSE-NEXT: pmullw
+; SSE-NEXT: pmullw
+; SSE-NEXT: pmullw
+; AVX2: vpmullw
+; AVX2-NEXT: vpmullw
+; CHECK: ret
+
+
+; Similar to test7; the difference is that with AVX512 support
+; we only produce a single vpsllvd/vpsllvq instead of a pair of vpsllvd/vpsllvq.
+
+define <16 x i32> @test8(<16 x i32> %a) {
+ %shl = shl <16 x i32> %a, <i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3, i32 1, i32 1, i32 2, i32 3>
+ ret <16 x i32> %shl
+}
+; CHECK-LABEL: test8
+; SSE: pmulld
+; SSE-NEXT: pmulld
+; SSE-NEXT: pmulld
+; SSE-NEXT: pmulld
+; AVX2ONLY: vpsllvd
+; AVX2ONLY-NEXT: vpsllvd
+; AVX512: vpsllvd
+; AVX512-NOT: vpsllvd
+; CHECK: ret
+
+
+; The shift from 'test9' gets scalarized if we don't have AVX2/AVX512f support.
+
+define <8 x i64> @test9(<8 x i64> %a) {
+ %shl = shl <8 x i64> %a, <i64 1, i64 1, i64 2, i64 3, i64 1, i64 1, i64 2, i64 3>
+ ret <8 x i64> %shl
+}
+; CHECK-LABEL: test9
+; AVX2ONLY: vpsllvq
+; AVX2ONLY-NEXT: vpsllvq
+; AVX512: vpsllvq
+; AVX512-NOT: vpsllvq
+; CHECK: ret
+
Index: test/CodeGen/X86/avx-shift.ll
===================================================================
--- test/CodeGen/X86/avx-shift.ll (revision 201248)
+++ test/CodeGen/X86/avx-shift.ll (working copy)
@@ -115,8 +115,8 @@
; PR15141
; CHECK: _vshift13:
; CHECK-NOT: vpsll
-; CHECK: vcvttps2dq
-; CHECK-NEXT: vpmulld
+; CHECK-NOT: vcvttps2dq
+; CHECK: vpmulld
define <4 x i32> @vshift13(<4 x i32> %in) {
%T = shl <4 x i32> %in, <i32 0, i32 1, i32 2, i32 4>
ret <4 x i32> %T