[PATCH][X86] Teach the backend how to lower vector shift left into multiply rather than scalarizing it.

Nadav Rotem nrotem at apple.com
Wed Feb 12 12:53:27 PST 2014


Almost there!  You forgot the LoopVectorizer, which should be easy to fix.  

Thanks,
Nadav


On Feb 12, 2014, at 12:04 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> Hi Nadav,
> 
> This new patch should fix the Vectorizer as well.
> 
> Differences are:
> - method 'getEntryCost' in the SLPVectorizer is now able to correctly
> identify uniform and non-uniform constant values.
> - method 'getInstrCost' in BBVectorize now accepts two extra
> operands of type OperandValueKind.
> - method 'areInstsCompatible' in BBVectorize is now able to correctly
> compute the cost of a merged shift instruction based on the knowledge
> of the operand value kinds.
>  Example 1.
>  given the following pair:
>      %shl1 = shl A, 1
>      %shl2 = shl B, 1
> 
>  The operand value kind for the second operand of the merged shift
> will be equal to OK_UniformConstantValue
> 
>  Example 2.
>  given the following pair:
>      %shl1 = shl A, 1
>      %shl2 = shl B, 2
> 
>  The operand value kind for the second operand of the merged shift
> will be equal to OK_NonUniformConstantValue.
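The uniform/non-uniform distinction in the two examples above boils down to a splat check on the constant shift amounts. A minimal Python sketch (the function name is illustrative, not LLVM's actual API):

```python
def operand_value_kind(shift_amounts):
    """Classify a constant vector of shift amounts, mirroring the
    distinction between OK_UniformConstantValue (a splat) and
    OK_NonUniformConstantValue (constant, but not a splat).
    Illustrative sketch only; not LLVM's real getOperandInfo."""
    if all(a == shift_amounts[0] for a in shift_amounts):
        return "OK_UniformConstantValue"
    return "OK_NonUniformConstantValue"

# Example 1: both shifts use amount 1 -> a splat constant.
print(operand_value_kind([1, 1]))   # OK_UniformConstantValue
# Example 2: amounts 1 and 2 -> constant, but not a splat.
print(operand_value_kind([1, 2]))   # OK_NonUniformConstantValue
```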
> 
> 
> Please let me know if it is OK to submit.
> 
> Thanks :-)
> Andrea
> 
> P.s.: on a slightly different topic.
> While testing the SLP vectorizer, I noticed that method
> 'vectorizeStoreChain' conservatively limits the length of a
> vectorizable chain of instructions based on the value of the global
> variable 'MinVecRegSize' (which is always equal to 128).
> My question is: when looking for profitable vectorizable trees,
> wouldn't it make more sense to set variable VF based (also) on the
> value returned by TTI::getRegisterBitWidth()? I think this could have
> a positive effect if the target supports, for example, 256-bit vectors
> (so, AVX/AVX2 and AVX512f). That said, I am not an expert on the
> SLPVectorizer and therefore I may have misunderstood the behavior of
> that method :-)
> 
> On Wed, Feb 12, 2014 at 3:34 AM, Nadav Rotem <nrotem at apple.com> wrote:
>> Hi Andrea,
>> 
>> You added a new kind of OperandValueKind and connected the cost model computation, but you forgot one big piece: the vectorizer! The vectorizer needs to use the new OperandValueKind in order to efficiently estimate the cost of the shifts. The patches you sent LGTM, but I would also like the vectorizer to use the new operand value kind. :)
>> 
>> Thanks,
>> Nadav
>> 
>> 
>> 
>> On Feb 11, 2014, at 12:29 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>> 
>>> Hi Nadav and Jim,
>>> 
>>> Sorry for sending another version of the patch but I think I found how
>>> to properly fix/improve the cost model.
>>> For simplicity I have split the patch into two: a patch that
>>> introduces the new lowering rule for vector shifts and a patch that
>>> improves the x86 cost model.
>>> 
>>> Test vec_shift6.ll should now cover all the cases where vector shifts
>>> are expanded into multiply instructions.
>>> 
>>> I introduced into the cost model a new 'OperandValueKind' to identify
>>> operands which are constants but not constant splats.
>>> I called it 'OK_NonUniformConstValue' to distinguish it from
>>> 'OK_UniformConstantValue' which is used for splat values.
>>> 
>>> I modified CostModel.cpp to allow returning 'OK_NonUniformConstValue'
>>> for non splat operands that are instances of ConstantVector or
>>> ConstantDataVector.
>>> 
>>> I verified that the cost model still produces the expected results on
>>> other non x86 targets (ARM and PPC).
>>> On the X86 backend, I modified X86TargetTransformInfo to produce the
>>> 'expected' cost for vector shift left instructions that can be lowered
>>> as a vector multiply.
>>> 
>>> Finally I added a new test to verify that the output of opt
>>> '-cost-model -analyze' is valid in the following configurations: SSE2,
>>> SSE4.1, AVX, AVX2.
>>> 
>>> I didn't add a cost model for AVX512f, but I think (if it is ok for
>>> you) that this can be done as a separate patch.
>>> 
>>> Please let me know what you think.
>>> 
>>> Thanks,
>>> Andrea
>>> 
>>> On Fri, Feb 7, 2014 at 8:51 PM, Andrea Di Biagio
>>> <andrea.dibiagio at gmail.com> wrote:
>>>> Hi Jim and Nadav,
>>>> 
>>>> here is a new version of the patch.
>>>> 
>>>> The cost model is able to deal with most of the cases where the second
>>>> operand of a SHL is of kind OK_UniformConstantValue.
>>>> As far as I can see, the only missing rule is for the case where we
>>>> have a v16i16 SHL in AVX2.
>>>> I added a rule in X86TTI::getArithmeticInstrCost to update the cost
>>>> for that specific case.
>>>> 
>>>> One of the problems I have encountered is that method 'getOperandInfo'
>>>> in CostModel either returns OK_AnyValue or OK_UniformConstantValue (I
>>>> didn't find any case where we return OK_UniformValue...).
>>>> All the shifts optimized by my patch are shifts where the second
>>>> operand is a ConstantDataVector but not a splat. Therefore, function
>>>> getOperandInfo in CostModel.cpp will always return OK_AnyValue for
>>>> them.
>>>> 
>>>> In conclusion, I am not sure if I have modified that part correctly.
>>>> Please let me know if you had something else in mind and in case what
>>>> should be the correct way to improve that part.
>>>> 
>>>> I investigated other opportunities for applying this transformation to
>>>> other vector types.
>>>> This new patch improves my previous patch since we now know how to
>>>> expand a packed v16i16 shift left into a single AVX2 vpmullw when the
>>>> shift amount is a constant build_vector.
>>>> 
>>>> Without AVX2 support, a packed v16i16 shift left is usually decomposed
>>>> into two 128-bit shifts. Each new shift is then properly expanded
>>>> into a multiply instruction (pmullw) thanks to the new transformation
>>>> introduced by this patch.
>>>> 
>>>> The backend already knows how to efficiently lower v8i32 shifts with
>>>> AVX2: v8i32 shifts are expanded into VPSLLV instructions.
>>>> Also, the backend already knows how to emit the AVX512f version of
>>>> VPSLLVD and VPSLLVQ in the case of v16i32 and v8i64 vectors.
>>>> AVX512f seems to benefit a lot from this new transformation (you can
>>>> see it if you compare the output of test avx_shift6.ll when run for
>>>> AVX512 before and after applying my patch).
>>>> 
>>>> A vector shift left by a constant v32i16 build_vector is initially split
>>>> into two v16i16 shifts during type legalization. Eventually the new
>>>> algorithm converts the resulting shifts into multiply instructions.
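The legalization pipeline described above (split the wide shift into halves, then lower each half as a multiply) can be modeled with a short Python sketch. The function name and lane math are illustrative assumptions, not the backend's actual code; i16 lanes wrap mod 2^16:

```python
def shl_by_const_vector(xs, amts, lanes_per_half=16):
    """Model of the described legalization: a wide shift by a constant
    build_vector is split into halves (e.g. v32i16 -> two v16i16), and
    each half is lowered as an elementwise multiply by (1 << amount).
    Sketch only; real type legalization works on SelectionDAG nodes."""
    assert len(xs) == len(amts)
    halves = []
    for i in range(0, len(xs), lanes_per_half):
        x_half = xs[i:i + lanes_per_half]
        a_half = amts[i:i + lanes_per_half]
        # shl x, c  ==  x * (1 << c)  in each lane, mod 2^16 for i16
        halves.append([(x * (1 << c)) & 0xFFFF
                       for x, c in zip(x_half, a_half)])
    return [lane for half in halves for lane in half]
```

Each half's multiply corresponds to one pmullw (SSE2) or one vpmullw (AVX2) on the real hardware.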
>>>> 
>>>> See new test-cases avx_shift6.ll for more details.
>>>> 
>>>> Please let me know what you think.
>>>> 
>>>> Thanks,
>>>> Andrea
>>>> 
>>>> On Thu, Feb 6, 2014 at 11:11 PM, Andrea Di Biagio
>>>> <andrea.dibiagio at gmail.com> wrote:
>>>>> Hi Jim and Nadav,
>>>>> 
>>>>> thanks for the feedback!
>>>>> 
>>>>> On Thu, Feb 6, 2014 at 10:33 PM, Nadav Rotem <nrotem at apple.com> wrote:
>>>>>> Please remember to update the x86 cost model in  X86TTI::getArithmeticInstrCost.  In this function you should be able to check if the LHS is a constant.
>>>>> 
>>>>> Ok, I will update the cost model.
>>>>> 
>>>>>> On Feb 6, 2014, at 2:26 PM, Jim Grosbach <grosbach at apple.com> wrote:
>>>>>> 
>>>>>>> Hi Andrea,
>>>>>>> 
>>>>>>> This is a very nice improvement, but it should do a bit more, I believe.
>>>>>>> 
>>>>>>> AVX2 adds 256-bit wide vector versions of these instructions, so if AVX2 is available, the same transformation should be applied to v16i16 and v8i32 shifts. Worth looking to see if AVX512 extends these as well.
>>>>>>> 
>>>>>>> The test cases should check that when compiling for AVX, the VEX-prefixed forms of the instructions are generated instead of the SSE versions.
>>>>>>> 
>>>>> 
>>>>> Sure, I will send a new version of the patch that also improves AVX2
>>>>> code generation in the case of v16i16 and v8i32. I'll also have a look
>>>>> at AVX512 to identify cases that can also be improved.
>>>>> 
>>>>> On my previous email I forgot to mention that this patch would also
>>>>> fix bug 18478 (http://llvm.org/bugs/show_bug.cgi?id=18478)
>>>>> 
>>>>> Thanks again for your time.
>>>>> Andrea
>>>>> 
>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> This patch teaches the backend how to efficiently lower a packed
>>>>>>>> vector shift left into a packed vector multiply if the vector of shift
>>>>>>>> counts is known to be constant (i.e. a constant build_vector).
>>>>>>>> 
>>>>>>>> Instead of expanding a packed shift into a sequence of scalar shifts,
>>>>>>>> the backend should try (when possible) to convert the vector shift
>>>>>>>> into a vector multiply.
>>>>>>>> 
>>>>>>>> Before this patch, a shift of a MVT::v8i16 vector by a build_vector of
>>>>>>>> constants was always scalarized into a long sequence of "vector
>>>>>>>> extracts + scalar shifts + vector insert".
>>>>>>>> With this patch, if there is SSE2 support, we emit a single vector multiply.
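The identity behind the transformation is that a lane-wise shift left by a known constant equals a lane-wise multiply by a power of two, so a v8i16 shift by a constant build_vector maps to one pmullw. A minimal sketch under assumed i16 lane semantics (names are illustrative):

```python
def shl_v8i16(x, counts):
    """shl <8 x i16> x, counts  ==  mul <8 x i16> x, (1 << counts)
    when 'counts' is a constant build_vector: build the multiplier
    vector at compile time, then do one elementwise multiply.
    Lane arithmetic wraps mod 2^16, as for i16."""
    multiplier = [(1 << c) & 0xFFFF for c in counts]  # known at compile time
    return [(a * m) & 0xFFFF for a, m in zip(x, multiplier)]
```

On SSE2 the elementwise multiply is a single pmullw, replacing the long extract/shift/insert sequence that scalarization would produce.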
>>>>>>>> 
>>>>>>>> The new x86 test 'vec_shift6.ll' contains some examples of code that
>>>>>>>> are affected by this patch.
>>>>>>>> 
>>>>>>>> Please let me know if it is OK to submit.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Andrea Di Biagio
>>>>>>>> SN Systems - Sony Computer Entertainment Group
>>>>>>>> <patch.diff>_______________________________________________
>>>>>>>> llvm-commits mailing list
>>>>>>>> llvm-commits at cs.uiuc.edu
>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>>>> 
>>>>>> 
>>> <patch-cost-model.diff><patch-v2.diff>
>> 
> <patch-cost-model-v2.diff><patch-v2.diff>
