[PATCH][X86] Teach the backend how to lower vector shift left into multiply rather than scalarizing it.

Nadav Rotem nrotem at apple.com
Tue Feb 11 19:34:26 PST 2014


Hi Andrea, 

You added a new OperandValueKind and connected it to the cost model computation, but you forgot one big piece: the vectorizer!  The vectorizer needs to use the new OperandValueKind in order to estimate the cost of the shifts accurately.  The patches you sent LGTM, but I would also like the vectorizer to use the new operand value kind.  :)

Thanks,
Nadav



On Feb 11, 2014, at 12:29 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> Hi Nadav and Jim,
> 
> Sorry for sending another version of the patch, but I think I found how
> to properly fix/improve the cost model.
> For simplicity I have split the patch into two: a patch that
> introduces the new lowering rule for vector shifts and a patch that
> improves the x86 cost model.
> 
> Test vec_shift6.ll should now cover all the cases where vector shifts
> are expanded into multiply instructions.
> 
> I introduced into the cost model a new 'OperandValueKind' to identify
> operands which are constants but not constant splats.
> I called it 'OK_NonUniformConstValue' to distinguish it from
> 'OK_UniformConstantValue' which is used for splat values.
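> 
> For illustration only (these are made-up values, not the actual contents
> of the new tests), the distinction is between a splat shift count, which
> maps to 'OK_UniformConstantValue', and a non-splat constant shift count,
> which maps to the new 'OK_NonUniformConstValue':
> 
>   define <8 x i16> @shl_splat(<8 x i16> %a) {
>     %r = shl <8 x i16> %a, <i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>
>     ret <8 x i16> %r
>   }
> 
>   define <8 x i16> @shl_nonsplat(<8 x i16> %a) {
>     %r = shl <8 x i16> %a, <i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 7, i16 8>
>     ret <8 x i16> %r
>   }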
> 
> I modified CostModel.cpp to allow returning 'OK_NonUniformConstValue'
> for non-splat operands that are instances of ConstantVector or
> ConstantDataVector.
> 
> I verified that the cost model still produces the expected results on
> other non-x86 targets (ARM and PPC).
> On the X86 backend, I modified X86TargetTransformInfo to produce the
> 'expected' cost for vector shift left instructions that can be lowered
> as a vector multiply.
> 
> Finally I added a new test to verify that the output of opt
> '-cost-model -analyze' is valid in the following configurations: SSE2,
> SSE4.1, AVX, AVX2.
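> 
> The RUN lines look roughly like this (a sketch only; the exact triple,
> attributes and check prefixes may differ in the actual test):
> 
>   ; RUN: opt < %s -cost-model -analyze -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2 | FileCheck %s -check-prefix=SSE2
>   ; RUN: opt < %s -cost-model -analyze -mtriple=x86_64-unknown-linux-gnu -mattr=+sse4.1 | FileCheck %s -check-prefix=SSE41
>   ; RUN: opt < %s -cost-model -analyze -mtriple=x86_64-unknown-linux-gnu -mattr=+avx | FileCheck %s -check-prefix=AVX
>   ; RUN: opt < %s -cost-model -analyze -mtriple=x86_64-unknown-linux-gnu -mattr=+avx2 | FileCheck %s -check-prefix=AVX2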
> 
> I didn't add a cost model for AVX512f, but I think (if it is ok for
> you) that this can be done as a separate patch.
> 
> Please let me know what you think.
> 
> Thanks,
> Andrea
> 
> On Fri, Feb 7, 2014 at 8:51 PM, Andrea Di Biagio
> <andrea.dibiagio at gmail.com> wrote:
>> Hi Jim and Nadav,
>> 
>> here is a new version of the patch.
>> 
>> The cost model is able to deal with most of the cases where the second
>> operand of a SHL is of kind OK_UniformConstantValue.
>> As far as I can see, the only missing rule is for the case where we
>> have a v16i16 SHL in AVX2.
>> I added a rule in X86TTI::getArithmeticInstrCost to update the cost
>> for that specific case.
>> 
>> One of the problems I have encountered is that the method 'getOperandInfo'
>> in CostModel.cpp returns either OK_AnyValue or OK_UniformConstantValue (I
>> didn't find any case where we return OK_UniformValue...).
>> All the shifts optimized by my patch are shifts where the second
>> operand is a ConstantDataVector but not a splat. Therefore, function
>> getOperandInfo in CostModel.cpp will always return OK_AnyValue for
>> them.
>> 
>> In conclusion, I am not sure if I have modified that part correctly.
>> Please let me know if you had something else in mind and, if so, what
>> the correct way to improve that part would be.
>> 
>> I investigated other opportunities for applying this transformation to
>> other vector types.
>> This new patch improves my previous patch since we now know how to
>> expand a packed v16i16 shift left into a single AVX2 vpmullw when the
>> shift amount is a constant build_vector.
>> 
>> Without AVX2 support, a packed v16i16 shift left is usually decomposed
>> into two 128-bit shifts. Each new shift is then properly expanded
>> into a multiply instruction (pmullw) thanks to the new transformation
>> introduced by this patch.
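>> 
>> As a concrete (but made-up) example of the kind of input involved, a
>> v16i16 shift by a non-splat constant build_vector looks like this in IR:
>> 
>>   define <16 x i16> @shl_v16i16(<16 x i16> %a) {
>>     %r = shl <16 x i16> %a, <i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 7, i16 8,
>>                              i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 7, i16 8>
>>     ret <16 x i16> %r
>>   }
>> 
>> With AVX2 this should now become a single vpmullw; without AVX2 it is
>> split into two 128-bit halves, each of which becomes a pmullw.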
>> 
>> The backend already knows how to efficiently lower v8i32 shifts with
>> AVX2: v8i32 shifts are expanded into VPSLLV instructions.
>> Also, the backend already knows how to emit the AVX512f version of
>> VPSLLVD and VPSLLVQ in the case of v16i32 and v8i64 vectors.
>> AVX512f seems to benefit a lot from this new transformation (you can
>> see it if you compare the output of test avx_shift6.ll when run for
>> AVX512 before and after applying my patch).
>> 
>> A vector shift left by a constant v32i16 build_vector is initially split
>> into two v16i16 shifts during type-legalization. Eventually the new
>> algorithm converts the resulting shifts into multiply instructions.
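>> 
>> A minimal (hypothetical) example of that last case:
>> 
>>   define <32 x i16> @shl_v32i16(<32 x i16> %a) {
>>     %r = shl <32 x i16> %a, <i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 7, i16 8,
>>                              i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 7, i16 8,
>>                              i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 7, i16 8,
>>                              i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 7, i16 8>
>>     ret <32 x i16> %r
>>   }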
>> 
>> See new test-cases avx_shift6.ll for more details.
>> 
>> Please let me know what you think.
>> 
>> Thanks,
>> Andrea
>> 
>> On Thu, Feb 6, 2014 at 11:11 PM, Andrea Di Biagio
>> <andrea.dibiagio at gmail.com> wrote:
>>> Hi Jim and Nadav,
>>> 
>>> thanks for the feedback!
>>> 
>>> On Thu, Feb 6, 2014 at 10:33 PM, Nadav Rotem <nrotem at apple.com> wrote:
>>>> Please remember to update the x86 cost model in  X86TTI::getArithmeticInstrCost.  In this function you should be able to check if the LHS is a constant.
>>> 
>>> Ok, I will update the cost model.
>>> 
>>>> On Feb 6, 2014, at 2:26 PM, Jim Grosbach <grosbach at apple.com> wrote:
>>>> 
>>>>> Hi Andrea,
>>>>> 
>>>>> This is a very nice improvement, but it should do a bit more, I believe.
>>>>> 
>>>>> AVX2 adds 256-bit wide vector versions of these instructions, so if AVX2 is available, the same transformation should be applied to v16i16 and v8i32 shifts. Worth looking to see if AVX512 extends these even further.
>>>>> 
>>>>> The test cases should check that when compiling for AVX, the VEX-prefixed forms of the instructions are generated instead of the SSE versions.
>>>>> 
>>> 
>>> Sure, I will send a new version of the patch that also improves AVX2
>>> code generation in the case of v16i16 and v8i32. I'll also have a look
>>> at AVX512 to identify cases that can also be improved.
>>> 
>>> In my previous email I forgot to mention that this patch would also
>>> fix bug 18478 (http://llvm.org/bugs/show_bug.cgi?id=18478)
>>> 
>>> Thanks again for your time.
>>> Andrea
>>> 
>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> This patch teaches the backend how to efficiently lower a packed
>>>>>> vector shift left into a packed vector multiply if the vector of shift
>>>>>> counts is known to be constant (i.e. a constant build_vector).
>>>>>> 
>>>>>> Instead of expanding a packed shift into a sequence of scalar shifts,
>>>>>> the backend should try (when possible) to convert the vector shift
>>>>>> into a vector multiply.
>>>>>> 
>>>>>> Before this patch, a shift of a MVT::v8i16 vector by a build_vector of
>>>>>> constants was always scalarized into a long sequence of "vector
>>>>>> extracts + scalar shifts + vector insert".
>>>>>> With this patch, if there is SSE2 support, we emit a single vector multiply.
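>>>>>> 
>>>>>> To illustrate with made-up constants: a shift such as
>>>>>> 
>>>>>>   %r = shl <8 x i16> %a, <i16 1, i16 2, i16 3, i16 4, i16 5, i16 6, i16 7, i16 8>
>>>>>> 
>>>>>> is equivalent to a multiply by the corresponding powers of two
>>>>>> 
>>>>>>   %r = mul <8 x i16> %a, <i16 2, i16 4, i16 8, i16 16, i16 32, i16 64, i16 128, i16 256>
>>>>>> 
>>>>>> which maps onto a single pmullw with SSE2.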
>>>>>> 
>>>>>> The new x86 test 'vec_shift6.ll' contains some examples of code that
>>>>>> are affected by this patch.
>>>>>> 
>>>>>> Please let me know if ok to submit.
>>>>>> 
>>>>>> Thanks,
>>>>>> Andrea Di Biagio
>>>>>> SN Systems - Sony Computer Entertainment Group
>>>>>> <patch.diff>_______________________________________________
>>>>>> llvm-commits mailing list
>>>>>> llvm-commits at cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>> 
>>>> 
> <patch-cost-model.diff><patch-v2.diff>



