[PATCH][X86] Improve the lowering of packed shifts by constant build_vector on non-AVX2 machines.

Andrea Di Biagio andrea.dibiagio at gmail.com
Tue Apr 15 12:40:10 PDT 2014


Thanks for the reviews.
Committed at revision 206316.

On Tue, Apr 15, 2014 at 7:18 PM, Nadav Rotem <nrotem at apple.com> wrote:
> LGTM. Thanks Jim and Andrea.
>
> On Apr 15, 2014, at 11:14 AM, Jim Grosbach <grosbach at apple.com> wrote:
>
>> LGTM. Nadav, do you have additional feedback?
>>
>> -Jim
>>
>> On Apr 15, 2014, at 11:05 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>
>>> Hi Jim,
>>>
>>> Here is a new version of the patch.
>>> I added two more RUN lines to specifically test 'core2' (SSE) and
>>> 'core-avx2' (AVX2).
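>>>
>>> Roughly, the new RUN lines follow the usual llc + FileCheck pattern
>>> (the triple and check prefixes shown here are just for illustration):
>>>
>>> ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=core2 | FileCheck %s --check-prefix=SSE
>>> ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=core-avx2 | FileCheck %s --check-prefix=AVX2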
>>>
>>> Please let me know if it is OK to submit.
>>>
>>> Thanks!
>>> Andrea
>>>
>>> On Tue, Apr 15, 2014 at 6:12 PM, Andrea Di Biagio
>>> <andrea.dibiagio at gmail.com> wrote:
>>>> Hi Jim,
>>>> thanks for the review!
>>>>
>>>> On Tue, Apr 15, 2014 at 6:06 PM, Jim Grosbach <grosbach at apple.com> wrote:
>>>>> Hi Andrea,
>>>>>
>>>>> This looks good to me. In fact, it significantly improves code for several of those test cases for both older (core2) and newer (Haswell) processors as well. This is all around goodness. Would you mind adding CHECK lines for SSE and AVX2 codegen to the test cases you’ve added? I’d hate to see us regress on these improvements because we get misled by the tests into thinking they’re only relevant to AVX1.
>>>>
>>>> Sure, no problem. I am going to add more RUN lines to check SSE and AVX2 as well.
>>>> I will upload a new patch soon.
>>>>
>>>> Cheers,
>>>> Andrea
>>>>
>>>>>
>>>>> -Jim
>>>>>
>>>>> On Apr 15, 2014, at 8:03 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>>>>
>>>>>> ping x2.
>>>>>>
>>>>>> Thanks,
>>>>>> Andrea Di Biagio
>>>>>>
>>>>>> On Tue, Apr 8, 2014 at 3:38 PM, Andrea Di Biagio
>>>>>> <andrea.dibiagio at gmail.com> wrote:
>>>>>>> ping.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 1, 2014 at 2:49 AM, Andrea Di Biagio
>>>>>>> <andrea.dibiagio at gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This patch teaches the backend how to efficiently lower
>>>>>>>> logical/arithmetic packed shift nodes by a constant build_vector on
>>>>>>>> non-AVX2 machines.
>>>>>>>>
>>>>>>>> The x86 backend already knows how to efficiently lower a packed shift
>>>>>>>> left by a constant build_vector into a vector multiply (instead of
>>>>>>>> lowering it into a long sequence of scalar shifts). However, nothing
>>>>>>>> is currently done for the other packed shifts (logical/arithmetic
>>>>>>>> right shifts) by a constant build_vector.
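>>>>>>>>
>>>>>>>> As a quick reminder of that existing behaviour, a shift left like the
>>>>>>>> one below (the function name is only for illustration) is already
>>>>>>>> turned into a vector multiply by <2, 4, 8, 16> rather than being
>>>>>>>> scalarized:
>>>>>>>>
>>>>>>>> ///--
>>>>>>>> define <4 x i32> @shl_by_consts(<4 x i32> %a) {
>>>>>>>>   %shl = shl <4 x i32> %a, <i32 1, i32 2, i32 3, i32 4>
>>>>>>>>   ret <4 x i32> %shl
>>>>>>>> }
>>>>>>>> ///--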
>>>>>>>>
>>>>>>>> This patch teaches the backend how to lower v4i32 and v8i16 shifts
>>>>>>>> according to the following rules:
>>>>>>>>
>>>>>>>> 1. VSHIFT (v4i32 A), (build_vector <X,Y,Y,Y>)
>>>>>>>>      --> MOVSS (VSHIFT A, (build_vector <Y,Y,Y,Y>)),
>>>>>>>>                (VSHIFT A, (build_vector <X,X,X,X>))
>>>>>>>> 2. VSHIFT (v4i32 A), (build_vector <X,X,Y,Y>)
>>>>>>>>      --> (bitcast (MOVSD (bitcast (VSHIFT A, (build_vector <Y,Y,Y,Y>)), v2i64),
>>>>>>>>                          (bitcast (VSHIFT A, (build_vector <X,X,X,X>)), v2i64)), v4i32)
>>>>>>>> 3. VSHIFT (v8i16 A), (build_vector <X,X,Y,Y,Y,Y,Y,Y>)
>>>>>>>>      --> (bitcast (MOVSS (bitcast (VSHIFT A, (build_vector <Y,Y,Y,Y,Y,Y,Y,Y>)), v4i32),
>>>>>>>>                          (bitcast (VSHIFT A, (build_vector <X,X,X,X,X,X,X,X>)), v4i32)), v8i16)
>>>>>>>> 4. VSHIFT (v8i16 A), (build_vector <X,X,X,X,Y,Y,Y,Y>)
>>>>>>>>      --> (bitcast (MOVSD (bitcast (VSHIFT A, (build_vector <Y,Y,Y,Y,Y,Y,Y,Y>)), v2i64),
>>>>>>>>                          (bitcast (VSHIFT A, (build_vector <X,X,X,X,X,X,X,X>)), v2i64)), v8i16)
>>>>>>>>
>>>>>>>> Basically, instead of scalarizing a vector shift, we try to expand it
>>>>>>>> into a sequence of two shifts by a constant splat followed by a
>>>>>>>> MOVSS/MOVSD blend.
>>>>>>>>
>>>>>>>> Consider the following example:
>>>>>>>> ///--
>>>>>>>> define <8 x i16> @foo(<8 x i16> %a) {
>>>>>>>>   %lshr = lshr <8 x i16> %a, <i16 3, i16 3, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>
>>>>>>>>   ret <8 x i16> %lshr
>>>>>>>> }
>>>>>>>> ///--
>>>>>>>>
>>>>>>>> Before this patch, for targets without AVX2 support, the backend
>>>>>>>> always scalarized the logical packed shift right in function @foo into
>>>>>>>> a very long sequence of instructions (8 scalar shifts + 8 inserts + 8
>>>>>>>> extracts).
>>>>>>>> With this patch, the backend produces only three instructions (here is
>>>>>>>> the output with -mcpu=corei7-avx):
>>>>>>>> vpsrlw $2, %xmm0, %xmm1
>>>>>>>> vpsrlw $3, %xmm0, %xmm0
>>>>>>>> vmovss %xmm0, %xmm1, %xmm0
>>>>>>>> retq
>>>>>>>>
>>>>>>>> Other examples can be found in the new test file test/CodeGen/X86/lower-vec-shift.ll.
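>>>>>>>>
>>>>>>>> For instance, per rule 1 above, a v4i32 arithmetic shift like the
>>>>>>>> following (an illustrative function, not copied verbatim from the
>>>>>>>> test file):
>>>>>>>>
>>>>>>>> ///--
>>>>>>>> define <4 x i32> @bar(<4 x i32> %a) {
>>>>>>>>   %ashr = ashr <4 x i32> %a, <i32 3, i32 2, i32 2, i32 2>
>>>>>>>>   ret <4 x i32> %ashr
>>>>>>>> }
>>>>>>>> ///--
>>>>>>>>
>>>>>>>> is expected to lower into two 'psrad' by immediate plus a movss-style
>>>>>>>> blend, rather than four scalar shifts with inserts and extracts.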
>>>>>>>>
>>>>>>>> For now, I decided to cover only the above-mentioned cases.
>>>>>>>> However, future patches could address even more patterns by taking
>>>>>>>> advantage of other legal ways to blend two vector shifts.
>>>>>>>>
>>>>>>>> Please let me know if it is OK to submit.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Andrea Di Biagio
>>>>>>>> SN Systems - Sony Computer Entertainment Group
>>>>>> <patch-lower-vector-shifts.diff>
>>>>>
>>> <patch-lower-vector-shifts.diff>
>>
>



