[PATCH][X86] Improve the lowering of packed shifts by constant build_vector on non avx2 machines.

Jim Grosbach grosbach at apple.com
Tue Apr 15 10:06:44 PDT 2014


Hi Andrea,

This looks good to me. In fact, it significantly improves code for several of those test cases on both older (core2) and newer (Haswell) processors. This is all around goodness. Would you mind adding CHECK lines for SSE and AVX2 codegen to the test cases you’ve added? I’d hate to see us regress on these improvements because we get misled by the tests into thinking they’re only relevant to AVX1.

-Jim

On Apr 15, 2014, at 8:03 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> ping x2.
> 
> Thanks,
> Andrea Di Biagio
> 
> On Tue, Apr 8, 2014 at 3:38 PM, Andrea Di Biagio
> <andrea.dibiagio at gmail.com> wrote:
>> ping.
>> 
>> 
>> On Tue, Apr 1, 2014 at 2:49 AM, Andrea Di Biagio
>> <andrea.dibiagio at gmail.com> wrote:
>>> Hi,
>>> 
>>> This patch teaches the backend how to efficiently lower
>>> logical/arithmetic packed shift nodes by constant build_vector on
>>> non-avx2 machines.
>>> 
>>> The x86 backend already knows how to efficiently lower a packed shift
>>> left by a constant build_vector into a vector multiply (instead of
>>> lowering it into a long sequence of scalar shifts).
>>> However, nothing is currently done in the case of other shifts by
>>> constant build_vector.
>>> 
>>> This patch teaches the backend how to lower v4i32 and v8i16 shifts
>>> according to the following rules:
>>> 
>>> 1.   VSHIFT (v4i32 A), (build_vector <X,Y,Y,Y>)  -->
>>>        MOVSS (VSHIFT A, (build_vector <Y,Y,Y,Y>)),
>>>              (VSHIFT A, (build_vector <X,X,X,X>))
>>> 2.   VSHIFT (v4i32 A), (build_vector <X,X,Y,Y>)  -->
>>>        (bitcast (MOVSD (bitcast (VSHIFT A, (build_vector <Y,Y,Y,Y>)), v2i64),
>>>                        (bitcast (VSHIFT A, (build_vector <X,X,X,X>)), v2i64)), v4i32)
>>> 3.   VSHIFT (v8i16 A), (build_vector <X,X,Y,Y,Y,Y,Y,Y>)  -->
>>>        (bitcast (MOVSS (bitcast (VSHIFT A, (build_vector <Y,Y,Y,Y,Y,Y,Y,Y>)), v4i32),
>>>                        (bitcast (VSHIFT A, (build_vector <X,X,X,X,X,X,X,X>)), v4i32)), v8i16)
>>> 4.   VSHIFT (v8i16 A), (build_vector <X,X,X,X,Y,Y,Y,Y>)  -->
>>>        (bitcast (MOVSD (bitcast (VSHIFT A, (build_vector <Y,Y,Y,Y,Y,Y,Y,Y>)), v2i64),
>>>                        (bitcast (VSHIFT A, (build_vector <X,X,X,X,X,X,X,X>)), v2i64)), v8i16)
>>> 
>>> Basically, instead of scalarizing a vector shift, we try to expand it
>>> into a sequence of two shifts by constant splat followed by a
>>> MOVSS/MOVSD.
>>> 
>>> Consider the following example:
>>> ///--
>>> define <8 x i16> @foo(<8 x i16> %a) {
>>>  %lshr = lshr <8 x i16> %a,
>>>    <i16 3, i16 3, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>
>>>  ret <8 x i16> %lshr
>>> }
>>> ///--
>>> 
>>> Before this patch, for targets with no AVX2 support, the backend
>>> always scalarized the logical packed shift right in function @foo into
>>> a very long sequence of instructions (8 scalar shifts + 8 inserts + 8
>>> extracts).
>>> With this patch, the backend produces only three instructions plus
>>> the return (here is the output with -mcpu=corei7-avx):
>>>   vpsrlw $2, %xmm0, %xmm1
>>>   vpsrlw $3, %xmm0, %xmm0
>>>   vmovss %xmm0, %xmm1, %xmm0
>>>   retq
>>> 
>>> Other examples can be found in the new test test/CodeGen/X86/lower-vec-shift.ll.
>>> 
>>> For now, I decided to cover only the above-mentioned cases.
>>> However, future patches could address even more patterns by taking
>>> advantage of other legal ways to blend two vector shifts.
>>> 
>>> Please let me know if ok to submit.
>>> 
>>> 
>>> Thanks!
>>> Andrea Di Biagio
>>> SN Systems - Sony Computer Entertainment Group
> <patch-lower-vector-shifts.diff>
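
The four rewrite rules quoted above can be sanity-checked with a small executable model. The following is a minimal Python sketch, not the code from the patch: the helper names (lshr, splat_lshr, blend_low, lower_shift) are hypothetical, lanes are plain unsigned integers, and only the logical right shift is modeled. It assumes that MOVSS/MOVSD take the low 32/64 bits from their second operand and the remaining bits from the first, which is how the rules use them.

```python
def lshr(lanes, amounts, bits):
    """Reference semantics: per-lane logical shift right on unsigned lanes."""
    mask = (1 << bits) - 1
    return [(x & mask) >> s for x, s in zip(lanes, amounts)]


def splat_lshr(lanes, amount, bits):
    """Shift every lane by the same constant (a shift by constant splat)."""
    return lshr(lanes, [amount] * len(lanes), bits)


def blend_low(upper_src, low_src, bits, low_bits):
    """Model of MOVSS (low_bits=32) or MOVSD (low_bits=64) on bitcast vectors:
    the low `low_bits` bits come from low_src, the rest from upper_src."""
    n = low_bits // bits
    return low_src[:n] + upper_src[n:]


def lower_shift(lanes, amounts, bits):
    """Apply rules 1-4 if `amounts` has the form <X..X, Y..Y> with the X part
    covering exactly 32 or 64 bits; return None otherwise."""
    n = len(lanes)
    x, y = amounts[0], amounts[-1]
    for low_bits in (32, 64):          # 32 -> MOVSS, 64 -> MOVSD
        k = low_bits // bits           # number of lanes taken from the X shift
        if k < n and amounts == [x] * k + [y] * (n - k):
            shifted_by_y = splat_lshr(lanes, y, bits)
            shifted_by_x = splat_lshr(lanes, x, bits)
            return blend_low(shifted_by_y, shifted_by_x, bits, low_bits)
    return None


# The @foo example: v8i16 shifted by <3,3,2,2,2,2,2,2> (rule 3, MOVSS).
a = [0x8000, 0x4000, 0x00FF, 8, 16, 1024, 0xFFFF, 7]
amounts = [3, 3, 2, 2, 2, 2, 2, 2]
assert lower_shift(a, amounts, 16) == lshr(a, amounts, 16)
```

In this model the two splat shifts followed by one blend give the same lanes as the per-lane shift whenever the shift amounts match one of the four patterns; any other pattern returns None, standing in for whatever lowering the backend would otherwise pick.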




