[llvm] r243395 - [X86][SSE] Use bitmasks instead of shuffles where possible.

Tue Jul 28 03:13:57 PDT 2015

Nice patch Simon!

Please see my comment inline.

On Tue, Jul 28, 2015 at 9:54 AM, Simon Pilgrim <llvm-dev at redking.me.uk>
wrote:

> Author: rksimon
> Date: Tue Jul 28 03:54:41 2015
> New Revision: 243395
>
> URL: http://llvm.org/viewvc/llvm-project?rev=243395&view=rev
> Log:
> [X86][SSE] Use bitmasks instead of shuffles where possible.
>
> VPAND is a lot faster than VPSHUFB and VPBLENDVB - this patch ensures we
> attempt to lower to a basic bitmask before lowering to the slower byte
> shuffle/blend instructions.
>
> Split off from D11518.
>
> Differential Revision: http://reviews.llvm.org/D11541
>
> Modified:
>     llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
>     llvm/trunk/test/CodeGen/X86/vector-shuffle-128-v16.ll
>     llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v32.ll
>     llvm/trunk/test/CodeGen/X86/vector-zext.ll
>
> <snip>

> Modified: llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v32.ll
> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v32.ll?rev=243395&r1=243394&r2=243395&view=diff
>
> ==============================================================================
> --- llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v32.ll (original)
> +++ llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v32.ll Tue Jul 28 03:54:41 2015
> @@ -951,17 +951,15 @@ define <32 x i8> @shuffle_v32i8_zz_01_zz
>  ; AVX1-LABEL: shuffle_v32i8_zz_01_zz_03_zz_05_zz_07_zz_09_zz_11_zz_13_zz_15_zz_17_zz_19_zz_21_zz_23_zz_25_zz_27_zz_29_zz_31:
>  ; AVX1:       # BB#0:
>  ; AVX1-NEXT:    vextractf128 $1, %ymm0, %xmm1
> -; AVX1-NEXT:    vmovdqa {{.*#+}} xmm2 = [128,1,128,3,128,5,128,7,128,9,128,11,128,13,128,15]
> -; AVX1-NEXT:    vpshufb %xmm2, %xmm1, %xmm1
> -; AVX1-NEXT:    vpshufb %xmm2, %xmm0, %xmm0
> +; AVX1-NEXT:    vmovaps {{.*#+}} xmm2 = [0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255]
> +; AVX1-NEXT:    vandps %xmm2, %xmm1, %xmm1
> +; AVX1-NEXT:    vandps %xmm2, %xmm0, %xmm0
>  ; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
>  ; AVX1-NEXT:    retq
>  ;
>

In this test, we end up with a long sequence of instructions on AVX1
because the 32-byte shuffle is expanded into two 16-byte shuffles plus an
insert_subvector (which eventually becomes a vinsertf128). However, on
AVX1, extract/insert subvector nodes from/to 32-byte vectors can only be
lowered to vextractf128/vinsertf128 instructions, which belong to the
floating-point domain. So (if my understanding is correct) there is no way
on AVX1 to keep the computation in the integer domain.

Ideally, on AVX1 we should emit a single 256-bit 'vandps' (as we now do
for AVX2) instead of the
'vmovaps+vextractf128+vandps+vandps+vinsertf128' sequence. What do you think?
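
To make the equivalence concrete, here is a rough sketch of why this
shuffle can be treated as a plain AND over the whole 32-byte vector. It is
only an illustration: the helper names (shuffleAsBitmask, shuffleWithZero,
testMask) are made up for this mail and are not the code in
X86ISelLowering.cpp, and undef mask elements are not handled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical helper, not LLVM's API. Indices [0, N) select from the input
// vector and indices [N, 2N) select from an all-zero vector. Returns the
// equivalent AND mask, or nullopt if any lane would have to move.
template <std::size_t N>
std::optional<std::array<std::uint8_t, N>>
shuffleAsBitmask(const std::array<int, N> &shuffleMask) {
  std::array<std::uint8_t, N> andMask{};
  for (std::size_t i = 0; i < N; ++i) {
    const int idx = shuffleMask[i];
    if (idx >= static_cast<int>(N) && idx < static_cast<int>(2 * N)) {
      andMask[i] = 0x00; // lane is taken from the zero vector
    } else if (idx == static_cast<int>(i)) {
      andMask[i] = 0xFF; // lane stays in place
    } else {
      return std::nullopt; // lane moves, a plain AND cannot express it
    }
  }
  return andMask;
}

// Reference semantics of shuffling `a` against a zero vector.
template <std::size_t N>
std::array<std::uint8_t, N>
shuffleWithZero(const std::array<std::uint8_t, N> &a,
                const std::array<int, N> &shuffleMask) {
  std::array<std::uint8_t, N> out{};
  for (std::size_t i = 0; i < N; ++i) {
    const int idx = shuffleMask[i];
    assert(idx >= 0 && idx < static_cast<int>(2 * N) && "index out of range");
    out[i] = idx < static_cast<int>(N) ? a[static_cast<std::size_t>(idx)] : 0;
  }
  return out;
}

// The shuffle mask from the test above: <32, 1, 34, 3, ..., 62, 31>.
std::array<int, 32> testMask() {
  std::array<int, 32> mask{};
  for (int i = 0; i < 32; ++i)
    mask[static_cast<std::size_t>(i)] = (i % 2 == 0) ? 32 + i : i;
  return mask;
}
```

For the mask in this test, the helper yields a single 32-byte constant of
alternating 0 and 255 bytes, so nothing about the operation itself requires
splitting it into two 16-byte halves.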

>  ; AVX2-LABEL: shuffle_v32i8_zz_01_zz_03_zz_05_zz_07_zz_09_zz_11_zz_13_zz_15_zz_17_zz_19_zz_21_zz_23_zz_25_zz_27_zz_29_zz_31:
>  ; AVX2:       # BB#0:
> -; AVX2-NEXT:    vmovdqa {{.*#+}} ymm1 = [255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0,255,0]
> -; AVX2-NEXT:    vpxor %ymm2, %ymm2, %ymm2
> -; AVX2-NEXT:    vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
> +; AVX2-NEXT:    vandps {{.*}}(%rip), %ymm0, %ymm0
>  ; AVX2-NEXT:    retq
>    %shuffle = shufflevector <32 x i8> %a, <32 x i8> zeroinitializer, <32 x
> i32> <i32 32, i32 1, i32 34, i32 3, i32 36, i32 5, i32 38, i32 7, i32 40,
> i32 9, i32 42, i32 11, i32 44, i32 13, i32 46, i32 15, i32 48, i32 17, i32
> 50, i32 19, i32 52, i32 21, i32 54, i32 23, i32 56, i32 25, i32 58, i32 27,
> i32 60, i32 29, i32 62, i32 31>
>    ret <32 x i8> %shuffle
>
>