[llvm] r217744 - [x86] Teach the new vector shuffle lowering to use BLENDPS and BLENDPD.
Andrea Di Biagio
andrea.dibiagio at gmail.com
Mon Sep 15 03:43:38 PDT 2014
Really nice! Thanks for this patch.
On Mon, Sep 15, 2014 at 12:43 AM, Chandler Carruth <chandlerc at gmail.com> wrote:
> Author: chandlerc
> Date: Sun Sep 14 18:43:33 2014
> New Revision: 217744
>
> URL: http://llvm.org/viewvc/llvm-project?rev=217744&view=rev
> Log:
> [x86] Teach the new vector shuffle lowering to use BLENDPS and BLENDPD.
>
> These are super simple. They even take precedence over crazy
> instructions like INSERTPS because they have very high throughput on
> modern x86 chips.
>
> I still have to teach the integer shuffle variants about this to avoid
> so many domain crossings. However, due to the particular instructions
> available, that's a touch more complex and so a separate patch.
>
> Also, the backend doesn't seem to realize it can commute blend
> instructions by negating the mask. That would help remove a number of
> copies here. Suggestions on how to do this welcome, it's an area I'm
> less familiar with.
I guess you are referring to this test in particular:
 ; SSE41-LABEL: @shuffle_v4f32_4zzz
-; SSE41: insertps {{.*}} # xmm0 = xmm0[0],zero,zero,zero
+; SSE41: xorps %[[X:xmm[0-9]+]], %[[X]]
+; SSE41-NEXT: blendps {{.*}} # [[X]] = xmm0[0],[[X]][1,2,3]
+; SSE41-NEXT: movaps %[[X]], %xmm0
 ; SSE41-NEXT: retq
If we commute the blendps then we can get rid of the extra movaps.
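For reference, the identity itself is simple. Here is a minimal C sketch
(the function names are mine, purely illustrative), assuming the usual
_mm_blend_ps semantics where a set bit i selects lane i of the second
operand:
////
#include <smmintrin.h>  /* SSE4.1: _mm_blend_ps */

/* blend(a, b, m) == blend(b, a, ~m & 0xF): swapping the operands and
   inverting the 4-bit mask yields the same result. */
__m128 blend_forward(__m128 a, __m128 b) {
  return _mm_blend_ps(a, b, 0xE);  /* { a[0], b[1], b[2], b[3] } */
}

__m128 blend_commuted(__m128 a, __m128 b) {
  return _mm_blend_ps(b, a, 0x1);  /* same vector, operands swapped */
}
////
If the backend applied that rewrite here, the blendps could write its
result directly into %xmm0 and the trailing movaps would disappear.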
Also, in this test, I am not sure that a xorps+blendps pair does better
than a single insertps. If it doesn't, we might want to check whether
one of the operands to the blend is an all-zeros vector and prefer a
single insertps over the xorps+blendps combo.
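To make the two alternatives concrete, here is a hedged sketch (the
function names are mine, and the immediates follow my reading of the
encodings, not anything the compiler emits):
////
#include <smmintrin.h>  /* SSE4.1: _mm_blend_ps, _mm_insert_ps */

/* <a[0], 0, 0, 0> via the current xorps+blendps lowering. */
__m128 via_xorps_blendps(__m128 a) {
  __m128 z = _mm_setzero_ps();     /* xorps */
  return _mm_blend_ps(z, a, 0x1);  /* blendps: lane 0 from a */
}

/* The same vector as a single insertps: imm 0x0E copies src lane 0
   into dst lane 0 and zeroes lanes 1-3 via the low-4-bit zero mask. */
__m128 via_insertps(__m128 a) {
  return _mm_insert_ps(a, a, 0x0E);
}
////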
Incidentally, while looking at the new codegen, I noticed something
odd in the generated assembly for some hand-written examples.
Example:
;;;;
define <4 x float> @foo(<4 x float> %A) {
  %vecext = extractelement <4 x float> %A, i32 1
  %vecinit = insertelement <4 x float> <float 0.0, float undef, float undef, float undef>, float %vecext, i32 1
  %vecinit1 = insertelement <4 x float> %vecinit, float 0.0, i32 2
  %vecinit3 = shufflevector <4 x float> %vecinit1, <4 x float> %A, <4 x i32> <i32 0, i32 1, i32 2, i32 7>
  ret <4 x float> %vecinit3
}
;;;;
The above IR is obtained from:
////
#include <xmmintrin.h>

__m128 foo(__m128 A) {
  return (__m128){0.0f, A[1], 0.0f, A[3]};
}
////
The good news is that your new shuffle lowering algorithm definitely
improves the codegen for that example.
The bad news is that we still generate a pshufd+insertps+blendps
sequence, while a single insertps would work in this case:
(with -mcpu=corei7 -x86-experimental-vector-shuffle-lowering)
pshufd $4294967269, %xmm0, %xmm1 # xmm1 = xmm0[1,1,2,3]
insertps $29, %xmm1, %xmm1 # %xmm1 = zero,xmm1[0],zero,zero
blendps $8, %xmm0, %xmm1 # %xmm1 = xmm1[0,1,2],%xmm0[3]
movaps %xmm1, %xmm0
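For comparison, here is the single-insertps alternative as a hedged
intrinsics sketch (the 0x55 immediate is my reading of the insertps
encoding, not compiler output):
////
#include <smmintrin.h>  /* SSE4.1: _mm_insert_ps */

/* imm 0x55 = 01 01 0101b: take src lane 1 (bits 7:6), write it to
   dst lane 1 (bits 5:4), and zero dst lanes 0 and 2 (bits 3:0).
   Result: { 0, A[1], 0, A[3] }, exactly what foo() returns. */
__m128 foo_one_insertps(__m128 A) {
  return _mm_insert_ps(A, A, 0x55);
}
////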
Excluding the last 'movaps' which, as you said, could be removed if we
teach the backend how to commute the operands of a blendps, I noticed
two things:
1) we use a pshufd instead of a shufps (this causes a domain crossing,
since pshufd is an integer-domain instruction operating on
floating-point data);
2) the shuffle mask printed for the pshufd looks odd (the immediate
should be 8 bits only; I am assuming that -27 was intended here).
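Decoding the low 8 bits of that immediate supports the -27 reading; a
small standalone check (purely illustrative):
////
#include <stdio.h>

int main(void) {
  unsigned imm = (unsigned)-27 & 0xFF;  /* 0xE5 */
  for (int lane = 0; lane < 4; ++lane)  /* two selector bits per lane */
    printf("lane %d <- src[%u]\n", lane, (imm >> (2 * lane)) & 3u);
  /* Prints source indices 1,1,2,3, matching the "xmm1 = xmm0[1,1,2,3]"
     comment, so 4294967269 is just -27 printed as a 32-bit unsigned. */
  return 0;
}
////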
Thanks again,
Andrea