[llvm] r217744 - [x86] Teach the new vector shuffle lowering to use BLENDPS and BLENDPD.
Andrea Di Biagio
andrea.dibiagio at gmail.com
Mon Sep 15 03:43:38 PDT 2014
Really nice! Thanks for this patch.
On Mon, Sep 15, 2014 at 12:43 AM, Chandler Carruth <chandlerc at gmail.com> wrote:
> Author: chandlerc
> Date: Sun Sep 14 18:43:33 2014
> New Revision: 217744
>
> URL: http://llvm.org/viewvc/llvm-project?rev=217744&view=rev
> Log:
> [x86] Teach the new vector shuffle lowering to use BLENDPS and BLENDPD.
>
> These are super simple. They even take precedence over crazy
> instructions like INSERTPS because they have very high throughput on
> modern x86 chips.
>
> I still have to teach the integer shuffle variants about this to avoid
> so many domain crossings. However, due to the particular instructions
> available, that's a touch more complex and so a separate patch.
>
> Also, the backend doesn't seem to realize it can commute blend
> instructions by negating the mask. That would help remove a number of
> copies here. Suggestions on how to do this welcome, it's an area I'm
> less familiar with.
I guess you are referring to this test in particular:
 ; SSE41-LABEL: @shuffle_v4f32_4zzz
-; SSE41: insertps {{.*}} # xmm0 = xmm0[0],zero,zero,zero
+; SSE41: xorps %[[X:xmm[0-9]+]], %[[X]]
+; SSE41-NEXT: blendps {{.*}} # [[X]] = xmm0[0],[[X]][1,2,3]
+; SSE41-NEXT: movaps %[[X]], %xmm0
 ; SSE41-NEXT: retq
If we commute the blendps then we can get rid of the extra movaps.
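For reference, the identity itself is simple. Here is a minimal C sketch
(the function names are mine, purely illustrative), assuming the usual
_mm_blend_ps semantics where a set bit i selects lane i of the second
operand:
////
#include <smmintrin.h>  /* SSE4.1: _mm_blend_ps */

/* blend(a, b, m) == blend(b, a, ~m & 0xF): swapping the operands and
   inverting the 4-bit mask yields the same result. */
__m128 blend_forward(__m128 a, __m128 b) {
  return _mm_blend_ps(a, b, 0xE);  /* { a[0], b[1], b[2], b[3] } */
}

__m128 blend_commuted(__m128 a, __m128 b) {
  return _mm_blend_ps(b, a, 0x1);  /* same vector, operands swapped */
}
////
If the backend applied that rewrite here, the blendps could write its
result directly into %xmm0 and the trailing movaps would disappear.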
Also, in this test, I am not sure that a xorps+blendps pair does better
than a single insertps. If it doesn't, we might want to check whether
one of the operands to the blend is an all-zeros vector and prefer a
single insertps over the xorps+blendps combo.
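To make the two alternatives concrete, here is a hedged sketch (the
function names are mine, and the immediates follow my reading of the
encodings, not anything the compiler emits):
////
#include <smmintrin.h>  /* SSE4.1: _mm_blend_ps, _mm_insert_ps */

/* <a[0], 0, 0, 0> via the current xorps+blendps lowering. */
__m128 via_xorps_blendps(__m128 a) {
  __m128 z = _mm_setzero_ps();     /* xorps */
  return _mm_blend_ps(z, a, 0x1);  /* blendps: lane 0 from a */
}

/* The same vector as a single insertps: imm 0x0E copies src lane 0
   into dst lane 0 and zeroes lanes 1-3 via the low-4-bit zero mask. */
__m128 via_insertps(__m128 a) {
  return _mm_insert_ps(a, a, 0x0E);
}
////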
Incidentally, while looking at the new codegen, I noticed something
odd in the generated assembly for some hand-written examples.
Example:
;;;;
define <4 x float> @foo(<4 x float> %A) {
  %vecext = extractelement <4 x float> %A, i32 1
  %vecinit = insertelement <4 x float> <float 0.0, float undef, float undef, float undef>, float %vecext, i32 1
  %vecinit1 = insertelement <4 x float> %vecinit, float 0.0, i32 2
  %vecinit3 = shufflevector <4 x float> %vecinit1, <4 x float> %A, <4 x i32> <i32 0, i32 1, i32 2, i32 7>
  ret <4 x float> %vecinit3
}
;;;;
The above IR is obtained from:
////
#include <xmmintrin.h>

__m128 foo(__m128 A) {
  return (__m128){0.0f, A[1], 0.0f, A[3]};
}
////
The good news is that your new shuffle lowering algorithm definitely
improves the codegen for that example.
The bad news is that we still generate a pshufd+insertps+blendps
sequence, while a single insertps would work in this case:
(with -mcpu=corei7 -x86-experimental-vector-shuffle-lowering)
pshufd $4294967269, %xmm0, %xmm1 # xmm1 = xmm0[1,1,2,3]
insertps $29, %xmm1, %xmm1 # %xmm1 = zero,xmm1[0],zero,zero
blendps $8, %xmm0, %xmm1 # %xmm1 = xmm1[0,1,2],%xmm0[3]
movaps %xmm1, %xmm0
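For comparison, here is the single-insertps alternative as a hedged
intrinsics sketch (the 0x55 immediate is my reading of the insertps
encoding, not compiler output):
////
#include <smmintrin.h>  /* SSE4.1: _mm_insert_ps */

/* imm 0x55 = 01 01 0101b: take src lane 1 (bits 7:6), write it to
   dst lane 1 (bits 5:4), and zero dst lanes 0 and 2 (bits 3:0).
   Result: { 0, A[1], 0, A[3] }, exactly what foo() returns. */
__m128 foo_one_insertps(__m128 A) {
  return _mm_insert_ps(A, A, 0x55);
}
////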
Excluding the last 'movaps' which, as you said, could be removed if we
teach the backend how to commute the operands of a blendps, I noticed
two things:
1) we use a pshufd instead of a shufps (this causes a domain crossing,
since pshufd is an integer-domain instruction operating on
floating-point data);
2) the shuffle mask printed for the pshufd looks odd (the immediate
should be 8 bits only; I am assuming that -27 was intended here).
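Decoding the low 8 bits of that immediate supports the -27 reading; a
small standalone check (purely illustrative):
////
#include <stdio.h>

int main(void) {
  unsigned imm = (unsigned)-27 & 0xFF;  /* 0xE5 */
  for (int lane = 0; lane < 4; ++lane)  /* two selector bits per lane */
    printf("lane %d <- src[%u]\n", lane, (imm >> (2 * lane)) & 3u);
  /* Prints source indices 1,1,2,3, matching the "xmm1 = xmm0[1,1,2,3]"
     comment, so 4294967269 is just -27 printed as a 32-bit unsigned. */
  return 0;
}
////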
Thanks again,
Andrea