[PATCH][x86] Teach how to combine a vselect into a movss/movsd.
Andrea Di Biagio
andrea.dibiagio at gmail.com
Mon Jan 20 11:22:20 PST 2014
Hi Nadav,
I attached a new version of the patch.
The rules are as follows (a short IR illustration follows the list):
1. fold (vselect (build_vector (0, -1, -1, -1)), A, B) -> (movss A, B)
2. fold (vselect (build_vector (-1, 0, 0, 0)), A, B) -> (movss B, A)
3. fold (vselect (build_vector (0, -1)), A, B) -> (movsd A, B)
4. fold (vselect (build_vector (-1, 0)), A, B) -> (movsd B, A)
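As an illustration of rules 1 and 3, IR like the following is now lowered
to a single movss/movsd (these mirror the new test18 and test20 cases added
to test/CodeGen/X86/vselect.ll; the function names below are only for the
example):

define <4 x float> @fold_to_movss(<4 x float> %a, <4 x float> %b) {
  ; mask <0,-1,-1,-1>: lane 0 comes from %b, lanes 1-3 from %a -> (movss %a, %b)
  %sel = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %a, <4 x float> %b
  ret <4 x float> %sel
}

define <2 x double> @fold_to_movsd(<2 x double> %a, <2 x double> %b) {
  ; mask <0,-1>: lane 0 comes from %b, lane 1 from %a -> (movsd %a, %b)
  %sel = select <2 x i1> <i1 false, i1 true>, <2 x double> %a, <2 x double> %b
  ret <2 x double> %sel
}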
The differences with respect to the previous version are:
- the target-specific combine on VSELECT nodes is now run after types
are legalized (i.e. !DCI.isBeforeLegalize()).
- I slightly simplified the algorithms (no if statements inside the loops).
- I used std::swap as suggested by Juergen.
I also investigated whether it was possible to enable this new
transformation after DAG legalization.
However, the custom lowering of build_vector DAG nodes changed the DAG
sequence in a way that made it really hard to recognize my original
patterns.
In general, the build_vector DAG nodes used for the vselect mask are
first expanded into a vector_shuffle of constants and eventually
combined into either an X86ISD::VZEXT_MOVL or a bitcast of a load from
the target constant pool.
For simplicity, I therefore decided to enable the combine only after
types are legalized (i.e. !DCI.isBeforeLegalize()), as described above.
Please let me know what you think about this new version of the patch
and whether it is OK to submit.
Thanks!
Andrea
On Fri, Jan 17, 2014 at 9:19 PM, Andrea Di Biagio
<andrea.dibiagio at gmail.com> wrote:
> Hi Nadav and Juergen,
>
> On Fri, Jan 17, 2014 at 8:16 PM, Nadav Rotem <nrotem at apple.com> wrote:
>> Thanks for working on this Andrea. The transformation itself is okay, but I am worried about problems that may show up if this optimization were to fire too early, before other optimizations have a chance to optimize this select. This is really a lowering transformation. I mention this because very few optimizations can (or should have to) optimize x86-specific nodes. For example, maybe A and B could be optimized into constants at some point, but this optimization would prevent us from doing anything about it. I suggest that you make sure this optimization only runs after the operations are legalized.
>
> True, it is safer to run this after nodes are legalized.
>
> I'll change the patch so that the optimization runs after legalization.
> (I will also introduce the std::swap as suggested by Juergen).
>
> Thanks for the reviews!
> Andrea
>
>>
>> On Jan 16, 2014, at 5:42 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> this patch teaches the x86 backend how to combine vselect dag nodes
>>> into movss/movsd when possible.
>>>
>>> If the vector type of the operands of the vselect is either
>>> MVT::v4i32 or MVT::v4f32, then we can fold according to the following rules:
>>>
>>> 1. fold (vselect (build_vector (0, -1, -1, -1)), A, B) -> (movss A, B);
>>> 2. fold (vselect (build_vector (-1, 0, 0, 0)), A, B) -> (movss B, A)
>>>
>>> If the vector type of the operands of the vselect is either
>>> MVT::v2i64 or MVT::v2f64 (and we have SSE2), then we can fold
>>> according to the following rules:
>>>
>>> 3. fold (vselect (build_vector (0, -1)), A, B) -> (movsd A, B)
>>> 4. fold (vselect (build_vector (-1, 0)), A, B) -> (movsd B, A)
>>>
>>> I added extra test cases to file 'test/CodeGen/X86/vselect.ll' in
>>> order to verify that we correctly select movss/movsd instructions.
>>>
>>> Before this change, the backend only knew how to lower a shufflevector
>>> into an X86Movss/X86Movsd, but not how to do the same with vselect dag
>>> nodes.
>>> For that reason, all the ISel patterns introduced at r197145
>>> http://llvm.org/viewvc/llvm-project?view=revision&revision=197145
>>> were only matched when the X86Movss/X86Movsd node was obtained from the
>>> custom lowering of a shufflevector.
>>>
>>> With this change, the backend is now able to combine vselect into
>>> X86Movss and therefore it can reuse the patterns from revision 197145
>>> to further simplify packed vector arithmetic operations.
>>>
>>> I added new test-cases in 'test/CodeGen/X86/sse-scalar-fp-arith-2.ll'
>>> to verify that now we correctly select SSE/AVX scalar fp instructions
>>> from a packed arithmetic instruction followed by a vselect.
>>>
>>> After this change, the following tests started failing because they
>>> always expected blendvps/blendvpd instructions in the output assembly:
>>> test/CodeGen/X86/sse2-blend.ll
>>> test/CodeGen/X86/avx-blend.ll
>>> test/CodeGen/X86/blend-msb.ll
>>> test/CodeGen/X86/sse41-blend.ll
>>>
>>> Now that the backend knows how to efficiently emit movss/movsd, all of
>>> these failures are expected: for these tests the backend now selects
>>> movss/movsd instead of blendvps/blendvpd.
>>>
>>> I modified those failing tests so that, when possible, the generated
>>> assembly still contains the expected blendvps/blendvpd (see for example
>>> how I changed avx-blend.ll).
>>> In all other cases I simply changed the CHECK lines to verify that we
>>> produce a movss/movsd.
>>>
>>> Please let me know if it is OK to submit.
>>>
>>> Thanks,
>>> Andrea Di Biagio
>>> SN Systems - Sony Computer Entertainment Group.
>>> <patch-vselect.diff>
>>
-------------- next part --------------
Index: lib/Target/X86/X86ISelLowering.cpp
===================================================================
--- lib/Target/X86/X86ISelLowering.cpp (revision 199676)
+++ lib/Target/X86/X86ISelLowering.cpp (working copy)
@@ -17155,6 +17155,41 @@
}
}
+ // Try to fold this VSELECT into a MOVSS/MOVSD
+ if (N->getOpcode() == ISD::VSELECT &&
+ Cond.getOpcode() == ISD::BUILD_VECTOR && !DCI.isBeforeLegalize()) {
+ if (VT == MVT::v4i32 || VT == MVT::v4f32 ||
+ (Subtarget->hasSSE2() && (VT == MVT::v2i64 || VT == MVT::v2f64))) {
+ bool CanFold = false;
+ unsigned NumElems = Cond.getNumOperands();
+ SDValue A = LHS;
+ SDValue B = RHS;
+
+ if (isZero(Cond.getOperand(0))) {
+ CanFold = true;
+
+ // fold (vselect <0,-1,-1,-1>, A, B) -> (movss A, B)
+ // fold (vselect <0,-1>, A, B) -> (movsd A, B)
+ for (unsigned i = 1, e = NumElems; i != e && CanFold; ++i)
+ CanFold = isAllOnes(Cond.getOperand(i));
+ } else if (isAllOnes(Cond.getOperand(0))) {
+ CanFold = true;
+ std::swap(A, B);
+
+ // fold (vselect <-1,0,0,0>, A, B) -> (movss B, A)
+ // fold (vselect <-1,0>, A, B) -> (movsd B, A)
+ for (unsigned i = 1, e = NumElems; i != e && CanFold; ++i)
+ CanFold = isZero(Cond.getOperand(i));
+ }
+
+ if (CanFold) {
+ if (VT == MVT::v4i32 || VT == MVT::v4f32)
+ return getTargetShuffleNode(X86ISD::MOVSS, DL, VT, A, B, DAG);
+ return getTargetShuffleNode(X86ISD::MOVSD, DL, VT, A, B, DAG);
+ }
+ }
+ }
+
// If we know that this node is legal then we know that it is going to be
// matched by one of the SSE/AVX BLEND instructions. These instructions only
// depend on the highest bit in each word. Try to use SimplifyDemandedBits
Index: test/CodeGen/X86/sse41-blend.ll
===================================================================
--- test/CodeGen/X86/sse41-blend.ll (revision 199676)
+++ test/CodeGen/X86/sse41-blend.ll (working copy)
@@ -4,7 +4,7 @@
;CHECK: blendvps
;CHECK: ret
define <4 x float> @vsel_float(<4 x float> %v1, <4 x float> %v2) {
- %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x float> %v1, <4 x float> %v2
+ %vsel = select <4 x i1> <i1 true, i1 false, i1 true, i1 true>, <4 x float> %v1, <4 x float> %v2
ret <4 x float> %vsel
}
@@ -13,7 +13,7 @@
;CHECK: blendvps
;CHECK: ret
define <4 x i8> @vsel_4xi8(<4 x i8> %v1, <4 x i8> %v2) {
- %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i8> %v1, <4 x i8> %v2
+ %vsel = select <4 x i1> <i1 true, i1 true, i1 false, i1 false>, <4 x i8> %v1, <4 x i8> %v2
ret <4 x i8> %vsel
}
@@ -21,7 +21,7 @@
;CHECK: blendvps
;CHECK: ret
define <4 x i16> @vsel_4xi16(<4 x i16> %v1, <4 x i16> %v2) {
- %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i16> %v1, <4 x i16> %v2
+ %vsel = select <4 x i1> <i1 true, i1 false, i1 true, i1 true>, <4 x i16> %v1, <4 x i16> %v2
ret <4 x i16> %vsel
}
@@ -30,13 +30,13 @@
;CHECK: blendvps
;CHECK: ret
define <4 x i32> @vsel_i32(<4 x i32> %v1, <4 x i32> %v2) {
- %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i32> %v1, <4 x i32> %v2
+ %vsel = select <4 x i1> <i1 true, i1 true, i1 false, i1 false>, <4 x i32> %v1, <4 x i32> %v2
ret <4 x i32> %vsel
}
;CHECK-LABEL: vsel_double:
-;CHECK: blendvpd
+;CHECK: movsd
;CHECK: ret
define <4 x double> @vsel_double(<4 x double> %v1, <4 x double> %v2) {
%vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x double> %v1, <4 x double> %v2
@@ -45,7 +45,7 @@
;CHECK-LABEL: vsel_i64:
-;CHECK: blendvpd
+;CHECK: movsd
;CHECK: ret
define <4 x i64> @vsel_i64(<4 x i64> %v1, <4 x i64> %v2) {
%vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i64> %v1, <4 x i64> %v2
Index: test/CodeGen/X86/vselect.ll
===================================================================
--- test/CodeGen/X86/vselect.ll (revision 199676)
+++ test/CodeGen/X86/vselect.ll (working copy)
@@ -174,3 +174,91 @@
; CHECK-NOT: xorps
; CHECK: ret
+define <4 x float> @test18(<4 x float> %a, <4 x float> %b) {
+ %1 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %a, <4 x float> %b
+ ret <4 x float> %1
+}
+; CHECK-LABEL: test18
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movss
+; CHECK: ret
+
+define <4 x i32> @test19(<4 x i32> %a, <4 x i32> %b) {
+ %1 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x i32> %a, <4 x i32> %b
+ ret <4 x i32> %1
+}
+; CHECK-LABEL: test19
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movss
+; CHECK: ret
+
+define <2 x double> @test20(<2 x double> %a, <2 x double> %b) {
+ %1 = select <2 x i1> <i1 false, i1 true>, <2 x double> %a, <2 x double> %b
+ ret <2 x double> %1
+}
+; CHECK-LABEL: test20
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movsd
+; CHECK: ret
+
+define <2 x i64> @test21(<2 x i64> %a, <2 x i64> %b) {
+ %1 = select <2 x i1> <i1 false, i1 true>, <2 x i64> %a, <2 x i64> %b
+ ret <2 x i64> %1
+}
+; CHECK-LABEL: test21
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movsd
+; CHECK: ret
+
+define <4 x float> @test22(<4 x float> %a, <4 x float> %b) {
+ %1 = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x float> %a, <4 x float> %b
+ ret <4 x float> %1
+}
+; CHECK-LABEL: test22
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movss
+; CHECK: ret
+
+define <4 x i32> @test23(<4 x i32> %a, <4 x i32> %b) {
+ %1 = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i32> %a, <4 x i32> %b
+ ret <4 x i32> %1
+}
+; CHECK-LABEL: test23
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movss
+; CHECK: ret
+
+define <2 x double> @test24(<2 x double> %a, <2 x double> %b) {
+ %1 = select <2 x i1> <i1 true, i1 false>, <2 x double> %a, <2 x double> %b
+ ret <2 x double> %1
+}
+; CHECK-LABEL: test24
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movsd
+; CHECK: ret
+
+define <2 x i64> @test25(<2 x i64> %a, <2 x i64> %b) {
+ %1 = select <2 x i1> <i1 true, i1 false>, <2 x i64> %a, <2 x i64> %b
+ ret <2 x i64> %1
+}
+; CHECK-LABEL: test25
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movsd
+; CHECK: ret
+
Index: test/CodeGen/X86/sse-scalar-fp-arith-2.ll
===================================================================
--- test/CodeGen/X86/sse-scalar-fp-arith-2.ll (revision 199676)
+++ test/CodeGen/X86/sse-scalar-fp-arith-2.ll (working copy)
@@ -213,3 +213,211 @@
; CHECK-NOT: movsd
; CHECK: ret
+
+define <4 x float> @test3_add_ss(<4 x float> %a, <4 x float> %b) {
+ %1 = fadd <4 x float> %a, %b
+ %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %a, <4 x float> %1
+ ret <4 x float> %2
+}
+
+; CHECK-LABEL: test3_add_ss
+; SSE2: addss %xmm1, %xmm0
+; AVX: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <4 x float> @test3_sub_ss(<4 x float> %a, <4 x float> %b) {
+ %1 = fsub <4 x float> %a, %b
+ %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %a, <4 x float> %1
+ ret <4 x float> %2
+}
+
+; CHECK-LABEL: test3_sub_ss
+; SSE2: subss %xmm1, %xmm0
+; AVX: vsubss %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <4 x float> @test3_mul_ss(<4 x float> %a, <4 x float> %b) {
+ %1 = fmul <4 x float> %a, %b
+ %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %a, <4 x float> %1
+ ret <4 x float> %2
+}
+
+; CHECK-LABEL: test3_mul_ss
+; SSE2: mulss %xmm1, %xmm0
+; AVX: vmulss %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <4 x float> @test3_div_ss(<4 x float> %a, <4 x float> %b) {
+ %1 = fdiv <4 x float> %a, %b
+ %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %a, <4 x float> %1
+ ret <4 x float> %2
+}
+
+; CHECK-LABEL: test3_div_ss
+; SSE2: divss %xmm1, %xmm0
+; AVX: vdivss %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <2 x double> @test3_add_sd(<2 x double> %a, <2 x double> %b) {
+ %1 = fadd <2 x double> %a, %b
+ %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %a, <2 x double> %1
+ ret <2 x double> %2
+}
+
+; CHECK-LABEL: test3_add_sd
+; SSE2: addsd %xmm1, %xmm0
+; AVX: vaddsd %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <2 x double> @test3_sub_sd(<2 x double> %a, <2 x double> %b) {
+ %1 = fsub <2 x double> %a, %b
+ %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %a, <2 x double> %1
+ ret <2 x double> %2
+}
+
+; CHECK-LABEL: test3_sub_sd
+; SSE2: subsd %xmm1, %xmm0
+; AVX: vsubsd %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <2 x double> @test3_mul_sd(<2 x double> %a, <2 x double> %b) {
+ %1 = fmul <2 x double> %a, %b
+ %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %a, <2 x double> %1
+ ret <2 x double> %2
+}
+
+; CHECK-LABEL: test3_mul_sd
+; SSE2: mulsd %xmm1, %xmm0
+; AVX: vmulsd %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <2 x double> @test3_div_sd(<2 x double> %a, <2 x double> %b) {
+ %1 = fdiv <2 x double> %a, %b
+ %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %a, <2 x double> %1
+ ret <2 x double> %2
+}
+
+; CHECK-LABEL: test3_div_sd
+; SSE2: divsd %xmm1, %xmm0
+; AVX: vdivsd %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <4 x float> @test4_add_ss(<4 x float> %a, <4 x float> %b) {
+ %1 = fadd <4 x float> %b, %a
+ %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %b, <4 x float> %1
+ ret <4 x float> %2
+}
+
+; CHECK-LABEL: test4_add_ss
+; SSE2: addss %xmm0, %xmm1
+; AVX: vaddss %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <4 x float> @test4_sub_ss(<4 x float> %a, <4 x float> %b) {
+ %1 = fsub <4 x float> %b, %a
+ %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %b, <4 x float> %1
+ ret <4 x float> %2
+}
+
+; CHECK-LABEL: test4_sub_ss
+; SSE2: subss %xmm0, %xmm1
+; AVX: vsubss %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <4 x float> @test4_mul_ss(<4 x float> %a, <4 x float> %b) {
+ %1 = fmul <4 x float> %b, %a
+ %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %b, <4 x float> %1
+ ret <4 x float> %2
+}
+
+; CHECK-LABEL: test4_mul_ss
+; SSE2: mulss %xmm0, %xmm1
+; AVX: vmulss %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <4 x float> @test4_div_ss(<4 x float> %a, <4 x float> %b) {
+ %1 = fdiv <4 x float> %b, %a
+ %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %b, <4 x float> %1
+ ret <4 x float> %2
+}
+
+; CHECK-LABEL: test4_div_ss
+; SSE2: divss %xmm0, %xmm1
+; AVX: vdivss %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <2 x double> @test4_add_sd(<2 x double> %a, <2 x double> %b) {
+ %1 = fadd <2 x double> %b, %a
+ %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %b, <2 x double> %1
+ ret <2 x double> %2
+}
+
+; CHECK-LABEL: test4_add_sd
+; SSE2: addsd %xmm0, %xmm1
+; AVX: vaddsd %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <2 x double> @test4_sub_sd(<2 x double> %a, <2 x double> %b) {
+ %1 = fsub <2 x double> %b, %a
+ %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %b, <2 x double> %1
+ ret <2 x double> %2
+}
+
+; CHECK-LABEL: test4_sub_sd
+; SSE2: subsd %xmm0, %xmm1
+; AVX: vsubsd %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <2 x double> @test4_mul_sd(<2 x double> %a, <2 x double> %b) {
+ %1 = fmul <2 x double> %b, %a
+ %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %b, <2 x double> %1
+ ret <2 x double> %2
+}
+
+; CHECK-LABEL: test4_mul_sd
+; SSE2: mulsd %xmm0, %xmm1
+; AVX: vmulsd %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <2 x double> @test4_div_sd(<2 x double> %a, <2 x double> %b) {
+ %1 = fdiv <2 x double> %b, %a
+ %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %b, <2 x double> %1
+ ret <2 x double> %2
+}
+
+; CHECK-LABEL: test4_div_sd
+; SSE2: divsd %xmm0, %xmm1
+; AVX: vdivsd %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
Index: test/CodeGen/X86/avx-blend.ll
===================================================================
--- test/CodeGen/X86/avx-blend.ll (revision 199676)
+++ test/CodeGen/X86/avx-blend.ll (working copy)
@@ -6,7 +6,7 @@
;CHECK: vblendvps
;CHECK: ret
define <4 x float> @vsel_float(<4 x float> %v1, <4 x float> %v2) {
- %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x float> %v1, <4 x float> %v2
+ %vsel = select <4 x i1> <i1 true, i1 false, i1 true, i1 false>, <4 x float> %v1, <4 x float> %v2
ret <4 x float> %vsel
}
@@ -15,13 +15,13 @@
;CHECK: vblendvps
;CHECK: ret
define <4 x i32> @vsel_i32(<4 x i32> %v1, <4 x i32> %v2) {
- %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i32> %v1, <4 x i32> %v2
+ %vsel = select <4 x i1> <i1 true, i1 false, i1 true, i1 false>, <4 x i32> %v1, <4 x i32> %v2
ret <4 x i32> %vsel
}
;CHECK-LABEL: vsel_double:
-;CHECK: vblendvpd
+;CHECK: vmovsd
;CHECK: ret
define <2 x double> @vsel_double(<2 x double> %v1, <2 x double> %v2) {
%vsel = select <2 x i1> <i1 true, i1 false>, <2 x double> %v1, <2 x double> %v2
@@ -30,7 +30,7 @@
;CHECK-LABEL: vsel_i64:
-;CHECK: vblendvpd
+;CHECK: vmovsd
;CHECK: ret
define <2 x i64> @vsel_i64(<2 x i64> %v1, <2 x i64> %v2) {
%vsel = select <2 x i1> <i1 true, i1 false>, <2 x i64> %v1, <2 x i64> %v2
Index: test/CodeGen/X86/blend-msb.ll
===================================================================
--- test/CodeGen/X86/blend-msb.ll (revision 199676)
+++ test/CodeGen/X86/blend-msb.ll (working copy)
@@ -1,13 +1,11 @@
; RUN: llc < %s -mtriple=x86_64-apple-darwin -mcpu=corei7 -mattr=+sse4.1 | FileCheck %s
-; In this test we check that sign-extend of the mask bit is performed by
-; shifting the needed bit to the MSB, and not using shl+sra.
+; Verify that we produce movss instead of blendvps when possible.
;CHECK-LABEL: vsel_float:
-;CHECK: movl $-1
-;CHECK-NEXT: movd
-;CHECK-NEXT: blendvps
+;CHECK-NOT: blendvps
+;CHECK: movss
;CHECK: ret
define <4 x float> @vsel_float(<4 x float> %v1, <4 x float> %v2) {
%vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x float> %v1, <4 x float> %v2
@@ -15,9 +13,8 @@
}
;CHECK-LABEL: vsel_4xi8:
-;CHECK: movl $-1
-;CHECK-NEXT: movd
-;CHECK-NEXT: blendvps
+;CHECK-NOT: blendvps
+;CHECK: movss
;CHECK: ret
define <4 x i8> @vsel_4xi8(<4 x i8> %v1, <4 x i8> %v2) {
%vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i8> %v1, <4 x i8> %v2
Index: test/CodeGen/X86/sse2-blend.ll
===================================================================
--- test/CodeGen/X86/sse2-blend.ll (revision 199676)
+++ test/CodeGen/X86/sse2-blend.ll (working copy)
@@ -1,9 +1,9 @@
; RUN: llc < %s -march=x86 -mcpu=yonah -mattr=+sse2,-sse4.1 | FileCheck %s
-; CHECK: vsel_float
-; CHECK: xorps
+; CHECK-LABEL: vsel_float
+; CHECK-NOT: xorps
; CHECK: movss
-; CHECK: orps
+; CHECK-NOT: orps
; CHECK: ret
define void @vsel_float(<4 x float>* %v1, <4 x float>* %v2) {
%A = load <4 x float>* %v1
@@ -13,10 +13,17 @@
ret void
}
-; CHECK: vsel_i32
-; CHECK: xorps
+define <4 x i32> @foo(<4 x i32> %v1, <4 x i32> %v2) {
+ %and1 = and <4 x i32> %v1, <i32 -1, i32 0, i32 0, i32 0>
+ %and2 = and <4 x i32> %v2, <i32 0, i32 -1, i32 -1, i32 -1>
+ %result = or <4 x i32> %and1, %and2
+ ret <4 x i32> %result
+}
+
+; CHECK-LABEL: vsel_i32
+; CHECK-NOT: xorps
; CHECK: movss
-; CHECK: orps
+; CHECK-NOT: orps
; CHECK: ret
define void @vsel_i32(<4 x i32>* %v1, <4 x i32>* %v2) {
%A = load <4 x i32>* %v1
@@ -27,7 +34,7 @@
}
; Without forcing instructions, fall back to the preferred PS domain.
-; CHECK: vsel_i64
+; CHECK-LABEL: vsel_i64
; CHECK: andnps
; CHECK: orps
; CHECK: ret
@@ -41,7 +48,7 @@
}
; Without forcing instructions, fall back to the preferred PS domain.
-; CHECK: vsel_double
+; CHECK-LABEL: vsel_double
; CHECK: andnps
; CHECK: orps
; CHECK: ret