[PATCH][x86] Teach how to combine a vselect into a movss/movsd.

Mon Jan 20 11:22:20 PST 2014

Hi Nadav,

I attached a new version of the patch.

Rules are:
 1.  fold (vselect (build_vector (0, -1, -1, -1)), A, B) -> (movss A, B)
 2.  fold (vselect (build_vector (-1, 0, 0, 0)), A, B) -> (movss B, A)
 3.  fold (vselect (build_vector (0, -1)), A, B) -> (movsd A, B)
 4.  fold (vselect (build_vector (-1, 0)), A, B) -> (movsd B, A)
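
To make the operand order in the four rules easier to check, here is a small standalone sketch that does not use any LLVM API. It assumes a simplified model in which movss/movsd takes lane 0 from its second operand and every other lane from its first; `vselect`, `movs` and `rulesHold` are hypothetical helper names, and vector lanes are modelled as plain ints.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Lane-wise select: lane i is A[i] where the mask lane is set (-1), else B[i].
template <std::size_t N>
std::array<int, N> vselect(const std::array<bool, N> &Mask,
                           const std::array<int, N> &A,
                           const std::array<int, N> &B) {
  std::array<int, N> R = B;
  for (std::size_t i = 0; i != N; ++i)
    if (Mask[i])
      R[i] = A[i];
  return R;
}

// Assumed model of the movss/movsd node: lane 0 from B, other lanes from A.
template <std::size_t N>
std::array<int, N> movs(const std::array<int, N> &A,
                        const std::array<int, N> &B) {
  std::array<int, N> R = A;
  R[0] = B[0];
  return R;
}

// Returns true if all four rules hold on one set of sample operands.
inline bool rulesHold() {
  const std::array<int, 4> A4 = {{1, 2, 3, 4}}, B4 = {{5, 6, 7, 8}};
  const std::array<int, 2> A2 = {{1, 2}}, B2 = {{5, 6}};
  const std::array<bool, 4> M1 = {{false, true, true, true}};
  const std::array<bool, 4> M2 = {{true, false, false, false}};
  const std::array<bool, 2> M3 = {{false, true}};
  const std::array<bool, 2> M4 = {{true, false}};
  return vselect(M1, A4, B4) == movs(A4, B4) &&  // rule 1
         vselect(M2, A4, B4) == movs(B4, A4) &&  // rule 2
         vselect(M3, A2, B2) == movs(A2, B2) &&  // rule 3
         vselect(M4, A2, B2) == movs(B2, A2);    // rule 4
}
```

Under that model, rules 2 and 4 are rules 1 and 3 with the two operands exchanged.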

The differences with respect to the previous version are:
 - the target specific combine on VSELECT nodes is now run after types
are legalized (i.e. !DCI.isBeforeLegalize()).
 - I slightly simplified the algorithms (no if-stmt in the loops).
 - I used std::swap as suggested by Juergen.
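
As a rough illustration of the simplified matching (not the code in the patch itself), the mask check can be sketched outside of LLVM as follows. Mask lanes are modelled as ints that are either 0 or -1, and `Fold` and `classifyMask` are hypothetical names; `Fold::Swapped` stands for the case where the patch exchanges A and B with std::swap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Fold { None, KeepOrder, Swapped };

// Classifies a constant mask whose lanes are 0 (all zeros) or -1 (all ones).
//   <0,-1,...,-1> -> KeepOrder  (movs A, B)
//   <-1,0,...,0>  -> Swapped    (movs B, A)
inline Fold classifyMask(const std::vector<int> &Mask) {
  if (Mask.size() < 2)
    return Fold::None;
  const int First = Mask[0];
  if (First != 0 && First != -1)
    return Fold::None;

  // Every remaining lane must hold the opposite value of lane 0.
  const int Rest = (First == 0) ? -1 : 0;
  bool CanFold = true;
  // No branch in the loop body: the loop condition stops at the first mismatch.
  for (std::size_t i = 1, e = Mask.size(); i != e && CanFold; ++i)
    CanFold = (Mask[i] == Rest);

  if (!CanFold)
    return Fold::None;
  return First == 0 ? Fold::KeepOrder : Fold::Swapped;
}
```

For example, `classifyMask({0, -1, -1, -1})` gives `Fold::KeepOrder`, while a mixed mask such as `{-1, 0, -1, 0}` gives `Fold::None` and would be left to the blend lowering.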

I also investigated whether it was possible to enable this new
transformation after DAG legalization.
However, the custom lowering of build_vector dag nodes changed the dag
sequence in a way that made it really hard to recognize my original
patterns.
In general, build_vector dag nodes used for the vselect Mask are
first expanded into a 'vector_shuffle' of constants and eventually
combined into either an X86ISD::VZEXT_MOVL or a 'bitcast of a load from
target constant pool'.
For simplicity, I eventually decided to enable the combine only after
types are legalized (i.e. `!DCI.isBeforeLegalize()`).

Please let me know what you think about this new version of the patch
and whether it is OK to submit.

Thanks!
Andrea

On Fri, Jan 17, 2014 at 9:19 PM, Andrea Di Biagio
<andrea.dibiagio at gmail.com> wrote:
> Hi Nadav and Juergen,
>
> On Fri, Jan 17, 2014 at 8:16 PM, Nadav Rotem <nrotem at apple.com> wrote:
>> Thanks for working on this, Andrea. The transformation itself is okay, but I am worried about problems that may show up if this optimization were to fire too early, before other optimizations have a chance to optimize this select. This is really a lowering transformation. I mention this because very few optimizations can (or should have to) optimize x86-specific nodes. For example, maybe A and B could be optimized into constants at some point, but this optimization would prevent us from doing anything about it. I suggest that you make sure that this optimization only runs after the operations are legalized.
>
> True, it is safer to run this after nodes are legalized.
>
> I'll change the patch so that the optimization runs after legalization.
> (I will also introduce the std::swap as suggested by Juergen).
>
> Thanks for the reviews!
> Andrea
>
>>
>> On Jan 16, 2014, at 5:42 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> this patch teaches the x86 backend how to combine vselect dag nodes
>>> into movss/movsd when possible.
>>>
>>> If the vector type of the operands of the vselect is either
>>> MVT::v4i32 or MVT::v4f32, then we can fold according to the following rules:
>>>
>>> 1.  fold (vselect (build_vector (0, -1, -1, -1)), A, B) -> (movss A, B);
>>> 2.  fold (vselect (build_vector (-1, 0, 0, 0)), A, B) -> (movss B, A)
>>>
>>> If the vector type of the operands of the vselect is either
>>> MVT::v2i64 or MVT::v2f64 (and we have SSE2), then we can fold
>>> according to the following rules:
>>>
>>>  3.  fold (vselect (build_vector (0, -1)), A, B) -> (movsd A, B)
>>>  4.  fold (vselect (build_vector (-1, 0)), A, B) -> (movsd B, A)
>>>
>>> I added extra test cases to file 'test/CodeGen/X86/vselect.ll' in
>>> order to verify that we correctly select movss/movsd instructions.
>>>
>>> Before this change, the backend only knew how to lower a shufflevector
>>> into an X86Movss/X86Movsd, but not how to do the same with vselect dag
>>> nodes.
>>> For that reason, all the ISel patterns introduced at r197145
>>> http://llvm.org/viewvc/llvm-project?view=revision&revision=197145
>>> were only matched if the X86Movss/X86Movsd were obtained from the
>>> custom lowering of a shufflevector.
>>>
>>> With this change, the backend is now able to combine vselect into
>>> X86Movss and therefore it can reuse the patterns from revision 197145
>>> to further simplify packed vector arithmetic operations.
>>>
>>> I added new test-cases in 'test/CodeGen/X86/sse-scalar-fp-arith-2.ll'
>>> to verify that now we correctly select SSE/AVX scalar fp instructions
>>> from a packed arithmetic instruction followed by a vselect.
>>>
>>> After this change, the following tests started failing because they
>>> always expected blendvps/blendvpd instructions in the output assembly:
>>>  test/CodeGen/X86/sse2-blend.ll
>>>  test/CodeGen/X86/avx-blend.ll
>>>  test/CodeGen/X86/blend-msb.ll
>>>  test/CodeGen/X86/sse41-blend.ll
>>>
>>> Now that the backend knows how to efficiently emit movss/movsd for
>>> these masks, all the failing cases are expected failures: the tests
>>> assumed that only blendvps/blendvpd could be selected.
>>>
>>> I modified those failing tests so that - when possible - the generated
>>> assembly still contains the expected blendvps/blendvpd (see for example
>>> how I changed avx-blend.ll).
>>> In all other cases I just changed the CHECK lines to verify that we
>>> produce a movss/movsd.
>>>
>>> Please let me know if ok to submit.
>>>
>>> Thanks,
>>> Andrea Di Biagio
>>> SN Systems - Sony Computer Entertainment Group.
>>> <patch-vselect.diff>
>>
-------------- next part --------------
Index: lib/Target/X86/X86ISelLowering.cpp
===================================================================

--- lib/Target/X86/X86ISelLowering.cpp	(revision 199676)
+++ lib/Target/X86/X86ISelLowering.cpp	(working copy)
@@ -17155,6 +17155,41 @@
     }
   }
 
+  // Try to fold this VSELECT into a MOVSS/MOVSD
+  if (N->getOpcode() == ISD::VSELECT &&
+      Cond.getOpcode() == ISD::BUILD_VECTOR && !DCI.isBeforeLegalize()) {
+    if (VT == MVT::v4i32 || VT == MVT::v4f32 ||
+        (Subtarget->hasSSE2() && (VT == MVT::v2i64 || VT == MVT::v2f64))) {
+      bool CanFold = false;
+      unsigned NumElems = Cond.getNumOperands();
+      SDValue A = LHS;
+      SDValue B = RHS;
+      
+      if (isZero(Cond.getOperand(0))) {
+        CanFold = true;
+
+        // fold (vselect <0,-1,-1,-1>, A, B) -> (movss A, B)
+        // fold (vselect <0,-1>, A, B) -> (movsd A, B)
+        for (unsigned i = 1, e = NumElems; i != e && CanFold; ++i)
+          CanFold = isAllOnes(Cond.getOperand(i));
+      } else if (isAllOnes(Cond.getOperand(0))) {
+        CanFold = true;
+        std::swap(A, B);
+
+        // fold (vselect <-1,0,0,0>, A, B) -> (movss B, A)
+        // fold (vselect <-1,0>, A, B) -> (movsd B, A)
+        for (unsigned i = 1, e = NumElems; i != e && CanFold; ++i)
+          CanFold = isZero(Cond.getOperand(i));
+      }
+
+      if (CanFold) {
+        if (VT == MVT::v4i32 || VT == MVT::v4f32)
+          return getTargetShuffleNode(X86ISD::MOVSS, DL, VT, A, B, DAG);
+        return getTargetShuffleNode(X86ISD::MOVSD, DL, VT, A, B, DAG);
+      }
+    }
+  }
+
   // If we know that this node is legal then we know that it is going to be
   // matched by one of the SSE/AVX BLEND instructions. These instructions only
   // depend on the highest bit in each word. Try to use SimplifyDemandedBits
Index: test/CodeGen/X86/sse41-blend.ll
===================================================================
--- test/CodeGen/X86/sse41-blend.ll	(revision 199676)
+++ test/CodeGen/X86/sse41-blend.ll	(working copy)
@@ -4,7 +4,7 @@
 ;CHECK: blendvps
 ;CHECK: ret
 define <4 x float> @vsel_float(<4 x float> %v1, <4 x float> %v2) {
-  %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x float> %v1, <4 x float> %v2
+  %vsel = select <4 x i1> <i1 true, i1 false, i1 true, i1 true>, <4 x float> %v1, <4 x float> %v2
   ret <4 x float> %vsel
 }
 
@@ -13,7 +13,7 @@
 ;CHECK: blendvps
 ;CHECK: ret
 define <4 x i8> @vsel_4xi8(<4 x i8> %v1, <4 x i8> %v2) {
-  %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i8> %v1, <4 x i8> %v2
+  %vsel = select <4 x i1> <i1 true, i1 true, i1 false, i1 false>, <4 x i8> %v1, <4 x i8> %v2
   ret <4 x i8> %vsel
 }
 
@@ -21,7 +21,7 @@
 ;CHECK: blendvps
 ;CHECK: ret
 define <4 x i16> @vsel_4xi16(<4 x i16> %v1, <4 x i16> %v2) {
-  %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i16> %v1, <4 x i16> %v2
+  %vsel = select <4 x i1> <i1 true, i1 false, i1 true, i1 true>, <4 x i16> %v1, <4 x i16> %v2
   ret <4 x i16> %vsel
 }
 
@@ -30,13 +30,13 @@
 ;CHECK: blendvps
 ;CHECK: ret
 define <4 x i32> @vsel_i32(<4 x i32> %v1, <4 x i32> %v2) {
-  %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i32> %v1, <4 x i32> %v2
+  %vsel = select <4 x i1> <i1 true, i1 true, i1 false, i1 false>, <4 x i32> %v1, <4 x i32> %v2
   ret <4 x i32> %vsel
 }
 
 
 ;CHECK-LABEL: vsel_double:
-;CHECK: blendvpd
+;CHECK: movsd
 ;CHECK: ret
 define <4 x double> @vsel_double(<4 x double> %v1, <4 x double> %v2) {
   %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x double> %v1, <4 x double> %v2
@@ -45,7 +45,7 @@
 
 
 ;CHECK-LABEL: vsel_i64:
-;CHECK: blendvpd
+;CHECK: movsd
 ;CHECK: ret
 define <4 x i64> @vsel_i64(<4 x i64> %v1, <4 x i64> %v2) {
   %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i64> %v1, <4 x i64> %v2
Index: test/CodeGen/X86/vselect.ll
===================================================================
--- test/CodeGen/X86/vselect.ll	(revision 199676)
+++ test/CodeGen/X86/vselect.ll	(working copy)
@@ -174,3 +174,91 @@
 ; CHECK-NOT: xorps
 ; CHECK: ret
 
+define <4 x float> @test18(<4 x float> %a, <4 x float> %b) {
+  %1 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %a, <4 x float> %b
+  ret <4 x float> %1
+}
+; CHECK-LABEL: test18
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movss
+; CHECK: ret
+
+define <4 x i32> @test19(<4 x i32> %a, <4 x i32> %b) {
+  %1 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x i32> %a, <4 x i32> %b
+  ret <4 x i32> %1
+}
+; CHECK-LABEL: test19
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movss
+; CHECK: ret
+
+define <2 x double> @test20(<2 x double> %a, <2 x double> %b) {
+  %1 = select <2 x i1> <i1 false, i1 true>, <2 x double> %a, <2 x double> %b
+  ret <2 x double> %1
+}
+; CHECK-LABEL: test20
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movsd
+; CHECK: ret
+
+define <2 x i64> @test21(<2 x i64> %a, <2 x i64> %b) {
+  %1 = select <2 x i1> <i1 false, i1 true>, <2 x i64> %a, <2 x i64> %b
+  ret <2 x i64> %1
+}
+; CHECK-LABEL: test21
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movsd
+; CHECK: ret
+
+define <4 x float> @test22(<4 x float> %a, <4 x float> %b) {
+  %1 = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x float> %a, <4 x float> %b
+  ret <4 x float> %1
+}
+; CHECK-LABEL: test22
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movss
+; CHECK: ret
+
+define <4 x i32> @test23(<4 x i32> %a, <4 x i32> %b) {
+  %1 = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i32> %a, <4 x i32> %b
+  ret <4 x i32> %1
+}
+; CHECK-LABEL: test23
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movss
+; CHECK: ret
+
+define <2 x double> @test24(<2 x double> %a, <2 x double> %b) {
+  %1 = select <2 x i1> <i1 true, i1 false>, <2 x double> %a, <2 x double> %b
+  ret <2 x double> %1
+}
+; CHECK-LABEL: test24
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movsd
+; CHECK: ret
+
+define <2 x i64> @test25(<2 x i64> %a, <2 x i64> %b) {
+  %1 = select <2 x i1> <i1 true, i1 false>, <2 x i64> %a, <2 x i64> %b
+  ret <2 x i64> %1
+}
+; CHECK-LABEL: test25
+; CHECK-NOT: psllw
+; CHECK-NOT: psraw
+; CHECK-NOT: xorps
+; CHECK: movsd
+; CHECK: ret
+
Index: test/CodeGen/X86/sse-scalar-fp-arith-2.ll
===================================================================
--- test/CodeGen/X86/sse-scalar-fp-arith-2.ll	(revision 199676)
+++ test/CodeGen/X86/sse-scalar-fp-arith-2.ll	(working copy)
@@ -213,3 +213,211 @@
 ; CHECK-NOT: movsd
 ; CHECK: ret
 
+
+define <4 x float> @test3_add_ss(<4 x float> %a, <4 x float> %b) {
+  %1 = fadd <4 x float> %a, %b
+  %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %a, <4 x float> %1
+  ret <4 x float> %2
+}
+
+; CHECK-LABEL: test3_add_ss
+; SSE2: addss   %xmm1, %xmm0
+; AVX: vaddss   %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <4 x float> @test3_sub_ss(<4 x float> %a, <4 x float> %b) {
+  %1 = fsub <4 x float> %a, %b
+  %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %a, <4 x float> %1
+  ret <4 x float> %2
+}
+
+; CHECK-LABEL: test3_sub_ss
+; SSE2: subss   %xmm1, %xmm0
+; AVX: vsubss   %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <4 x float> @test3_mul_ss(<4 x float> %a, <4 x float> %b) {
+  %1 = fmul <4 x float> %a, %b
+  %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %a, <4 x float> %1
+  ret <4 x float> %2
+}
+
+; CHECK-LABEL: test3_mul_ss
+; SSE2: mulss   %xmm1, %xmm0
+; AVX: vmulss   %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <4 x float> @test3_div_ss(<4 x float> %a, <4 x float> %b) {
+  %1 = fdiv <4 x float> %a, %b
+  %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %a, <4 x float> %1
+  ret <4 x float> %2
+}
+
+; CHECK-LABEL: test3_div_ss
+; SSE2: divss   %xmm1, %xmm0
+; AVX: vdivss   %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <2 x double> @test3_add_sd(<2 x double> %a, <2 x double> %b) {
+  %1 = fadd <2 x double> %a, %b
+  %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %a, <2 x double> %1
+  ret <2 x double> %2
+}
+
+; CHECK-LABEL: test3_add_sd
+; SSE2: addsd   %xmm1, %xmm0
+; AVX: vaddsd   %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <2 x double> @test3_sub_sd(<2 x double> %a, <2 x double> %b) {
+  %1 = fsub <2 x double> %a, %b
+  %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %a, <2 x double> %1
+  ret <2 x double> %2
+}
+
+; CHECK-LABEL: test3_sub_sd
+; SSE2: subsd   %xmm1, %xmm0
+; AVX: vsubsd   %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <2 x double> @test3_mul_sd(<2 x double> %a, <2 x double> %b) {
+  %1 = fmul <2 x double> %a, %b
+  %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %a, <2 x double> %1
+  ret <2 x double> %2
+}
+
+; CHECK-LABEL: test3_mul_sd
+; SSE2: mulsd   %xmm1, %xmm0
+; AVX: vmulsd   %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <2 x double> @test3_div_sd(<2 x double> %a, <2 x double> %b) {
+  %1 = fdiv <2 x double> %a, %b
+  %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %a, <2 x double> %1
+  ret <2 x double> %2
+}
+
+; CHECK-LABEL: test3_div_sd
+; SSE2: divsd   %xmm1, %xmm0
+; AVX: vdivsd   %xmm1, %xmm0, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <4 x float> @test4_add_ss(<4 x float> %a, <4 x float> %b) {
+  %1 = fadd <4 x float> %b, %a
+  %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %b, <4 x float> %1
+  ret <4 x float> %2
+}
+
+; CHECK-LABEL: test4_add_ss
+; SSE2: addss   %xmm0, %xmm1
+; AVX: vaddss   %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <4 x float> @test4_sub_ss(<4 x float> %a, <4 x float> %b) {
+  %1 = fsub <4 x float> %b, %a
+  %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %b, <4 x float> %1
+  ret <4 x float> %2
+}
+
+; CHECK-LABEL: test4_sub_ss
+; SSE2: subss   %xmm0, %xmm1
+; AVX: vsubss   %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <4 x float> @test4_mul_ss(<4 x float> %a, <4 x float> %b) {
+  %1 = fmul <4 x float> %b, %a
+  %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %b, <4 x float> %1
+  ret <4 x float> %2
+}
+
+; CHECK-LABEL: test4_mul_ss
+; SSE2: mulss   %xmm0, %xmm1
+; AVX: vmulss   %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <4 x float> @test4_div_ss(<4 x float> %a, <4 x float> %b) {
+  %1 = fdiv <4 x float> %b, %a
+  %2 = select <4 x i1> <i1 false, i1 true, i1 true, i1 true>, <4 x float> %b, <4 x float> %1
+  ret <4 x float> %2
+}
+
+; CHECK-LABEL: test4_div_ss
+; SSE2: divss   %xmm0, %xmm1
+; AVX: vdivss   %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movss
+; CHECK: ret
+
+
+define <2 x double> @test4_add_sd(<2 x double> %a, <2 x double> %b) {
+  %1 = fadd <2 x double> %b, %a
+  %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %b, <2 x double> %1
+  ret <2 x double> %2
+}
+
+; CHECK-LABEL: test4_add_sd
+; SSE2: addsd   %xmm0, %xmm1
+; AVX: vaddsd   %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <2 x double> @test4_sub_sd(<2 x double> %a, <2 x double> %b) {
+  %1 = fsub <2 x double> %b, %a
+  %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %b, <2 x double> %1
+  ret <2 x double> %2
+}
+
+; CHECK-LABEL: test4_sub_sd
+; SSE2: subsd   %xmm0, %xmm1
+; AVX: vsubsd   %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <2 x double> @test4_mul_sd(<2 x double> %a, <2 x double> %b) {
+  %1 = fmul <2 x double> %b, %a
+  %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %b, <2 x double> %1
+  ret <2 x double> %2
+}
+
+; CHECK-LABEL: test4_mul_sd
+; SSE2: mulsd   %xmm0, %xmm1
+; AVX: vmulsd   %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
+
+define <2 x double> @test4_div_sd(<2 x double> %a, <2 x double> %b) {
+  %1 = fdiv <2 x double> %b, %a
+  %2 = select <2 x i1> <i1 false, i1 true>, <2 x double> %b, <2 x double> %1
+  ret <2 x double> %2
+}
+
+; CHECK-LABEL: test4_div_sd
+; SSE2: divsd   %xmm0, %xmm1
+; AVX: vdivsd   %xmm0, %xmm1, %xmm0
+; CHECK-NOT: movsd
+; CHECK: ret
+
Index: test/CodeGen/X86/avx-blend.ll
===================================================================
--- test/CodeGen/X86/avx-blend.ll	(revision 199676)
+++ test/CodeGen/X86/avx-blend.ll	(working copy)
@@ -6,7 +6,7 @@
 ;CHECK: vblendvps
 ;CHECK: ret
 define <4 x float> @vsel_float(<4 x float> %v1, <4 x float> %v2) {
-  %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x float> %v1, <4 x float> %v2
+  %vsel = select <4 x i1> <i1 true, i1 false, i1 true, i1 false>, <4 x float> %v1, <4 x float> %v2
   ret <4 x float> %vsel
 }
 
@@ -15,13 +15,13 @@
 ;CHECK: vblendvps
 ;CHECK: ret
 define <4 x i32> @vsel_i32(<4 x i32> %v1, <4 x i32> %v2) {
-  %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i32> %v1, <4 x i32> %v2
+  %vsel = select <4 x i1> <i1 true, i1 false, i1 true, i1 false>, <4 x i32> %v1, <4 x i32> %v2
   ret <4 x i32> %vsel
 }
 
 
 ;CHECK-LABEL: vsel_double:
-;CHECK: vblendvpd
+;CHECK: vmovsd
 ;CHECK: ret
 define <2 x double> @vsel_double(<2 x double> %v1, <2 x double> %v2) {
   %vsel = select <2 x i1> <i1 true, i1 false>, <2 x double> %v1, <2 x double> %v2
@@ -30,7 +30,7 @@
 
 
 ;CHECK-LABEL: vsel_i64:
-;CHECK: vblendvpd
+;CHECK: vmovsd
 ;CHECK: ret
 define <2 x i64> @vsel_i64(<2 x i64> %v1, <2 x i64> %v2) {
   %vsel = select <2 x i1> <i1 true, i1 false>, <2 x i64> %v1, <2 x i64> %v2
Index: test/CodeGen/X86/blend-msb.ll
===================================================================
--- test/CodeGen/X86/blend-msb.ll	(revision 199676)
+++ test/CodeGen/X86/blend-msb.ll	(working copy)
@@ -1,13 +1,11 @@
 ; RUN: llc < %s -mtriple=x86_64-apple-darwin -mcpu=corei7 -mattr=+sse4.1 | FileCheck %s
 
 
-; In this test we check that sign-extend of the mask bit is performed by
-; shifting the needed bit to the MSB, and not using shl+sra.
+; Verify that we produce movss instead of blendvps when possible.
 
 ;CHECK-LABEL: vsel_float:
-;CHECK: movl $-1
-;CHECK-NEXT: movd
-;CHECK-NEXT: blendvps
+;CHECK-NOT: blendvps
+;CHECK: movss
 ;CHECK: ret
 define <4 x float> @vsel_float(<4 x float> %v1, <4 x float> %v2) {
   %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x float> %v1, <4 x float> %v2
@@ -15,9 +13,8 @@
 }
 
 ;CHECK-LABEL: vsel_4xi8:
-;CHECK: movl $-1
-;CHECK-NEXT: movd
-;CHECK-NEXT: blendvps
+;CHECK-NOT: blendvps
+;CHECK: movss
 ;CHECK: ret
 define <4 x i8> @vsel_4xi8(<4 x i8> %v1, <4 x i8> %v2) {
   %vsel = select <4 x i1> <i1 true, i1 false, i1 false, i1 false>, <4 x i8> %v1, <4 x i8> %v2
Index: test/CodeGen/X86/sse2-blend.ll
===================================================================
--- test/CodeGen/X86/sse2-blend.ll	(revision 199676)
+++ test/CodeGen/X86/sse2-blend.ll	(working copy)
@@ -1,9 +1,9 @@
 ; RUN: llc < %s -march=x86 -mcpu=yonah -mattr=+sse2,-sse4.1 | FileCheck %s
 
-; CHECK: vsel_float
-; CHECK: xorps
+; CHECK-LABEL: vsel_float
+; CHECK-NOT: xorps
 ; CHECK: movss
-; CHECK: orps
+; CHECK-NOT: orps
 ; CHECK: ret
 define void @vsel_float(<4 x float>* %v1, <4 x float>* %v2) {
   %A = load <4 x float>* %v1
@@ -13,10 +13,17 @@
   ret void
 }
 
-; CHECK: vsel_i32
-; CHECK: xorps
+define <4 x i32> @foo(<4 x i32> %v1, <4 x i32> %v2) {
+  %and1 = and <4 x i32> %v1, <i32 -1, i32 0, i32 0, i32 0>
+  %and2 = and <4 x i32> %v2, <i32 0, i32 -1, i32 -1, i32 -1>
+  %result = or <4 x i32> %and1, %and2
+  ret <4 x i32> %result
+}
+
+; CHECK-LABEL: vsel_i32
+; CHECK-NOT: xorps
 ; CHECK: movss
-; CHECK: orps
+; CHECK-NOT: orps
 ; CHECK: ret
 define void @vsel_i32(<4 x i32>* %v1, <4 x i32>* %v2) {
   %A = load <4 x i32>* %v1
@@ -27,7 +34,7 @@
 }
 
 ; Without forcing instructions, fall back to the preferred PS domain.
-; CHECK: vsel_i64
+; CHECK-LABEL: vsel_i64
 ; CHECK: andnps
 ; CHECK: orps
 ; CHECK: ret
@@ -41,7 +48,7 @@
 }
 
 ; Without forcing instructions, fall back to the preferred PS domain.
-; CHECK: vsel_double
+; CHECK-LABEL: vsel_double
 ; CHECK: andnps
 ; CHECK: orps
 ; CHECK: ret