[PATCH][x86] If possible, always prefer to lower a VECTOR_SHUFFLE into a BLENDI instead of SHUFP (or VPERM2F128).

Andrea Di Biagio andrea.dibiagio at gmail.com
Wed Jun 25 10:12:25 PDT 2014


Hi all,

I noticed that method 'LowerVECTOR_SHUFFLE' always prefers to lower a
shuffle into a target-specific SHUFP DAG node instead of a BLENDI.
Also, on AVX we often prefer the slower VPERM2F128 over BLENDI.

This happens only because we check 'isBlendMask' after we have already
called 'isSHUFPMask' and 'isVPERM2X128Mask'.

My opinion is that we should give higher precedence to the check for
'isBlendMask': when possible, we should first check whether the shuffle
performs a blend and, if so, lower it into a BLENDI instead of
selecting a SHUFP or (worse) a VPERM2X128.
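
To make the precedence issue concrete, here is a minimal stand-alone
sketch (my own illustration, not code from the patch) of the property
that 'isBlendMask' essentially checks, ignoring the SSE4.1/AVX legality
details the real function also handles: a shuffle is a blend when every
result element is taken from the same lane position of either input.
The mask <0, 5, 2, 7> from the updated tests below is an example of a
mask that also satisfies 'isSHUFPMask', which is why the order of the
checks matters.

// Hypothetical re-statement of the blend property (not the real
// isBlendMask, which also checks subtarget legality constraints).
#include <cstdio>
#include <vector>

static bool looksLikeBlend(const std::vector<int> &Mask) {
  const int NumElems = static_cast<int>(Mask.size());
  for (int i = 0; i < NumElems; ++i) {
    if (Mask[i] < 0)
      continue;                        // undef lane matches anything
    if (Mask[i] != i && Mask[i] != i + NumElems)
      return false;                    // element changes lane: not a blend
  }
  return true;
}

int main() {
  // <0, 5, 2, 7> is both a valid SHUFPD mask (immediate 10) and a blend;
  // with the reordered checks it should now be lowered to (v)blendpd $10.
  std::vector<int> Mask = {0, 5, 2, 7};
  std::printf("blend? %s\n", looksLikeBlend(Mask) ? "yes" : "no");
  return 0;
}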

To validate this idea, I ran some microbenchmarks on Haswell and AMD
Jaguar.
I measured the throughput of a long sequence (~80) of instructions of
the same kind inside a loop of several iterations (I tested SHUFPS/D,
BLENDPS/D, and VPERM2F128).
All measurements are in CPU clock cycles (I used __rdtscp to read the
time-stamp counter).
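
For illustration, here is a simplified sketch of this kind of
throughput measurement (not the exact harness; the real sequences were
~80 instructions long and also covered the AVX variants):

// Throughput microbenchmark sketch (compile with `clang++ -O2 -msse4.1`):
// four independent dependency chains of the instruction under test,
// timed with __rdtscp. Swap _mm_blend_ps for _mm_shuffle_ps (same
// immediate) to compare the two.
#include <immintrin.h>
#include <x86intrin.h>
#include <cstdio>

int main() {
  __m128 a0 = _mm_set_ps(1, 2, 3, 4), a1 = a0, a2 = a0, a3 = a0;
  __m128 b  = _mm_set_ps(5, 6, 7, 8);
  const long Iters = 10 * 1000 * 1000;

  unsigned Aux;
  unsigned long long Start = __rdtscp(&Aux);
  for (long i = 0; i < Iters; ++i) {
    asm volatile("" : "+x"(b));      // keep b opaque so the blends are not folded
    a0 = _mm_blend_ps(a0, b, 0xA);   // independent chains measure throughput,
    a1 = _mm_blend_ps(a1, b, 0xA);   // not latency
    a2 = _mm_blend_ps(a2, b, 0xA);
    a3 = _mm_blend_ps(a3, b, 0xA);
  }
  unsigned long long End = __rdtscp(&Aux);

  // Keep the results alive so the compiler cannot discard the loop.
  float Sink = _mm_cvtss_f32(_mm_add_ps(_mm_add_ps(a0, a1), _mm_add_ps(a2, a3)));
  std::printf("cycles/iteration: %.2f (sink=%f)\n",
              double(End - Start) / double(Iters), Sink);
  return 0;
}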

The results for both the SSE and AVX variants of BLENDI and SHUFP seem
to match what is reported by Agner Fog in his "Instruction Tables".
In general:
 - AVX VBLENDPS/D always has better latency and throughput than VPERM2F128;
 - BLENDPS/D instructions tend to have better "reciprocal throughput"
than the equivalent SHUFPS/D;
 - Both BLENDPS/D and SHUFPS/D are often decoded into the same number
of m-ops; however, an m-op obtained from a BLENDPS/D instruction can be
scheduled to more than one execution port.

Example:
From my experiments, BLENDPS/D always seems to have better throughput
than VSHUFPS/D (especially on Haswell).
On Haswell, SHUFPS and BLENDPS have the same latency and are decoded to
the same number of m-ops (i.e. 1). However, the m-op obtained from
decoding a SHUFPS can only go to port 5, while the equivalent m-op from
a BLENDPS can be scheduled to port 0, port 1, or port 5.

Back to the patch. This patch:
 - Moves the check for 'isBlendMask' immediately before the check for
'isSHUFPMask' within method 'LowerVECTOR_SHUFFLE';
 - Updates existing tests for SSE/AVX shuffle/blend instructions to
verify that we select (v)blendps/d when possible (instead of shufps/d
or vperm2f128).

Please let me know what you think.

Thanks,
Andrea Di Biagio
-------------- next part --------------
Index: test/CodeGen/X86/avx-blend.ll
===================================================================
--- test/CodeGen/X86/avx-blend.ll	(revision 211717)
+++ test/CodeGen/X86/avx-blend.ll	(working copy)
@@ -110,7 +110,7 @@
 
 ;CHECK-LABEL: vsel_double4:
 ;CHECK-NOT: vinsertf128
-;CHECK: vshufpd $10
+;CHECK: vblendpd $10
 ;CHECK-NEXT: ret
 define <4 x double> @vsel_double4(<4 x double> %v1, <4 x double> %v2) {
   %vsel = select <4 x i1> <i1 true, i1 false, i1 true, i1 false>, <4 x double> %v1, <4 x double> %v2
Index: test/CodeGen/X86/avx-shuffle.ll
===================================================================
--- test/CodeGen/X86/avx-shuffle.ll	(revision 211717)
+++ test/CodeGen/X86/avx-shuffle.ll	(working copy)
@@ -25,7 +25,7 @@
   %c = shufflevector <4 x i64> %a, <4 x i64> %b, <4 x i32> <i32 4, i32 5, i32 2, i32 undef>
   ret <4 x i64> %c
 ; CHECK-LABEL: test3:
-; CHECK: vperm2f128
+; CHECK: vblendpd
 ; CHECK: ret
 }
 
Index: test/CodeGen/X86/avx-vshufp.ll
===================================================================
--- test/CodeGen/X86/avx-vshufp.ll	(revision 211717)
+++ test/CodeGen/X86/avx-vshufp.ll	(working copy)
@@ -32,14 +32,14 @@
   ret <8 x i32> %shuffle
 }
 
-; CHECK: vshufpd  $10, %ymm
+; CHECK: vblendpd  $10, %ymm
 define <4 x double> @B(<4 x double> %a, <4 x double> %b) nounwind uwtable readnone ssp {
 entry:
   %shuffle = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
   ret <4 x double> %shuffle
 }
 
-; CHECK: vshufpd  $10, (%{{.*}}), %ymm
+; CHECK: vblendpd  $10, (%{{.*}}), %ymm
 define <4 x double> @B2(<4 x double>* %a, <4 x double>* %b) nounwind uwtable readnone ssp {
 entry:
   %a2 = load <4 x double>* %a
@@ -48,14 +48,14 @@
   ret <4 x double> %shuffle
 }
 
-; CHECK: vshufpd  $10, %ymm
+; CHECK: vblendpd  $10, %ymm
 define <4 x i64> @B3(<4 x i64> %a, <4 x i64> %b) nounwind uwtable readnone ssp {
 entry:
   %shuffle = shufflevector <4 x i64> %a, <4 x i64> %b, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
   ret <4 x i64> %shuffle
 }
 
-; CHECK: vshufpd  $10, (%{{.*}}), %ymm
+; CHECK: vblendpd  $10, (%{{.*}}), %ymm
 define <4 x i64> @B4(<4 x i64>* %a, <4 x i64>* %b) nounwind uwtable readnone ssp {
 entry:
   %a2 = load <4 x i64>* %a
@@ -71,7 +71,7 @@
   ret <8 x float> %shuffle
 }
 
-; CHECK: vshufpd  $2, %ymm
+; CHECK: vblendpd  $2, %ymm
 define <4 x double> @D(<4 x double> %a, <4 x double> %b) nounwind uwtable readnone ssp {
 entry:
   %shuffle = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 5, i32 2, i32 undef>
Index: test/CodeGen/X86/avx-vperm2f128.ll
===================================================================
--- test/CodeGen/X86/avx-vperm2f128.ll	(revision 211717)
+++ test/CodeGen/X86/avx-vperm2f128.ll	(working copy)
@@ -9,7 +9,7 @@
 }
 
 ; CHECK: _B
-; CHECK: vperm2f128 $48
+; CHECK: vblendps $240
 define <8 x float> @B(<8 x float> %a, <8 x float> %b) nounwind uwtable readnone ssp {
 entry:
   %shuffle = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 12, i32 13, i32 14, i32 15>
Index: test/CodeGen/X86/combine-or.ll
===================================================================
--- test/CodeGen/X86/combine-or.ll	(revision 211717)
+++ test/CodeGen/X86/combine-or.ll	(working copy)
@@ -74,7 +74,7 @@
 }
 ; CHECK-LABEL: test6
 ; CHECK-NOT: xorps
-; CHECK: shufps
+; CHECK: blendps $12
 ; CHECK-NEXT: ret
 
 
@@ -86,7 +86,7 @@
 }
 ; CHECK-LABEL: test7
 ; CHECK-NOT: xorps
-; CHECK: shufps
+; CHECK: blendps $12
 ; CHECK-NEXT: ret
 
 
Index: lib/Target/X86/X86InstrSSE.td
===================================================================
--- lib/Target/X86/X86InstrSSE.td	(revision 211717)
+++ lib/Target/X86/X86InstrSSE.td	(working copy)
@@ -5374,8 +5374,8 @@
   // - the 1st and 3rd element from the first input vector (the 'fsub' node);
   // - the 2nd and 4th element from the second input vector (the 'fadd' node).
 
-  def : Pat<(v4f64 (X86Shufp (v4f64 (fsub VR256:$lhs, VR256:$rhs)),
-                             (v4f64 (fadd VR256:$lhs, VR256:$rhs)), (i8 10))),
+  def : Pat<(v4f64 (X86Blendi (v4f64 (fsub VR256:$lhs, VR256:$rhs)),
+                             (v4f64 (fadd VR256:$lhs, VR256:$rhs)), (i32 10))),
             (VADDSUBPDYrr VR256:$lhs, VR256:$rhs)>;
   def : Pat<(v4f64 (X86Blendi (v4f64 (fsub VR256:$lhs, VR256:$rhs)),
                               (v4f64 (fadd VR256:$lhs, VR256:$rhs)), (i32 10))),
Index: lib/Target/X86/X86ISelLowering.cpp
===================================================================
--- lib/Target/X86/X86ISelLowering.cpp	(revision 211717)
+++ lib/Target/X86/X86ISelLowering.cpp	(working copy)
@@ -8337,6 +8337,11 @@
                                 getShufflePSHUFLWImmediate(SVOp),
                                 DAG);
 
+  unsigned MaskValue;
+  if (isBlendMask(M, VT, Subtarget->hasSSE41(), Subtarget->hasInt256(),
+                  &MaskValue))
+    return LowerVECTOR_SHUFFLEtoBlend(SVOp, MaskValue, Subtarget, DAG);
+
   if (isSHUFPMask(M, VT))
     return getTargetShuffleNode(X86ISD::SHUFP, dl, VT, V1, V2,
                                 getShuffleSHUFImmediate(SVOp), DAG);
@@ -8374,11 +8379,6 @@
     return getTargetShuffleNode(X86ISD::VPERM2X128, dl, VT, V1,
                                 V2, getShuffleVPERM2X128Immediate(SVOp), DAG);
 
-  unsigned MaskValue;
-  if (isBlendMask(M, VT, Subtarget->hasSSE41(), Subtarget->hasInt256(),
-                  &MaskValue))
-    return LowerVECTOR_SHUFFLEtoBlend(SVOp, MaskValue, Subtarget, DAG);
-
   if (Subtarget->hasSSE41() && isINSERTPSMask(M, VT))
     return getINSERTPS(SVOp, dl, DAG);
 

