[PATCH] [X86] Skip concat_vectors when lowering vector broadcast

Wed Dec 11 09:54:34 PST 2013

Hi all,

With AVX, each of the following two functions can be lowered to a
single vbroadcast instruction that loads from memory (vbroadcastss
for the float case, vbroadcastsd for the double case):

#include <immintrin.h>

__m256 loadSplat4x( const float *p ) {
  __m128 r = _mm_load1_ps( p );
  return (__m256) __builtin_shufflevector( r, r, 0, 0, 0, 0, 0, 0, 0, 0 );
}

__m256d loadSplat8x( const double *p ) {
  __m128d r = _mm_load1_pd( p );
  return (__m256d) __builtin_shufflevector( r, r, 0, 0, 0, 0 );
}

Current output (without AVX2):

loadSplat4x:
  vmovss (%rdi), %xmm0
  vpshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
  vinsertf128 $1, %xmm0, %ymm0, %ymm0
  ret

loadSplat8x:
  vmovsd (%rdi), %xmm0
  vpermilpd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0]
  vinsertf128 $1, %xmm0, %ymm0, %ymm0
  ret

Optimized output:

loadSplat4x:
  vbroadcastss (%rdi), %ymm0
  ret

loadSplat8x:
  vbroadcastsd (%rdi), %ymm0
  ret

In investigating this, I discovered that the x86 backend already tries
to use a vbroadcast instruction for a splat (LowerVectorBroadcast in
X86ISelLowering.cpp).  However, it fails to optimize the above cases
because we end up with a concat_vectors in the selection DAG.

loadSplat4x generates the following IR (simplified):

define <8 x float> @loadSplat4x(float* %p) {
  %1 = load float* %p
  %2 = insertelement <4 x float> undef, float %1, i32 0
  %3 = shufflevector <4 x float> %2, <4 x float> undef, <8 x i32> zeroinitializer
  ret <8 x float> %3
}

This generates the following selection DAG:

%1 = f32 load %p
%2 = v4f32 BUILD_VECTOR %1, %1, %1, %1
%3 = v8f32 concat_vectors %2, undef
%4 = v8f32 vector_shuffle %3, undef, <0,0,0,0,0,0,0,0>
...

LowerVectorBroadcast() is called on the vector_shuffle.  For the
vbroadcast-from-memory pattern, it expects to find a shuffle of either
a scalar_to_vector or a BUILD_VECTOR, but it finds a concat_vectors
instead.  However, the shuffle is a splat of element 0, so only the
first lane of its input matters.  Both the concat_vectors and the
BUILD_VECTOR can therefore be replaced by a splat of the single loaded
f32 value.
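
To make the lane argument concrete, here is a minimal sketch in plain
C++.  It is not LLVM code: Lane, Vec, buildVectorSplat, undefVec,
concatVectors and splatLane0 are names made up for this illustration,
and undef is modelled as an empty std::optional.  It only shows that a
splat of lane 0 gives the same result with or without the
concat_vectors in between.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// A lane is either a known float or undef (an empty optional).
// These types are illustrative only; they are not LLVM data structures.
using Lane = std::optional<float>;
using Vec = std::vector<Lane>;

// Model of a BUILD_VECTOR whose lanes all hold the same scalar.
Vec buildVectorSplat(float x, std::size_t lanes) { return Vec(lanes, Lane(x)); }

// Model of a vector whose lanes are all undef.
Vec undefVec(std::size_t lanes) { return Vec(lanes, Lane{}); }

// Model of concat_vectors: the lanes of lo followed by the lanes of hi.
Vec concatVectors(const Vec &lo, const Vec &hi) {
  Vec out(lo);
  out.insert(out.end(), hi.begin(), hi.end());
  return out;
}

// Model of a vector_shuffle with an all-zero mask:
// every result lane reads lane 0 of the source.
Vec splatLane0(const Vec &src, std::size_t lanes) {
  assert(!src.empty());
  return Vec(lanes, src[0]);
}

// Builds the DAG from the example above and compares it with the
// same splat taken directly from the BUILD_VECTOR.
bool splatIgnoresConcat(float loaded) {
  Vec bv = buildVectorSplat(loaded, 4);       // %2
  Vec cat = concatVectors(bv, undefVec(4));   // %3
  Vec viaConcat = splatLane0(cat, 8);         // %4
  Vec direct = splatLane0(bv, 8);             // concat_vectors skipped
  return viaConcat == direct && direct == buildVectorSplat(loaded, 8);
}
```

In this model the undef upper half of the concat never reaches the
result, which is the property the patch relies on.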

The attached patch fixes this by skipping the concat_vectors during
pattern recognition.  In this case, once the concat_vectors is
skipped, we get a BUILD_VECTOR, and the pattern matches.

The alternative is to try to combine the BUILD_VECTOR/concat_vectors
into a larger BUILD_VECTOR at an earlier stage (e.g.
DAGCombiner::visitCONCAT_VECTORS).  However, unless we only handle the
specific example above, the general case will be much more complex.
The advantage of simply skipping the concat_vectors in the broadcast
lowering is that we don't need to examine its operands at all (it's
just 2 lines of code).  We will also handle
scalar_to_vector/concat_vectors (if that can ever happen), and the
AVX2 fallback to a register-to-register vbroadcast is simplified
because we no longer need to do an Extract128BitVector.
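
The control flow after the patch can be sketched roughly as below.
This is a toy model in standalone C++, not the code from
X86ISelLowering.cpp: Opcode, Node, Lowering and classifySplatSource
are hypothetical names, and the load check is a heavy simplification
of what the real function does (which has further cases, such as
constants and use counts).  The point is only to show the skip and why
the size check must look at the skipped-to node rather than at the
shuffle.

```cpp
#include <cassert>

// Hypothetical stand-ins for SelectionDAG opcodes; not the LLVM API.
enum class Opcode { Load, ScalarToVector, BuildVector, ConcatVectors, Other };

// Toy node: an opcode, the width of its result, and its first operand.
struct Node {
  Opcode op;
  unsigned sizeInBits;
  const Node *operand0;  // nullptr for leaves
};

// The outcomes this sketch distinguishes.
enum class Lowering {
  BroadcastFromMemory,    // vbroadcast with a memory operand
  BroadcastFromRegister,  // AVX2 register form, source already 128 bits
  BroadcastFromLow128,    // AVX2 register form after extracting the low half
  NoBroadcast
};

// Classifies the first operand of a splat shuffle, mirroring the
// shape of the patched code under the simplifications stated above.
Lowering classifySplatSource(const Node *sc, bool hasAVX2) {
  assert(sc != nullptr);

  // A splat only reads lane 0, which lives in the first operand of
  // a concat, so the concat itself can be skipped.
  if (sc->op == Opcode::ConcatVectors && sc->operand0 != nullptr)
    sc = sc->operand0;

  if (sc->op != Opcode::ScalarToVector && sc->op != Opcode::BuildVector) {
    if (!hasAVX2)
      return Lowering::NoBroadcast;
    // Check the width of sc, not of the shuffle result: after the
    // skip, sc may already be a 128-bit value.
    return sc->sizeInBits >= 256 ? Lowering::BroadcastFromLow128
                                 : Lowering::BroadcastFromRegister;
  }

  // Simplification: treat "first operand is a load" as the
  // broadcast-from-memory case.
  if (sc->operand0 != nullptr && sc->operand0->op == Opcode::Load)
    return Lowering::BroadcastFromMemory;
  return Lowering::NoBroadcast;
}
```

For the loadSplat4x DAG (load, BUILD_VECTOR, concat_vectors), this
model reports the memory form with or without AVX2.  For a 128-bit
register argument wrapped in a concat_vectors, it reports the register
form with AVX2 and no broadcast with plain AVX, which is what the
attached test expects.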

Having said that, I'm very new to LLVM development.  At the moment,
simple fixes are preferred to big risky changes but I understand that
generic solutions are more useful in the long run than small specific
fixes.  Opinions and advice welcome!

I do not have commit access, so if you think the patch is acceptable,
please commit it for me.

Thanks,
Rob.

--
Robert Lougher
SN Systems - Sony Computer Entertainment Group
-------------- next part --------------
Index: lib/Target/X86/X86ISelLowering.cpp
===================================================================
--- lib/Target/X86/X86ISelLowering.cpp	(revision 196951)
+++ lib/Target/X86/X86ISelLowering.cpp	(working copy)
@@ -5555,6 +5555,12 @@
         return SDValue();
 
       SDValue Sc = Op.getOperand(0);
+
+      // There may be a concat_vector between the shuffle and the
+      // scalar_to_vector.  As the shuffle is a splat we can safely skip it.
+      if (Sc.getOpcode() == ISD::CONCAT_VECTORS)
+        Sc = Sc.getOperand(0);
+
       if (Sc.getOpcode() != ISD::SCALAR_TO_VECTOR &&
           Sc.getOpcode() != ISD::BUILD_VECTOR) {
 
@@ -5562,7 +5568,9 @@
           return SDValue();
 
         // Use the register form of the broadcast instruction available on AVX2.
-        if (VT.getSizeInBits() >= 256)
+        // As we may have skipped a concat_vector we must check the size of 'Sc'
+        // rather than the size of the shuffle.
+        if (Sc.getSimpleValueType().getSizeInBits() >= 256)
           Sc = Extract128BitVector(Sc, 0, DAG, dl);
         return DAG.getNode(X86ISD::VBROADCAST, dl, VT, Sc);
       }
Index: test/CodeGen/X86/vec_shuf-concat.ll
===================================================================
--- test/CodeGen/X86/vec_shuf-concat.ll	(revision 0)
+++ test/CodeGen/X86/vec_shuf-concat.ll	(revision 0)
@@ -0,0 +1,56 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-linux -mcpu=corei7-avx | FileCheck %s -check-prefix=CHECK -check-prefix=AVX
+; RUN: llc < %s -mtriple=x86_64-unknown-linux -mcpu=core-avx2  | FileCheck %s -check-prefix=CHECK -check-prefix=AVX2
+
+; These tests check that a vbroadcast instruction is used for a shufflevector
+; splat.  The first two functions check that a memory to register vbroadcast
+; is used for a load/splat pair (single and double).  This form of the
+; instruction is available on both AVX and AVX2.  The register to register
+; vbroadcast, however, is not available with AVX.  The last two functions
+; test that a single splat is lowered into a vbroadcast only when AVX2 is
+; supported.
+
+define <8 x float> @loadSplat4x(float* %p) {
+  %1 = load float* %p
+  %2 = insertelement <4 x float> undef, float %1, i32 0
+  %3 = shufflevector <4 x float> %2, <4 x float> undef, <8 x i32> zeroinitializer
+  ret <8 x float> %3
+
+; CHECK-LABEL: loadSplat4x
+; CHECK: vbroadcastss	(%rdi), %ymm0
+; CHECK-NEXT: ret
+}
+
+define <4 x double> @loadSplat8x(double* %p) {
+  %1 = load double* %p
+  %2 = insertelement <2 x double> undef, double %1, i32 0
+  %3 = shufflevector <2 x double> %2, <2 x double> undef, <4 x i32> zeroinitializer
+  ret <4 x double> %3
+
+; CHECK-LABEL: loadSplat8x
+; CHECK: vbroadcastsd	(%rdi), %ymm0
+; CHECK-NEXT: ret
+}
+
+define <8 x float> @splat4x(<4 x float> %r) {
+  %1 = shufflevector <4 x float> %r, <4 x float> undef, <8 x i32> zeroinitializer
+  ret <8 x float> %1
+
+; AVX-LABEL: splat4x
+; AVX-NOT: vbroadcast
+; AVX: ret
+; AVX2-LABEL: splat4x
+; AVX2: vbroadcastss	%xmm0, %ymm0
+; AVX2-NEXT: ret
+}
+
+define <4 x double> @splat8x(<2 x double> %r) {
+  %1 = shufflevector <2 x double> %r, <2 x double> undef, <4 x i32> zeroinitializer
+  ret <4 x double> %1
+
+; AVX-LABEL: splat8x
+; AVX-NOT: vbroadcast
+; AVX: ret
+; AVX2-LABEL: splat8x
+; AVX2: vbroadcastsd	%xmm0, %ymm0
+; AVX2-NEXT: ret
+}