[PATCH] [X86] Skip concat_vectors when lowering vector broadcast

Cameron McInally cameron.mcinally at nyu.edu
Wed Dec 11 11:42:32 PST 2013


Hey Rob,

On Wed, Dec 11, 2013 at 12:54 PM, Robert Lougher <rob.lougher at gmail.com> wrote:
> Hi all,
>
> With AVX, the following two functions can be optimized using a
> vbroadcast instruction loading from memory (either a single or
> double):
>
> __m256 loadSplat4x( const float *p ) {
>   __m128 r = _mm_load1_ps( p );
>   return (__m256) __builtin_shufflevector( r, r, 0, 0, 0, 0, 0, 0, 0, 0 );
> }
>
> __m256d loadSplat8x( const double *p ) {
>   __m128d r = _mm_load1_pd( p );
>   return (__m256d) __builtin_shufflevector( r, r, 0, 0, 0, 0 );
> }
>
> Current output (without AVX2):
>
> loadSplat4x:
>   vmovss (%rdi), %xmm0
>   vpshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
>   vinsertf128 $1, %xmm0, %ymm0, %ymm0
>   ret
>
> loadSplat8x:
>   vmovsd (%rdi), %xmm0
>   vpermilpd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0]
>   vinsertf128 $1, %xmm0, %ymm0, %ymm0
>   ret
>
> Optimized output:
>
> loadSplat4x:
>   vbroadcastss (%rdi), %ymm0
>   ret
>
> loadSplat8x:
>   vbroadcastsd (%rdi), %ymm0
>   ret
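
For anyone who wants to reproduce this, a small standalone harness
along the following lines should work (this is not from the original
report, just an illustrative driver around the two functions above; it
assumes clang for __builtin_shufflevector -- build with something like
"clang++ -O2 -mavx" and check both the printed values and the
generated assembly):

  #include <immintrin.h>
  #include <cstdio>

  __m256 loadSplat4x( const float *p ) {
    __m128 r = _mm_load1_ps( p );
    return (__m256) __builtin_shufflevector( r, r, 0, 0, 0, 0, 0, 0, 0, 0 );
  }

  __m256d loadSplat8x( const double *p ) {
    __m128d r = _mm_load1_pd( p );
    return (__m256d) __builtin_shufflevector( r, r, 0, 0, 0, 0 );
  }

  int main() {
    float f = 1.5f;
    double d = 2.5;
    float fs[8];
    double ds[4];

    _mm256_storeu_ps( fs, loadSplat4x( &f ) );
    _mm256_storeu_pd( ds, loadSplat8x( &d ) );

    for ( int i = 0; i < 8; ++i )
      std::printf( "%f ", fs[i] );   // expect eight copies of 1.5
    std::printf( "\n" );

    for ( int i = 0; i < 4; ++i )
      std::printf( "%f ", ds[i] );   // expect four copies of 2.5
    std::printf( "\n" );

    return 0;
  }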
>
> In investigating this, I discovered that the x86 backend already tries
> to use a vbroadcast instruction for a splat (LowerVectorBroadcast in
> X86ISelLowering.cpp).  However, it fails to optimize the above cases
> because we end up with a concat_vectors in the selection DAG.
>
> loadSplat4x generates the following IR (simplified):
>
> define <8 x float> @loadSplat4x(float* %p) {
>   %1 = load float* %p
>   %2 = insertelement <4 x float> undef, float %1, i32 0
>   %3 = shufflevector <4 x float> %2, <4 x float> undef, <8 x i32> zeroinitializer
>   ret <8 x float> %3
> }
>
> This generates the following selection DAG:
>
> %1 = f32 load %p
> %2 = v4f32 BUILD_VECTOR %1, %1, %1, %1
> %3 = v8f32 concat_vectors %2, undef
> %4 = v8f32 vector_shuffle %3, undef, <0,0,0,0,0,0,0,0>
> ...
>
> LowerVectorBroadcast() is called on the vector_shuffle. For the
> vbroadcast from memory pattern it expects to find a shuffle of either
> a scalar_to_vector or a BUILD_VECTOR but it finds a concat_vectors
> instead.  However, as we're splatting, both the concat_vectors and the
> BUILD_VECTOR can be replaced by a splat of the single loaded f32
> value.
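
In other words, once the broadcast pattern matches, the whole chain
above should collapse to roughly the following (node spelling is
approximate), and the existing instruction-selection patterns can then
fold the load into the memory form of vbroadcastss:

  %1 = f32 load %p
  %2 = v8f32 X86ISD::VBROADCAST %1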
>
> The attached patch fixes this by skipping the concat_vectors during
> pattern recognition.  In this case, once the concat_vectors is
> skipped, we get a BUILD_VECTOR, and the pattern matches.
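
For reference, the change being described presumably amounts to
something like this inside the VECTOR_SHUFFLE handling of
LowerVectorBroadcast (an illustrative sketch only -- the variable
names follow that function, and the attached patch may differ in
detail):

  SDValue Sc = Op.getOperand(0);

  // A splat shuffle only reads lane 0, and a concat_vectors feeding it
  // contributes that lane through its first operand, so look through
  // the concat before the existing checks.
  if (Sc.getOpcode() == ISD::CONCAT_VECTORS)
    Sc = Sc.getOperand(0);

  // The existing SCALAR_TO_VECTOR / BUILD_VECTOR checks then see the
  // v4f32 BUILD_VECTOR, and the broadcast-from-memory path matches.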

Yes, this makes sense to me. I've been carrying this change locally
since late 2012 and, as far as I recall, have not had any problems
with it. IIRC, the broadcast also showed a slight performance gain
over the shuffle-and-insert sequence, though our compiler is currently
on LLVM 3.3.

> The alternative is to try to combine the BUILD_VECTOR/concat_vectors
> into a larger BUILD_VECTOR at an earlier stage (e.g.
> DAGCombiner::visitCONCAT_VECTORS).  However, unless we only handle the
> specific example above, the general case will be much more complex.
> The advantage of simply skipping the concat_vectors in the broadcast
> lowering is that we don't need to examine the operands at all (it's
> just 2 lines of code).  We will also handle
> scalar_to_vector/concat_vectors (if that can ever happen), and the
> AVX2 fallback to a register-to-register vbroadcast will be simplified
> as we no longer need to do an Extract128BitVector.
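
For what it's worth, that alternative would amount to folding the
example DAG above into something like the following before shuffle
lowering runs, at which point LowerVectorBroadcast would already
handle it (sketch):

  %1 = f32 load %p
  %2 = v8f32 BUILD_VECTOR %1, %1, %1, %1, undef, undef, undef, undef
  %3 = v8f32 vector_shuffle %2, undef, <0,0,0,0,0,0,0,0>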
>
> Having said that, I'm very new to LLVM development.  At the moment,
> simple fixes are preferred to big risky changes, but I understand that
> generic solutions are more useful in the long run than small specific
> fixes.  Opinions and advice welcome!
>
> I do not have commit access so if you think the patch is acceptable
> please commit it for me.

There are many ways to skin a cat. ;)

If there are no other comments on a better solution, I will commit
this patch for you.

-Cameron


