[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Quentin Colombet
qcolombet at apple.com
Fri Sep 19 13:36:01 PDT 2014
Hi Andrea,
I think most, if not all, of the regressions are covered by the test cases I previously provided.
Please double-check if you want to avoid reducing them :).
On Sep 19, 2014, at 1:22 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> Hi Chandler,
>
> I have tested the new shuffle lowering on an AMD Jaguar CPU (which
> has AVX but not AVX2).
>
> On this particular target, there is a delay when output data from an
> execution unit is used as input to another execution unit of a
> different cluster. For example, there are 6 execution units, which are
> divided into 3 execution clusters: Float (FPM, FPA), Vector Integer
> (MMXA, MMXB, IMM), and Store (STC). Moving data between clusters costs
> an additional 1-cycle latency penalty.
> Your new shuffle lowering algorithm is very good at keeping the
> computation inside clusters. This is an improvement with respect to
> the "old" shuffle lowering algorithm.
>
> I haven't observed any significant regression in our internal codebase.
> In one particular case I observed a slowdown (around 1%); here is what
> I found when investigating this slowdown.
>
> 1. With the new shuffle lowering, there is one case where we end up
> producing the following sequence:
> vmovss .LCPxx(%rip), %xmm1
> vxorps %xmm0, %xmm0, %xmm0
> vblendps $1, %xmm1, %xmm0, %xmm0
>
> Before, we used to generate a simpler:
> vmovss .LCPxx(%rip), %xmm1
>
> In this particular case, the 'vblendps' is redundant since the vmovss
> would zero the upper bits in %xmm1. I am not sure why we get this
> poor codegen with your new shuffle lowering. I will investigate this
> bug further (maybe we no longer trigger some ISel patterns?) and I
> will try to give you a small reproducer for this particular case.
I think it should already be covered by one of the test cases I provided: none_useless_shuflle.ll
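For reference, a minimal reproducer for this pattern would look roughly like the following (a hypothetical reduction, not necessarily the exact content of that test); ideally this lowers to a single vmovss:

  define <4 x float> @load_into_zero_vector(float* %p) {
    ; Hypothetical reduction: load a scalar into lane 0 of an otherwise
    ; zero vector. vmovss from memory already zeroes the upper lanes, so
    ; no extra vxorps/vblendps should be needed.
    %f = load float* %p
    %v = insertelement <4 x float> zeroinitializer, float %f, i32 0
    ret <4 x float> %v
  }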
>
> 2. There are cases where we no longer fold a vector load in one of
> the operands of a shuffle.
> This is an example:
>
> vmovaps 320(%rsp), %xmm0
> vshufps $-27, %xmm0, %xmm0, %xmm0 # %xmm0 = %xmm0[1,1,2,3]
>
> Before, we used to emit the following sequence:
> # 16-byte Folded reload.
> vpshufd $1, 320(%rsp), %xmm0 # %xmm0 = mem[1,0,0,0]
>
> Note: the shuffle masks are different but still valid because the
> upper bits in %xmm0 are unused. Later on, the code uses register
> %xmm0 in a 'vcvtss2sd' instruction; only the lower 32 bits of %xmm0
> are meaningful in this context.
> As with 1., I'll try to create a small reproducer.
Same here, I think this is already covered by: missing_folding.ll
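The shape of the problem is roughly the following (again a hypothetical reduction, not necessarily what is in that test); the shuffle of the reloaded vector should fold the load, e.g. into a vpshufd with a memory operand:

  define double @shuffle_of_reload(<4 x float>* %p) {
    ; Reload a 16-byte value and move element 1 into lane 0; only lane 0
    ; is used afterwards (by the fpext, i.e. vcvtss2sd), so the load can
    ; be folded into the shuffle.
    %v = load <4 x float>* %p, align 16
    %s = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 1, i32 1, i32 2, i32 3>
    %e = extractelement <4 x float> %s, i32 0
    %d = fpext float %e to double
    ret double %d
  }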
>
> 3. When zero-extending 2 packed 32-bit integers, we should try to
> emit a vpmovzxdq.
> Example:
> vmovq 20(%rbx), %xmm0
> vpshufd $80, %xmm0, %xmm0 # %xmm0 = %xmm0[0,0,1,1]
>
> Before:
> vpmovzxdq 20(%rbx), %xmm0
Probably the same logic as: sse4.1_pmovzxwd.ll
But you can double-check it.
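For completeness, a hypothetical reduction for this one could be along these lines; the zero extension of two packed 32-bit integers loaded from memory should ideally select a single vpmovzxdq with a folded load:

  define <2 x i64> @zext_2xi32(<2 x i32>* %p) {
    ; Zero-extend two packed 32-bit integers; on SSE4.1/AVX targets this
    ; should ideally become vpmovzxdq with a memory operand rather than a
    ; vmovq + vpshufd pair.
    %v = load <2 x i32>* %p, align 8
    %z = zext <2 x i32> %v to <2 x i64>
    ret <2 x i64> %z
  }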
>
> 4. We no longer emit a simpler 'vmovq' in the following case:
> vxorpd %xmm4, %xmm4, %xmm4
> vblendpd $2, %xmm4, %xmm2, %xmm4 # %xmm4 = %xmm2[0],%xmm4[1]
>
> Before, we used to generate:
> vmovq %xmm2, %xmm4
>
> Before, the vmovq implicitly zero-extended the quadword in %xmm2 to
> 128 bits. Now we always do this with a vxorpd+vblendpd pair.
Probably same as: none_useless_shuflle.ll
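A hypothetical reduction for this case: a shuffle that keeps the low 64 bits of a vector and zeroes the upper 64 bits, which should select a single vmovq:

  define <2 x double> @zero_upper_half(<2 x double> %v) {
    ; Keep element 0 of %v and take element 1 from a zero vector;
    ; vmovq already performs this zero extension of the low quadword,
    ; so no vxorpd + vblendpd pair is needed.
    %r = shufflevector <2 x double> %v, <2 x double> zeroinitializer, <2 x i32> <i32 0, i32 3>
    ret <2 x double> %r
  }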
Cheers,
Q.
>
> As I said, I will try to create a smaller reproducer for each of the
> problems I found.
> I hope this helps. I will keep testing.
>
> Thanks,
> Andrea