[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Andrea Di Biagio andrea.dibiagio at gmail.com
Fri Sep 19 13:22:46 PDT 2014


Hi Chandler,

I have tested the new shuffle lowering on an AMD Jaguar CPU (which has
AVX but not AVX2).

On this particular target, there is a delay when the output of an
execution unit is used as input by an execution unit in a different
cluster. For example, there are 6 execution units, divided into 3
execution clusters: Float (FPM, FPA), Vector Integer (MMXA, MMXB, IMM),
and Store (STC). Moving data between clusters costs an additional
1-cycle latency penalty.
Your new shuffle lowering algorithm is very good at keeping the
computation inside clusters. This is an improvement with respect to
the "old" shuffle lowering algorithm.
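
For illustration only, here is the kind of mix where that penalty would
show up (a contrived C sketch, not taken from our codebase; the mapping
of vaddps to the Float cluster and vpshufd to the Vector Integer
cluster is my assumption):

    #include <immintrin.h>

    /* An FP add whose result is consumed by an integer-domain shuffle:
     * assuming vaddps runs in the Float cluster and vpshufd in the
     * Vector Integer cluster, the data has to move between clusters and
     * pays the extra 1-cycle latency. */
    __m128 cross_cluster(__m128 a, __m128 b) {
      __m128  sum  = _mm_add_ps(a, b);              /* vaddps */
      __m128i bits = _mm_castps_si128(sum);         /* no instruction emitted */
      __m128i rev  = _mm_shuffle_epi32(bits, 0x1B); /* vpshufd, reverse lanes */
      return _mm_castsi128_ps(rev);
    }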

I haven't observed any significant regression in our internal codebase.
In one particular case I observed a slowdown (around 1%); here is what
I found while investigating that slowdown.

1.  With the new shuffle lowering, there is one case where we end up
producing the following sequence:
   vmovss .LCPxx(%rip), %xmm1
   vxorps %xmm0, %xmm0, %xmm0
   vblendps $1, %xmm1, %xmm0, %xmm0

Before, we used to generate a simpler:
   vmovss .LCPxx(%rip), %xmm1

In this particular case, the 'vblendps' is redundant, since the
load-form vmovss already zeroes the upper bits of %xmm1. I am not sure
why we get this poor codegen with your new shuffle lowering. I will
investigate this further (maybe we no longer trigger some ISel
patterns?) and try to give you a small reproducible test case for this
particular issue.
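
Something along these lines might reproduce it (an untested guess at a
test case, not the actual code from our codebase; the function name is
made up):

    #include <immintrin.h>

    /* Hypothetical reproducer: build a <4 x float> whose upper lanes are
     * zero from a scalar loaded from memory.  A load-form vmovss already
     * zeroes the upper bits, so no vxorps+vblendps should be needed. */
    __m128 scalar_in_zero_vector(const float *p) {
      return _mm_set_ps(0.0f, 0.0f, 0.0f, *p);
    }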

2.  There are cases where we no longer fold a vector load into one of
the operands of a shuffle.
Here is an example:

     vmovaps  320(%rsp), %xmm0
     vshufps $-27, %xmm0, %xmm0, %xmm0    # %xmm0 = %xmm0[1,1,2,3]

Before, we used to emit the following sequence:
     # 16-byte Folded reload.
     vpshufd $1, 320(%rsp), %xmm0      # %xmm0 = mem[1,0,0,0]

Note: the two shuffle masks are different but both valid because the
upper bits of %xmm0 are unused. Later on, the code uses register %xmm0
in a 'vcvtss2sd' instruction; only the lower 32 bits of %xmm0 have a
meaning in that context.
As with case 1, I'll try to create a small reproducible test case.
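
A rough guess at what that test case might look like (hypothetical; an
aligned load stands in for the 16-byte spill slot):

    #include <immintrin.h>

    /* Hypothetical reproducer: take lane 1 of a vector loaded from memory
     * and widen it to double.  Only the low 32 bits of the shuffled
     * register are live afterwards, which is why both masks above are
     * equally valid. */
    double widen_lane1(const float *spill) {
      __m128 v     = _mm_load_ps(spill);            /* 16-byte aligned load */
      __m128 lane1 = _mm_shuffle_ps(v, v, 0xE5);    /* lanes [1,1,2,3] */
      return (double)_mm_cvtss_f32(lane1);          /* float -> double */
    }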

3.  When zero-extending 2 packed 32-bit integers, we should try to
emit a vpmovzxdq.
Now we generate:
  vmovq  20(%rbx), %xmm0
  vpshufd $80, %xmm0, %xmm0 # %xmm0 = %xmm0[0,0,1,1]

Before:
   vpmovzxdq  20(%rbx), %xmm0
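
The pattern I have in mind is roughly the following (a hypothetical
sketch written with the standard SSE4.1 intrinsics, not the original
code):

    #include <stdint.h>
    #include <immintrin.h>

    /* Hypothetical reproducer: zero extend two packed 32-bit integers
     * loaded from memory to two 64-bit integers.  Ideally this becomes a
     * single load-form vpmovzxdq. */
    __m128i zext_v2i32_to_v2i64(const uint32_t *p) {
      __m128i lo = _mm_loadl_epi64((const __m128i *)p); /* load the 2 dwords */
      return _mm_cvtepu32_epi64(lo);                    /* zero extend to qwords */
    }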

4.  We no longer emit a simpler 'vmovq' in the following case:
   vxorpd %xmm4, %xmm4, %xmm4
   vblendpd $2, %xmm4, %xmm2, %xmm4 # %xmm4 = %xmm2[0],%xmm4[1]

Before, we used to generate:
   vmovq %xmm2, %xmm4

Here, the vmovq implicitly zero-extends the quadword in %xmm2 to 128
bits. Now we always do this with a vxorpd+vblendpd pair.
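
In C, the pattern is roughly this (again just a sketch of what I think
the original code boils down to; the name is made up):

    #include <immintrin.h>

    /* Hypothetical reproducer: keep lane 0 of a <2 x double> and take
     * lane 1 from a zero vector.  A zero-extending reg-reg vmovq is
     * enough here, so the vxorpd+vblendpd pair should not be needed. */
    __m128d keep_low_lane(__m128d v) {
      return _mm_shuffle_pd(v, _mm_setzero_pd(), 0);  /* { v[0], 0.0 } */
    }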

As I said, I will try to create a smaller reproducible test case for
each of the problems I found.
I hope this helps. I will keep testing.

Thanks,
Andrea


