[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Andrea Di Biagio
andrea.dibiagio at gmail.com
Fri Sep 19 13:22:46 PDT 2014
Hi Chandler,
I have tested the new shuffle lowering on an AMD Jaguar CPU (which
has AVX but not AVX2).
On this particular target, there is a delay when the output of one
execution unit is used as input by an execution unit in a different
cluster. There are 6 execution units, divided into 3 execution
clusters: Float (FPM, FPA), Vector Integer (MMXA, MMXB, IMM), and
Store (STC). Moving data between clusters costs an additional 1-cycle
latency penalty.
Your new shuffle lowering algorithm is very good at keeping the
computation inside clusters. This is an improvement with respect to
the "old" shuffle lowering algorithm.
I haven't observed any significant regressions in our internal codebase.
In one particular case I observed a slowdown (around 1%); here is what
I found when investigating it.
1. With the new shuffle lowering, there is one case where we end up
producing the following sequence:
vmovss .LCPxx(%rip), %xmm1
vxorps %xmm0, %xmm0, %xmm0
vblendps $1, %xmm1, %xmm0, %xmm0
Before, we used to generate a simpler:
vmovss .LCPxx(%rip), %xmm1
In this particular case, the 'vblendps' is redundant since the vmovss
load already zeroes the upper bits of %xmm1. I am not sure why we get
this poor codegen with your new shuffle lowering. I will investigate
this bug further (maybe we no longer trigger some ISel patterns?) and I
will try to give you a small reproducible for this particular case.
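In the meantime, the kind of pattern involved looks roughly like the
following (a hypothetical sketch in C with intrinsics, not the actual
code from our codebase):

#include <immintrin.h>

/* Hypothetical sketch: a non-zero constant in lane 0 and zeros in
   lanes 1-3. The vmovss load from the constant pool already zeroes
   the upper lanes, so an extra vxorps+vblendps is redundant. */
__m128 low_lane_constant(void) {
  return _mm_set_ss(1.25f);
}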
2. There are cases where we no longer fold a vector load into one of
the operands of a shuffle.
Here is an example:
vmovaps 320(%rsp), %xmm0
vshufps $-27, %xmm0, %xmm0, %xmm0 # %xmm0 = %xmm0[1,1,2,3]
Before, we used to emit the following sequence:
# 16-byte Folded reload.
vpshufd $1, 320(%rsp), %xmm0 # %xmm0 = mem[1,0,0,0]
Note: the reason the two shuffle masks are different but both valid
is that the upper bits of %xmm0 are unused. Later on, the code uses
%xmm0 in a 'vcvtss2sd' instruction, so only the lower 32 bits of
%xmm0 are meaningful in this context.
As with case 1, I'll try to create a small reproducible.
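The shape of the code is roughly the following (a hypothetical sketch
in C with intrinsics; the real case involves a spill slot rather than
a pointer argument):

#include <immintrin.h>

/* Extract lane 1 of a (re)loaded vector and widen it to double.
   Only the low 32 bits of the shuffle result are used, which is why
   both masks shown above are acceptable. */
double widen_lane1(const float *p) {
  __m128 v = _mm_load_ps(p);                       /* vector reload */
  __m128 s = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 2, 1, 1));
  return (double)_mm_cvtss_f32(s);                 /* vcvtss2sd     */
}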
3. When zero-extending 2 packed 32-bit integers, we should try to
emit a vpmovzxdq.
Now we generate:
vmovq 20(%rbx), %xmm0
vpshufd $80, %xmm0, %xmm0 # %xmm0 = %xmm0[0,0,1,1]
Before:
vpmovzxdq 20(%rbx), %xmm0
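To illustrate the pattern (a hypothetical sketch; the regressed case
presumably comes from a generic zero-extending shuffle in the IR
rather than from intrinsics), the ideal lowering corresponds to what
_mm_cvtepu32_epi64 maps to:

#include <immintrin.h>

/* Zero-extend two packed 32-bit integers to two 64-bit integers.
   Ideally this becomes a single load-folded vpmovzxdq. */
__m128i zext_2x32_to_2x64(const void *p) {
  __m128i lo = _mm_loadl_epi64((const __m128i *)p);  /* vmovq load */
  return _mm_cvtepu32_epi64(lo);                      /* vpmovzxdq  */
}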
4. We no longer emit a simpler 'vmovq' in the following case:
vxorpd %xmm4, %xmm4, %xmm4
vblendpd $2, %xmm4, %xmm2, %xmm4 # %xmm4 = %xmm2[0],%xmm4[1]
Before, we used to generate:
vmovq %xmm2, %xmm4
The vmovq implicitly zero-extends the quadword in %xmm2 to 128 bits.
Now we always do this with a vxorpd+vblendpd pair.
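A sketch of the pattern in C with intrinsics (hypothetical; the
function name is made up):

#include <immintrin.h>

/* Keep the low double of a vector and zero the upper lane. A single
   register-to-register vmovq does this implicitly; the new lowering
   emits vxorpd+vblendpd instead. */
__m128d keep_low_zero_high(__m128d x) {
  return _mm_move_sd(_mm_setzero_pd(), x);
}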
As I said, I will try to create a smaller reproducible for each of the
problems I found.
I hope this helps. I will keep testing.
Thanks,
Andrea