[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Quentin Colombet
qcolombet at apple.com
Fri Sep 19 13:36:01 PDT 2014
Hi Andrea,
I think most, if not all, of the regressions are covered by the test cases I previously provided.
Please double-check if you want to avoid reducing them :).
On Sep 19, 2014, at 1:22 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> Hi Chandler,
>
> I have tested the new shuffle lowering on an AMD Jaguar CPU (which
> has AVX but not AVX2).
>
> On this particular target, there is a delay when output data from an
> execution unit is used as input to another execution unit of a
> different cluster. For example, there are 6 execution units, which are
> divided into 3 execution clusters: Float (FPM, FPA), Vector Integer
> (MMXA, MMXB, IMM), and Store (STC). Moving data between clusters costs
> an additional 1-cycle latency penalty.
> Your new shuffle lowering algorithm is very good at keeping the
> computation inside clusters. This is an improvement with respect to
> the "old" shuffle lowering algorithm.
>
> I haven't observed any significant regression in our internal codebase.
> In one particular case I observed a slowdown (around 1%); here is what
> I found when investigating this slowdown.
>
> 1. With the new shuffle lowering, there is one case where we end up
> producing the following sequence:
> vmovss .LCPxx(%rip), %xmm1
> vxorps %xmm0, %xmm0, %xmm0
> vblendps $1, %xmm1, %xmm0, %xmm0
>
> Before, we used to generate a simpler:
> vmovss .LCPxx(%rip), %xmm1
>
> In this particular case, the 'vblendps' is redundant since the vmovss
> would zero the upper bits in %xmm1. I am not sure why we get this
> poor codegen with your new shuffle lowering. I will investigate this
> bug further (maybe we no longer trigger some ISel patterns?) and I
> will try to give you a small reproducer for this particular case.
I think it should already be covered by one of the test cases I provided: none_useless_shuflle.ll
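For reference, a minimal reproducer for this pattern would look roughly like the following (a hypothetical reduction, not necessarily the exact content of that test); ideally this lowers to a single vmovss:

  define <4 x float> @load_into_zero_vector(float* %p) {
    ; Hypothetical reduction: load a scalar into lane 0 of an otherwise
    ; zero vector. vmovss from memory already zeroes the upper lanes, so
    ; no extra vxorps/vblendps should be needed.
    %f = load float* %p
    %v = insertelement <4 x float> zeroinitializer, float %f, i32 0
    ret <4 x float> %v
  }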
>
> 2. There are cases where we no longer fold a vector load in one of
> the operands of a shuffle.
> This is an example:
>
> vmovaps 320(%rsp), %xmm0
> vshufps $-27, %xmm0, %xmm0, %xmm0 # %xmm0 = %xmm0[1,1,2,3]
>
> Before, we used to emit the following sequence:
> # 16-byte Folded reload.
> vpshufd $1, 320(%rsp), %xmm0 # %xmm0 = mem[1,0,0,0]
>
> Note: the shuffle masks are different but still valid because the
> upper bits in %xmm0 are unused. Later on, the code uses register
> %xmm0 in a 'vcvtss2sd' instruction; only the lower 32 bits of %xmm0
> are meaningful in this context.
> As with 1., I'll try to create a small reproducer.
Same here, I think this is already covered by: missing_folding.ll
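The shape of the problem is roughly the following (again a hypothetical reduction, not necessarily what is in that test); the shuffle of the reloaded vector should fold the load, e.g. into a vpshufd with a memory operand:

  define double @shuffle_of_reload(<4 x float>* %p) {
    ; Reload a 16-byte value and move element 1 into lane 0; only lane 0
    ; is used afterwards (by the fpext, i.e. vcvtss2sd), so the load can
    ; be folded into the shuffle.
    %v = load <4 x float>* %p, align 16
    %s = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 1, i32 1, i32 2, i32 3>
    %e = extractelement <4 x float> %s, i32 0
    %d = fpext float %e to double
    ret double %d
  }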
>
> 3. When zero-extending 2 packed 32-bit integers, we should try to
> emit a vpmovzxdq.
> Example:
> vmovq 20(%rbx), %xmm0
> vpshufd $80, %xmm0, %xmm0 # %xmm0 = %xmm0[0,0,1,1]
>
> Before:
> vpmovzxdq 20(%rbx), %xmm0
Probably the same logic as: sse4.1_pmovzxwd.ll
But you can double-check it.
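For completeness, a hypothetical reduction for this one could be along these lines; the zero extension of two packed 32-bit integers loaded from memory should ideally select a single vpmovzxdq with a folded load:

  define <2 x i64> @zext_2xi32(<2 x i32>* %p) {
    ; Zero-extend two packed 32-bit integers; on SSE4.1/AVX targets this
    ; should ideally become vpmovzxdq with a memory operand rather than a
    ; vmovq + vpshufd pair.
    %v = load <2 x i32>* %p, align 8
    %z = zext <2 x i32> %v to <2 x i64>
    ret <2 x i64> %z
  }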
>
> 4. We no longer emit a simpler 'vmovq' in the following case:
> vxorpd %xmm4, %xmm4, %xmm4
> vblendpd $2, %xmm4, %xmm2, %xmm4 # %xmm4 = %xmm2[0],%xmm4[1]
>
> Before, we used to generate:
> vmovq %xmm2, %xmm4
>
> Before, the vmovq implicitly zero-extended the quadword in %xmm2 to
> 128 bits. Now we always do this with a vxorpd+vblendpd pair.
Probably same as: none_useless_shuflle.ll
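A hypothetical reduction for this case: a shuffle that keeps the low 64 bits of a vector and zeroes the upper 64 bits, which should select a single vmovq:

  define <2 x double> @zero_upper_half(<2 x double> %v) {
    ; Keep element 0 of %v and take element 1 from a zero vector;
    ; vmovq already performs this zero extension of the low quadword,
    ; so no vxorpd + vblendpd pair is needed.
    %r = shufflevector <2 x double> %v, <2 x double> zeroinitializer, <2 x i32> <i32 0, i32 3>
    ret <2 x double> %r
  }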
Cheers,
Q.
>
> As I said, I will try to create a smaller reproducer for each of the
> problems I found.
> I hope this helps. I will keep testing.
>
> Thanks,
> Andrea