[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Sat Sep 20 07:12:39 PDT 2014

On 19 Sep 2014, at 21:22, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> 2.  There are cases where we no longer fold a vector load in one of
> the operands of a shuffle.
> This is an example:
> 
>     vmovaps  320(%rsp), %xmm0
>     vshufps $-27, %xmm0, %xmm0, %xmm0    # %xmm0 = %xmm0[1,1,2,3]
> 
> Before, we used to emit the following sequence:
>     # 16-byte Folded reload.
>     vpshufd $1, 320(%rsp), %xmm0      # %xmm0 = mem[1,0,0,0]
> 
> Note: the shuffle masks are different but both valid because the
> upper bits of %xmm0 are unused. Later on, the code uses register
> %xmm0 in a 'vcvtss2sd' instruction, so only the lower 32 bits of
> %xmm0 have a meaning in this context.
> As for 1., I'll try to create a small reproducer.

Hi Andrea / Chandler / Quentin,

If AVX is available, I would expect the vpermilps/vpermilpd instructions to be used for all single-vector float/double shuffles, especially since they can also handle the folded-load case. This would avoid the integer/float execution domain transfer issue that comes with using vpshufd.
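
To make the two masks quoted above easier to compare, here is a minimal sketch that decodes an 8-bit shuffle immediate into its four 2-bit lane selectors. The helper name decode_imm8 is made up for illustration and is not part of LLVM. It assumes the usual encoding where bits [1:0] select the source element for lane 0, bits [3:2] for lane 1, and so on, and it treats vshufps with both sources equal as a plain single-vector permute.

```python
def decode_imm8(imm):
    """Return the four 2-bit lane selectors of an 8-bit shuffle immediate.

    Hypothetical helper for illustration only; not an LLVM API.
    """
    # Assemblers accept signed immediates such as -27; keep only the low 8 bits.
    imm &= 0xFF
    # Lane i takes the source element named by bits [2i+1 : 2i].
    return [(imm >> (2 * lane)) & 0x3 for lane in range(4)]


old_mask = decode_imm8(1)    # vpshufd $1, 320(%rsp), %xmm0
new_mask = decode_imm8(-27)  # vshufps $-27, %xmm0, %xmm0, %xmm0

print(old_mask)
print(new_mask)
# Only lane 0 feeds the later vcvtss2sd, so this is the lane that must agree.
print(old_mask[0] == new_mask[0])
```

Both immediates place source element 1 in lane 0, which is the only lane the later vcvtss2sd reads, so the two sequences are interchangeable in this context even though the upper lanes differ.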

Thanks, Simon.