[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

Chandler Carruth chandlerc at gmail.com
Wed Jan 28 16:47:32 PST 2015


On Wed, Jan 28, 2015 at 4:05 PM, Ahmed Bougacha <ahmed.bougacha at gmail.com>
wrote:

> Hi Chandler,
>
> I've been looking at the regressions Quentin mentioned, and filed a PR
> for the most egregious one: http://llvm.org/bugs/show_bug.cgi?id=22377
>
> As for the others, I'm working on reducing them, but for now, here are
> some raw observations, in case any of it rings a bell:
>

Very cool, and thanks for the analysis!


>
>
> Another problem I'm seeing is that in some cases we can't fold memory
> anymore:
>     vpermilps     $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
>     vblendps      $0x1, %xmm2, %xmm0, %xmm0
> becomes:
>     vmovaps       -0xXX(%rdx), %xmm2
>     vshufps       $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
>     vshufps       $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = xmm3[0,2],xmm0[1,2]
>
>
> Also, I see differences that I'm a bit conflicted about when some loads
> are shuffled:
>     vmovaps       -0xXX(%rbp), %xmm3
>     ...
>     vinsertps     $0xc0, %xmm4, %xmm3, %xmm5 ## xmm5 = xmm4[3],xmm3[1,2,3]
> becomes:
>     vpermilps     $-0x6d, -0xXX(%rbp), %xmm2 ## xmm2 = mem[3,0,1,2]
>     ...
>     vinsertps     $0xc0, %xmm4, %xmm2, %xmm2 ## xmm2 = xmm4[3],xmm2[1,2,3]
>
> Note that the second version does the shuffle in-place, in xmm2.
>
>
> Some are blends (har har) of those two:
>     vpermilps     $-0x6d, %xmm_mem_1, %xmm6 ## xmm6 = xmm_mem_1[3,0,1,2]
>     vpermilps     $-0x6d, -0xXX(%rax), %xmm1 ## xmm1 = mem_2[3,0,1,2]
>     vblendps      $0x1, %xmm1, %xmm6, %xmm0 ## xmm0 = xmm1[0],xmm6[1,2,3]
> becomes:
>     vmovaps       -0xXX(%rax), %xmm0 ## %xmm0 = mem_2[0,1,2,3]
>     vpermilps     $-0x6d, %xmm0, %xmm1 ## xmm1 = xmm0[3,0,1,2]
>     vshufps       $0x3, %xmm_mem_1, %xmm0, %xmm0 ## xmm0 = xmm0[3,0],xmm_mem_1[0,0]
>     vshufps       $-0x68, %xmm_mem_1, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm_mem_1[1,2]
>
>
> I also see a lot of domain changes that are somewhat neutral (focusing on
> Haswell for now), such as the following (xmm5 and xmm0 are initially
> integer, and are dead after the store):
>     vpshufd       $-0x5c, %xmm0, %xmm0    ## xmm0 = xmm0[0,1,2,2]
>     vpalignr      $0xc, %xmm0, %xmm5, %xmm0 ## xmm0 = xmm0[12,13,14,15],xmm5[0,1,2,3,4,5,6,7,8,9,10,11]
>     vmovdqu       %xmm0, 0x20(%rax)
> turning into:
>     vshufps       $0x2, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[2,0],xmm5[0,0]
>     vshufps       $-0x68, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm5[1,2]
>     vmovups       %xmm0, 0x20(%rax)
>

All of these stem from what I think is the same core weakness of the
current algorithm: we prefer the fully general shufps+shufps 4-way
shuffle/blend far too often. Here is how I would more precisely classify
the two things we're missing:

- Check if either input is "in place" and we can do a fast single-input
shuffle with a fixed blend.
- Check if we can form a rotation and use palignr to finish a shuffle/blend.

There may be other patterns we're missing, but these two seem to jump out
based on your analysis, and may be fairly easy to tackle.
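
To make that concrete, here's a rough intrinsics-level sketch of the
sequences I'd like us to prefer (just an illustration, not the actual
lowering code; a and b stand in for arbitrary inputs):

#include <immintrin.h>

/* Pattern 1: one input is already "in place", so a single-input shuffle
   (vpermilps) plus a fixed blend (vblendps) suffices instead of the
   generic shufps+shufps sequence.
   Result lanes: { b[3], a[1], a[2], a[3] }, i.e. the same mask as the
   vpermilps $-0x6d + vblendps $0x1 case above. */
static __m128 shuffle_then_blend(__m128 a, __m128 b) {
  __m128 brot = _mm_permute_ps(b, _MM_SHUFFLE(2, 1, 0, 3)); /* b[3,0,1,2] */
  return _mm_blend_ps(a, brot, 0x1); /* lane 0 from brot, lanes 1-3 from a */
}

/* Pattern 2: the result is a rotation of the concatenated inputs, so a
   single palignr finishes the shuffle/blend.
   Result lanes: { b[3], a[0], a[1], a[2] }, i.e. the vpalignr $0xc case. */
static __m128i rotate_concat(__m128i a, __m128i b) {
  /* shift the 32-byte concatenation a:b right by 12 bytes */
  return _mm_alignr_epi8(a, b, 12);
}

Pattern 1 also composes with load folding: when b comes from memory, the
vpermilps can take the memory operand directly, which is exactly the
folding we lose in the first regression you showed.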