[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

Ahmed Bougacha ahmed.bougacha at gmail.com
Thu Jan 29 11:50:24 PST 2015


On Wed, Jan 28, 2015 at 4:47 PM, Chandler Carruth <chandlerc at gmail.com>
wrote:

>
> On Wed, Jan 28, 2015 at 4:05 PM, Ahmed Bougacha <ahmed.bougacha at gmail.com>
> wrote:
>
>> Hi Chandler,
>>
>> I've been looking at the regressions Quentin mentioned, and filed a PR
>> for the most egregious one: http://llvm.org/bugs/show_bug.cgi?id=22377
>>
>> As for the others, I'm working on reducing them, but for now, here are
>> some raw observations, in case any of them rings a bell:
>>
>
> Very cool, and thanks for the analysis!
>
>
>>
>>
>> Another problem I'm seeing is that in some cases we can no longer
>> fold memory operands:
>>     vpermilps     $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
>>     vblendps      $0x1, %xmm2, %xmm0, %xmm0
>> becomes:
>>     vmovaps       -0xXX(%rdx), %xmm2
>>     vshufps       $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
>>     vshufps       $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = xmm3[0,2],xmm0[1,2]
>>
>>
>> Also, I see differences when some loads are shuffled, which I'm a
>> bit conflicted about:
>>     vmovaps       -0xXX(%rbp), %xmm3
>>     ...
>>     vinsertps     $0xc0, %xmm4, %xmm3, %xmm5 ## xmm5 = xmm4[3],xmm3[1,2,3]
>> becomes:
>>     vpermilps     $-0x6d, -0xXX(%rbp), %xmm2 ## xmm2 = mem[3,0,1,2]
>>     ...
>>     vinsertps     $0xc0, %xmm4, %xmm2, %xmm2 ## xmm2 = xmm4[3],xmm2[1,2,3]
>>
>> Note that the second version does the shuffle in-place, in xmm2.
>>
>>
>> Some are blends (har har) of those two:
>>     vpermilps     $-0x6d, %xmm_mem_1, %xmm6 ## xmm6 = xmm_mem_1[3,0,1,2]
>>     vpermilps     $-0x6d, -0xXX(%rax), %xmm1 ## xmm1 = mem_2[3,0,1,2]
>>     vblendps      $0x1, %xmm1, %xmm6, %xmm0 ## xmm0 = xmm1[0],xmm6[1,2,3]
>> becomes:
>>     vmovaps       -0xXX(%rax), %xmm0 ## %xmm0 = mem_2[0,1,2,3]
>>     vpermilps     $-0x6d, %xmm0, %xmm1 ## xmm1 = xmm0[3,0,1,2]
>>     vshufps       $0x3, %xmm_mem_1, %xmm0, %xmm0 ## xmm0 = xmm0[3,0],xmm_mem_1[0,0]
>>     vshufps       $-0x68, %xmm_mem_1, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm_mem_1[1,2]
>>
>>
>> I also see a lot of somewhat neutral (focusing on Haswell for now)
>> domain changes, such as this one (xmm5 and xmm0 initially hold
>> integers, and both are dead after the store):
>>     vpshufd       $-0x5c, %xmm0, %xmm0    ## xmm0 = xmm0[0,1,2,2]
>>     vpalignr      $0xc, %xmm0, %xmm5, %xmm0 ## xmm0 = xmm0[12,13,14,15],xmm5[0,1,2,3,4,5,6,7,8,9,10,11]
>>     vmovdqu       %xmm0, 0x20(%rax)
>> turning into:
>>     vshufps       $0x2, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[2,0],xmm5[0,0]
>>     vshufps       $-0x68, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm5[1,2]
>>     vmovups       %xmm0, 0x20(%rax)
>>
>
> All of these stem from what I think is the same core weakness of the
> current algorithm: we prefer the fully general shufps+shufps 4-way
> shuffle/blend far too often. Here is how I would more precisely classify
> the two missing pieces:
>
> - Check if either input is "in place" and we can do a fast single-input
> shuffle with a fixed blend.
>

I believe this would be http://llvm.org/bugs/show_bug.cgi?id=22390
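
For reference, that case (and the memory-folding one above) boils down
to IR along these lines -- a rough sketch, with the mask read off the
asm comments and a made-up function name:

    define <4 x float> @blend_rotated_load(<4 x float> %a, <4 x float>* %p) {
      ; Made-up reproducer: lane 0 is b[3], lanes 1-3 keep %a in place.
      ; Ideally this lowers to vpermilps (with the load of %b folded)
      ; + vblendps, not vmovaps + 2x vshufps.
      %b = load <4 x float>* %p, align 16
      %s = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 7, i32 1, i32 2, i32 3>
      ret <4 x float> %s
    }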


> - Check if we can form a rotation and use palignr to finish a shuffle/blend
>

... and this would be http://llvm.org/bugs/show_bug.cgi?id=22391
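
Likewise, the integer-domain example above reduces to something like
this (again a sketch, with the mask reconstructed from the asm
comments and a made-up name):

    define void @rotate_then_store(<4 x i32> %x, <4 x i32> %y, <4 x i32>* %p) {
      ; Made-up reproducer: x[2] followed by y[0,1,2].  A vpshufd to
      ; put x[2] in the top lane plus a vpalignr by 12 bytes stays in
      ; the integer domain; the 2x vshufps version crosses into the
      ; float domain before an integer store.
      %s = shufflevector <4 x i32> %x, <4 x i32> %y, <4 x i32> <i32 2, i32 4, i32 5, i32 6>
      store <4 x i32> %s, <4 x i32>* %p, align 1
      ret void
    }

Feeding these to llc with -mcpu=core-avx2 should make the differences
easy to see.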

I think this about covers the Haswell regressions I'm seeing.  Now for some
pre-AVX fun!

-Ahmed


> There may be other patterns we're missing, but these two seem to jump out
> based on your analysis, and may be fairly easy to tackle.
>