[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

Chandler Carruth chandlerc at gmail.com
Fri Jan 30 11:23:36 PST 2015


I may get to one or two in the next month, but not more than that; I'm
focused on the pass manager for now. If no one gets there first, I'll
eventually circle back, though, so they won't rot forever.
On Jan 30, 2015 11:21 AM, "Ahmed Bougacha" <ahmed.bougacha at gmail.com> wrote:

> I filed a couple more, in case they're actually different issues:
> - http://llvm.org/bugs/show_bug.cgi?id=22412
> - http://llvm.org/bugs/show_bug.cgi?id=22413
>
> And that's pretty much it for internal changes.  I'm fine with flipping
> the switch; Quentin, are you?
> Also, just so I have an idea: do you (or someone else!) plan to tackle
> these in the near future?
>
> -Ahmed
>
> On Thu, Jan 29, 2015 at 11:50 AM, Ahmed Bougacha <ahmed.bougacha at gmail.com> wrote:
>
>>
>> On Wed, Jan 28, 2015 at 4:47 PM, Chandler Carruth <chandlerc at gmail.com>
>> wrote:
>>
>>>
>>> On Wed, Jan 28, 2015 at 4:05 PM, Ahmed Bougacha <ahmed.bougacha at gmail.com> wrote:
>>>
>>>> Hi Chandler,
>>>>
>>>> I've been looking at the regressions Quentin mentioned, and filed a PR
>>>> for the most egregious one: http://llvm.org/bugs/show_bug.cgi?id=22377
>>>>
>>>> As for the others, I'm working on reducing them, but for now, here are
>>>> some raw observations, in case any of it rings a bell:
>>>>
>>>
>>> Very cool, and thanks for the analysis!
>>>
>>>
>>>>
>>>>
>>>> Another problem I'm seeing is that in some cases we can't fold memory
>>>> operands anymore:
>>>>     vpermilps     $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
>>>>     vblendps      $0x1, %xmm2, %xmm0, %xmm0
>>>> becomes:
>>>>     vmovaps       -0xXX(%rdx), %xmm2
>>>>     vshufps       $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
>>>>     vshufps       $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = xmm3[0,2],xmm0[1,2]
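>>>>
>>>> To make that concrete, here is roughly what the two sequences
>>>> correspond to in C intrinsics (an illustrative sketch only, built with
>>>> AVX enabled; masks are taken from the asm comments above, and the
>>>> function names are made up):
>>>>
>>>>     #include <immintrin.h>
>>>>
>>>>     // Folded form: vpermilps can take its source straight from memory.
>>>>     __m128 folded(const float *p, __m128 acc) {
>>>>       __m128 t = _mm_permute_ps(_mm_load_ps(p), 0x93); // mem[3,0,1,2]
>>>>       return _mm_blend_ps(acc, t, 0x1);                // t[0],acc[1,2,3]
>>>>     }
>>>>
>>>>     // Unfolded form: vshufps can only fold a load into its second
>>>>     // source, but here the loaded value is needed as the first, so
>>>>     // the load stays a separate vmovaps.
>>>>     __m128 unfolded(const float *p, __m128 acc) {
>>>>       __m128 t = _mm_load_ps(p);               // vmovaps
>>>>       __m128 s = _mm_shuffle_ps(t, acc, 0x03); // t[3,0],acc[0,0]
>>>>       return _mm_shuffle_ps(s, acc, 0x98);     // s[0,2],acc[1,2]
>>>>     }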
>>>>
>>>>
>>>> Also, I see differences that I'm a bit conflicted about when some
>>>> loads are shuffled:
>>>>     vmovaps       -0xXX(%rbp), %xmm3
>>>>     ...
>>>>     vinsertps     $0xc0, %xmm4, %xmm3, %xmm5 ## xmm5 = xmm4[3],xmm3[1,2,3]
>>>> becomes:
>>>>     vpermilps     $-0x6d, -0xXX(%rbp), %xmm2 ## xmm2 = mem[3,0,1,2]
>>>>     ...
>>>>     vinsertps     $0xc0, %xmm4, %xmm2, %xmm2 ## xmm2 = xmm4[3],xmm2[1,2,3]
>>>>
>>>> Note that the second version does the shuffle in-place, in xmm2.
>>>>
>>>>
>>>> Some are blends (har har) of those two:
>>>>     vpermilps     $-0x6d, %xmm_mem_1, %xmm6 ## xmm6 = xmm_mem_1[3,0,1,2]
>>>>     vpermilps     $-0x6d, -0xXX(%rax), %xmm1 ## xmm1 = mem_2[3,0,1,2]
>>>>     vblendps      $0x1, %xmm1, %xmm6, %xmm0 ## xmm0 = xmm1[0],xmm6[1,2,3]
>>>> becomes:
>>>>     vmovaps       -0xXX(%rax), %xmm0 ## %xmm0 = mem_2[0,1,2,3]
>>>>     vpermilps     $-0x6d, %xmm0, %xmm1 ## xmm1 = xmm0[3,0,1,2]
>>>>     vshufps       $0x3, %xmm_mem_1, %xmm0, %xmm0 ## xmm0 = xmm0[3,0],xmm_mem_1[0,0]
>>>>     vshufps       $-0x68, %xmm_mem_1, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm_mem_1[1,2]
>>>>
>>>>
>>>> I also see a lot of somewhat neutral (focusing on Haswell for now)
>>>> domain changes, such as the following (xmm5 and xmm0 initially hold
>>>> integers, and both are dead after the store):
>>>>     vpshufd       $-0x5c, %xmm0, %xmm0    ## xmm0 = xmm0[0,1,2,2]
>>>>     vpalignr      $0xc, %xmm0, %xmm5, %xmm0 ## xmm0 = xmm0[12,13,14,15],xmm5[0,1,2,3,4,5,6,7,8,9,10,11]
>>>>     vmovdqu       %xmm0, 0x20(%rax)
>>>> turning into:
>>>>     vshufps       $0x2, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[2,0],xmm5[0,0]
>>>>     vshufps       $-0x68, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm5[1,2]
>>>>     vmovups       %xmm0, 0x20(%rax)
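>>>>
>>>> Both sequences compute the same dword shuffle, (xmm0[2], xmm5[0,1,2]);
>>>> the difference is just which execution domain the shuffles run in.
>>>> Roughly, in intrinsics (again a sketch with made-up names):
>>>>
>>>>     #include <immintrin.h>
>>>>
>>>>     // Integer domain: vpshufd + vpalignr (+ vmovdqu for the store).
>>>>     __m128i int_domain(__m128i x0, __m128i x5) {
>>>>       __m128i t = _mm_shuffle_epi32(x0, 0xA4); // x0[0,1,2,2]
>>>>       return _mm_alignr_epi8(x5, t, 12);       // t[3],x5[0,1,2]
>>>>     }
>>>>
>>>>     // FP domain: two vshufps (+ vmovups), operating on integer data.
>>>>     __m128i fp_domain(__m128i x0i, __m128i x5i) {
>>>>       __m128 x0 = _mm_castsi128_ps(x0i);
>>>>       __m128 x5 = _mm_castsi128_ps(x5i);
>>>>       __m128 t = _mm_shuffle_ps(x0, x5, 0x02);              // x0[2,0],x5[0,0]
>>>>       return _mm_castps_si128(_mm_shuffle_ps(t, x5, 0x98)); // t[0,2],x5[1,2]
>>>>     }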
>>>>
>>>
>>> All of these stem from what I think is the same core weakness of the
>>> current algorithm: we prefer the fully general shufps+shufps 4-way
>>> shuffle/blend far too often. Here is how I would more precisely classify
>>> the two things missing here:
>>>
>>> - Check if either input is "in place" and we can do a fast
>>> single-input shuffle with a fixed blend.
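>>>
>>> For example (an illustrative sketch with an arbitrary mask): a shuffle
>>> like <7,1,2,3> takes only one element from its second input, so instead
>>> of two dependent vshufps it can be one vpermilps plus one vblendps:
>>>
>>>     #include <immintrin.h>
>>>
>>>     // dst = (b[3], a[1], a[2], a[3]): rotate b in place, then do a
>>>     // fixed single-lane blend; 'a' is already "in place".
>>>     __m128 perm_blend(__m128 a, __m128 b) {
>>>       __m128 t = _mm_permute_ps(b, 0x93); // b[3,0,1,2]
>>>       return _mm_blend_ps(a, t, 0x1);     // t[0],a[1,2,3]
>>>     }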
>>>
>>
>> I believe this would be http://llvm.org/bugs/show_bug.cgi?id=22390
>>
>>
>>> - Check if we can form a rotation and use palignr to finish a
>>> shuffle/blend
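>>>
>>> For example (sketch again): a shuffle like <3,4,5,6> is just a byte
>>> rotation of the concatenated inputs, which palignr does in one
>>> instruction:
>>>
>>>     #include <immintrin.h>
>>>
>>>     // dst = (a[3], b[0], b[1], b[2]) as dwords, i.e. bytes 12..27
>>>     // of the 32-byte pair b:a.
>>>     __m128i rotate(__m128i a, __m128i b) {
>>>       return _mm_alignr_epi8(b, a, 12); // b supplies the high half
>>>     }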
>>>
>>
>> .. and this would be  http://llvm.org/bugs/show_bug.cgi?id=22391
>>
>> I think this about covers the Haswell regressions I'm seeing.  Now for
>> some pre-AVX fun!
>>
>> -Ahmed
>>
>>
>>> There may be other patterns we're missing, but these two seem to jump
>>> out based on your analysis, and may be fairly easy to tackle.
>>>
>>
>>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>