[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

Ahmed Bougacha ahmed.bougacha at gmail.com
Fri Jan 30 11:25:44 PST 2015


On Fri, Jan 30, 2015 at 11:23 AM, Chandler Carruth <chandlerc at gmail.com>
wrote:

> I may get one or two in the next month, but not more than that. Focused on
> the pass manager for now. If no one gets there first, I'll eventually circle
> back, though, so they won't rot forever.
>
Alright, I'll give it a try in the next few weeks as well.

-Ahmed


> On Jan 30, 2015 11:21 AM, "Ahmed Bougacha" <ahmed.bougacha at gmail.com>
> wrote:
>
>> I filed a couple more, in case they're actually different issues:
>> - http://llvm.org/bugs/show_bug.cgi?id=22412
>> - http://llvm.org/bugs/show_bug.cgi?id=22413
>>
>> And that's pretty much it for internal changes.  I'm fine with flipping
>> the switch; Quentin, are you?
>> Also, just to get an idea: do you (or someone else!) plan to tackle
>> these in the near future?
>>
>> -Ahmed
>>
>> On Thu, Jan 29, 2015 at 11:50 AM, Ahmed Bougacha <
>> ahmed.bougacha at gmail.com> wrote:
>>
>>>
>>> On Wed, Jan 28, 2015 at 4:47 PM, Chandler Carruth <chandlerc at gmail.com>
>>> wrote:
>>>
>>>>
>>>> On Wed, Jan 28, 2015 at 4:05 PM, Ahmed Bougacha <
>>>> ahmed.bougacha at gmail.com> wrote:
>>>>
>>>>> Hi Chandler,
>>>>>
>>>>> I've been looking at the regressions Quentin mentioned, and filed a PR
>>>>> for the most egregious one: http://llvm.org/bugs/show_bug.cgi?id=22377
>>>>>
>>>>> As for the others, I'm working on reducing them, but for now, here are
>>>>> some raw observations, in case any of it rings a bell:
>>>>>
>>>>
>>>> Very cool, and thanks for the analysis!
>>>>
>>>>
>>>>>
>>>>>
>>>>> Another problem I'm seeing is that in some cases we can't fold memory
>>>>> anymore:
>>>>>     vpermilps     $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
>>>>>     vblendps      $0x1, %xmm2, %xmm0, %xmm0
>>>>> becomes:
>>>>>     vmovaps       -0xXX(%rdx), %xmm2
>>>>>     vshufps       $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
>>>>>     vshufps       $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = xmm3[0,2],xmm0[1,2]
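>>>>>
>>>>> To make that one concrete, the shuffle involved is roughly the
>>>>> following (hand-reduced, not the exact test case, with made-up names):
>>>>>
>>>>>     define <4 x float> @fold_example(<4 x float> %x, <4 x float>* %p) {
>>>>>       %m = load <4 x float>* %p, align 16
>>>>>       %r = shufflevector <4 x float> %x, <4 x float> %m,
>>>>>                          <4 x i32> <i32 7, i32 1, i32 2, i32 3>
>>>>>       ret <4 x float> %r
>>>>>     }
>>>>>
>>>>> Lane 0 comes from element 3 of the load and the other lanes stay in
>>>>> place, which is why the first sequence can fold the load into the
>>>>> vpermilps ($-0x6d is 0x93, i.e. [3,0,1,2]) and finish with a
>>>>> single-lane vblendps.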
>>>>>
>>>>>
>>>>> Also, I see differences when some loads are shuffled, which I'm a bit
>>>>> conflicted about:
>>>>>     vmovaps       -0xXX(%rbp), %xmm3
>>>>>     ...
>>>>>     vinsertps     $0xc0, %xmm4, %xmm3, %xmm5 ## xmm5 = xmm4[3],xmm3[1,2,3]
>>>>> becomes:
>>>>>     vpermilps     $-0x6d, -0xXX(%rbp), %xmm2 ## xmm2 = mem[3,0,1,2]
>>>>>     ...
>>>>>     vinsertps     $0xc0, %xmm4, %xmm2, %xmm2 ## xmm2 = xmm4[3],xmm2[1,2,3]
>>>>>
>>>>> Note that the second version does the shuffle in-place, in xmm2.
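>>>>>
>>>>> (To decode the immediates: in vinsertps $0xc0, bits [7:6] = 3 select
>>>>> the source element and bits [5:4] = 0 select the destination lane, so
>>>>> it writes xmm4[3] into lane 0 and keeps the other three lanes; the
>>>>> vpermilps immediate $-0x6d is 0x93, read two bits per lane starting
>>>>> from lane 0, i.e. [3,0,1,2].)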
>>>>>
>>>>>
>>>>> Some are blends (har har) of those two:
>>>>>     vpermilps     $-0x6d, %xmm_mem_1, %xmm6 ## xmm6 = xmm_mem_1[3,0,1,2]
>>>>>     vpermilps     $-0x6d, -0xXX(%rax), %xmm1 ## xmm1 = mem_2[3,0,1,2]
>>>>>     vblendps      $0x1, %xmm1, %xmm6, %xmm0 ## xmm0 = xmm1[0],xmm6[1,2,3]
>>>>> becomes:
>>>>>     vmovaps       -0xXX(%rax), %xmm0 ## %xmm0 = mem_2[0,1,2,3]
>>>>>     vpermilps     $-0x6d, %xmm0, %xmm1 ## xmm1 = xmm0[3,0,1,2]
>>>>>     vshufps       $0x3, %xmm_mem_1, %xmm0, %xmm0 ## xmm0 = xmm0[3,0],xmm_mem_1[0,0]
>>>>>     vshufps       $-0x68, %xmm_mem_1, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm_mem_1[1,2]
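>>>>>
>>>>> (Both sequences end up computing the same vector, xmm0 = [mem_2[3],
>>>>> mem_1[0], mem_1[1], mem_1[2]]; the second one just can no longer fold
>>>>> the mem_2 load and needs the extra vmovaps.)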
>>>>>
>>>>>
>>>>> I also see a lot of somewhat neutral (focusing on Haswell for now)
>>>>> domain changes, such as the following (xmm5 and xmm0 initially hold
>>>>> integers, and are dead after the store):
>>>>>     vpshufd       $-0x5c, %xmm0, %xmm0    ## xmm0 = xmm0[0,1,2,2]
>>>>>     vpalignr      $0xc, %xmm0, %xmm5, %xmm0 ## xmm0 = xmm0[12,13,14,15],xmm5[0,1,2,3,4,5,6,7,8,9,10,11]
>>>>>     vmovdqu       %xmm0, 0x20(%rax)
>>>>> turning into:
>>>>>     vshufps       $0x2, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[2,0],xmm5[0,0]
>>>>>     vshufps       $-0x68, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm5[1,2]
>>>>>     vmovups       %xmm0, 0x20(%rax)
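>>>>>
>>>>> That one is roughly (again hand-reduced, with made-up names):
>>>>>
>>>>>     define void @store_shuffle(<4 x i32> %a, <4 x i32> %b, <4 x i32>* %p) {
>>>>>       %s = shufflevector <4 x i32> %a, <4 x i32> %b,
>>>>>                          <4 x i32> <i32 2, i32 4, i32 5, i32 6>
>>>>>       store <4 x i32> %s, <4 x i32>* %p, align 4
>>>>>       ret void
>>>>>     }
>>>>>
>>>>> The data is integer, but the second sequence does the shuffle and the
>>>>> store in the FP domain (shufps/movups), which is why I'm only calling
>>>>> it "somewhat neutral" on Haswell.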
>>>>>
>>>>
>>>> All of these stem from what I think is the same core weakness of the
>>>> current algorithm: we prefer the fully general shufps+shufps 4-way
>>>> shuffle/blend far too often. Here is how I would more precisely classify
>>>> the two things missing here:
>>>>
>>>> - Check if either input is "in place" and we can do a fast
>>>> single-input shuffle with a fixed blend.
>>>>
>>>
>>> I believe this would be http://llvm.org/bugs/show_bug.cgi?id=22390
>>>
>>>
>>>> - Check if we can form a rotation and use palignr to finish a
>>>> shuffle/blend
>>>>
>>>
>>> ... and this would be http://llvm.org/bugs/show_bug.cgi?id=22391
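>>>
>>> (For example, I'd expect a rotation-like mask such as
>>>
>>>     define <4 x i32> @rotate(<4 x i32> %a, <4 x i32> %b) {
>>>       %r = shufflevector <4 x i32> %a, <4 x i32> %b,
>>>                          <4 x i32> <i32 1, i32 2, i32 3, i32 4>
>>>       ret <4 x i32> %r
>>>     }
>>>
>>> to come out as a single palignr $4 of the two inputs instead of two
>>> shufps -- this is a made-up example, not one of the test cases.)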
>>>
>>> I think this about covers the Haswell regressions I'm seeing.  Now for
>>> some pre-AVX fun!
>>>
>>> -Ahmed
>>>
>>>
>>>> There may be other patterns we're missing, but these two seem to jump
>>>> out based on your analysis, and may be fairly easy to tackle.
>>>>
>>>
>>>
>>
>>
>>

