[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Quentin Colombet qcolombet at apple.com
Wed Sep 17 12:51:19 PDT 2014


Hi Chandler,

Yet another test case :).
We use two shuffles instead of a single palignr.

To reproduce:
llc -x86-experimental-vector-shuffle-lowering=true missing_palign.ll -mcpu=core2
llc -x86-experimental-vector-shuffle-lowering=false missing_palign.ll -mcpu=core2

You can replace -mcpu=core2 with -mattr=+ssse3.
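
For reference, the missed pattern is roughly of this shape (a sketch; the attached missing_palign.ll is the actual reproducer):

;;;
; A rotate across the concatenation of the two inputs;
; on SSSE3 this should be a single palignr.
define <4 x i32> @rotate(<4 x i32> %A, <4 x i32> %B) {
 %1 = shufflevector <4 x i32> %A, <4 x i32> %B,
      <4 x i32> <i32 1, i32 2, i32 3, i32 4>
 ret <4 x i32> %1
}
;;;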

Q.

On Sep 17, 2014, at 11:10 AM, Quentin Colombet <qcolombet at apple.com> wrote:

> Hi Chandler,
> 
> Here is a new test case.
> With the new lowering, we fail to fold a load into the shuffle.
> To reproduce:
> llc -x86-experimental-vector-shuffle-lowering=true missing_folding.ll
> llc -x86-experimental-vector-shuffle-lowering=false missing_folding.ll
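> 
> The pattern is roughly of this shape (a sketch; the attached missing_folding.ll is the actual reproducer):
> 
> ;;;
> ; The load feeding the shuffle should fold into the shuffle
> ; instruction's memory operand instead of needing a separate move.
> define <4 x float> @fold(<4 x float> %A, <4 x float>* %p) {
>  %B = load <4 x float>* %p
>  %1 = shufflevector <4 x float> %A, <4 x float> %B,
>       <4 x i32> <i32 0, i32 1, i32 4, i32 5>
>  ret <4 x float> %1
> }
> ;;;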
> 
> -Quentin
> <missing_folding.ll>
> 
> On Sep 17, 2014, at 10:28 AM, Quentin Colombet <qcolombet at apple.com> wrote:
> 
>> Hi Chandler,
>> 
>> I saw regressions in our internal testing. Some of them are avx/avx2-specific.
>> 
>> Should I send reduced test cases for those, or is it something you haven’t looked at yet and is thus expected?
>> 
>> Anyway, here is the biggest offender. This is avx-specific.
>> 
>> To reproduce:
>> llc -x86-experimental-vector-shuffle-lowering=true -mattr=+avx avx_test_case.ll
>> llc -x86-experimental-vector-shuffle-lowering=false -mattr=+avx avx_test_case.ll
>> 
>> I’ll send more test cases (non-avx-specific ones first) as I reduce the regressions.
>> 
>> Thanks,
>> -Quentin
>> <avx_test_case.ll>
>> 
>> On Sep 15, 2014, at 9:03 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>> 
>>> On Mon, Sep 15, 2014 at 1:57 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>>> Andrea, Quentin:
>>>> 
>>>> Ok, everything for blendps, insertps, movddup, movsldup, movshdup, unpcklps,
>>>> and unpckhps is committed and should generally be working. I've not tested
>>>> it *super* thoroughly (will do this ASAP) so if you run into something
>>>> fishy, don't burn lots of time on it.
>>> 
>>> Ok.
>>> 
>>>> 
>>>> I've also fixed a number of issues I found in the nightly test suite and
>>>> things like gcc-loops. I think there are still a couple of regressions I
>>>> spotted in the nightly test suite, but haven't gotten to them yet.
>>>> 
>>>> I've got very rudimentary support for pblendw finished and committed.
>>>> There is a much more fundamental change that is really needed for
>>>> pblendw support though -- currently, the blend lowering strategy assumes
>>>> this instruction doesn't exist and thus picks a deeply wrong strategy in
>>>> some cases... Not sure how much this is even relevant though.
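>>>> 
>>>> For a concrete shape, a word-granularity blend like this sketch is the
>>>> kind of shuffle a single SSE4.1 pblendw covers:
>>>> 
>>>> ;;;
>>>> ; Alternating words from the two sources; a single pblendw ($0xAA).
>>>> define <8 x i16> @blendw(<8 x i16> %A, <8 x i16> %B) {
>>>>  %1 = shufflevector <8 x i16> %A, <8 x i16> %B,
>>>>       <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
>>>>  ret <8 x i16> %1
>>>> }
>>>> ;;;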
>>>> 
>>>> 
>>>> Anyways, it's almost certainly useful to look into any non-test-suite
>>>> benchmarks you have, or to run the benchmarks on non-Intel hardware. Let
>>>> me know how it goes! So far, with the fixes I've landed recently, I'm
>>>> seeing more improvements than regressions on the nightly test suite. =]
>>> 
>>> Cool!
>>> I'll have a look at it. I will let you know how it goes.
>>> Thanks for working on this :-).
>>> 
>>> -Andrea
>>> 
>>>> 
>>>> -Chandler
>>>> 
>>>> On Wed, Sep 10, 2014 at 3:36 AM, Andrea Di Biagio
>>>> <andrea.dibiagio at gmail.com> wrote:
>>>>> 
>>>>> On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth <chandlerc at google.com>
>>>>> wrote:
>>>>>> Awesome, thanks for all the information!
>>>>>> 
>>>>>> See below:
>>>>>> 
>>>>>> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio
>>>>>> <andrea.dibiagio at gmail.com> wrote:
>>>>>>> 
>>>>>>> You have already mentioned how the new shuffle lowering is missing
>>>>>>> some features; for example, you explicitly said that we currently
>>>>>>> lack SSE4.1 blend support. Unfortunately, this seems to be one of the
>>>>>>> main reasons for the slowdown we are seeing.
>>>>>>> 
>>>>>>> Here is a list of what we found so far that we think is causing most
>>>>>>> of the slowdown:
>>>>>>> 1) shufps is always emitted in cases where we could emit a single
>>>>>>> blendps; in these cases, blendps is preferable because it has better
>>>>>>> reciprocal throughput (this is true on all modern Intel and AMD CPUs).
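>>>>>>> 
>>>>>>> For example (a sketch of such a case):
>>>>>>> ;;;
>>>>>>> ; Low two lanes from %B, high two from %A: a single shufps can do
>>>>>>> ; this, but blendps $3 has better reciprocal throughput.
>>>>>>> define <4 x float> @blend(<4 x float> %A, <4 x float> %B) {
>>>>>>>  %1 = shufflevector <4 x float> %A, <4 x float> %B,
>>>>>>>       <4 x i32> <i32 4, i32 5, i32 2, i32 3>
>>>>>>>  ret <4 x float> %1
>>>>>>> }
>>>>>>> ;;;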
>>>>>> 
>>>>>> 
>>>>>> Yep. I think this is actually super easy. I'll add support for blendps
>>>>>> shortly.
>>>>> 
>>>>> Thanks Chandler!
>>>>> 
>>>>>> 
>>>>>>> 3) When a shuffle performs an insert at index 0, we always generate
>>>>>>> an insertps, even though a movss would do a better job.
>>>>>>> ;;;
>>>>>>> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>>>>>>>  %1 = shufflevector <4 x float> %A, <4 x float> %B,
>>>>>>>       <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>>>>>>>  ret <4 x float> %1
>>>>>>> }
>>>>>>> ;;;
>>>>>>> 
>>>>>>> llc (-mcpu=corei7-avx):
>>>>>>>  vmovss %xmm1, %xmm0, %xmm0
>>>>>>> 
>>>>>>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>>>>>>  vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>>>>>> 
>>>>>> 
>>>>>> So, this is hard. I think we should do this in MC after register
>>>>>> allocation, because movss is the worst instruction ever: it switches
>>>>>> from blending with the destination to zeroing the destination when
>>>>>> the source switches from a register to a memory operand. =[ I would
>>>>>> like to not emit movss in the DAG *ever*, and teach the MC combine
>>>>>> pass to run after register allocation (and thus spills) have been
>>>>>> emitted. This way we can match both patterns: when insertps is
>>>>>> zeroing the other lanes and the operand is from memory, and when
>>>>>> insertps is blending into the other lanes and the operand is in a
>>>>>> register.
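>>>>>> 
>>>>>> A sketch of the two behaviors (illustrative only):
>>>>>> ;;;
>>>>>> ; Register form blends:  movss %xmm1, %xmm0
>>>>>> ;   -> xmm0 = xmm1[0],xmm0[1,2,3]
>>>>>> ; Memory form zeroes:    movss (%rdi), %xmm0
>>>>>> ;   -> xmm0 = mem[0],zero,zero,zero
>>>>>> ; IR of the zeroing flavor, built from a loaded scalar:
>>>>>> define <4 x float> @scalar_load(float* %p) {
>>>>>>  %s = load float* %p
>>>>>>  %1 = insertelement <4 x float> zeroinitializer, float %s, i32 0
>>>>>>  ret <4 x float> %1
>>>>>> }
>>>>>> ;;;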
>>>>>> 
>>>>>> Does that make sense? If so, would you be up for looking at this side of
>>>>>> things? It seems nicely separable.
>>>>> 
>>>>> I think it is a good idea and it makes sense to me.
>>>>> I will start investigating on this and see what can be done.
>>>>> 
>>>>> Cheers,
>>>>> Andrea

-------------- next part --------------
A non-text attachment was scrubbed...
Name: missing_palign.ll
Type: application/octet-stream
Size: 1430 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140917/190d3840/attachment.obj>

