[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Quentin Colombet
qcolombet at apple.com
Wed Sep 17 12:51:19 PDT 2014
Hi Chandler,
Yet another test case :).
We use two shuffles instead of a single palignr.
To reproduce:
llc -x86-experimental-vector-shuffle-lowering=true missing_palign.ll -mcpu=core2
llc -x86-experimental-vector-shuffle-lowering=false missing_palign.ll -mcpu=core2
You can replace -mcpu=core2 with -mattr=+ssse3.
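The missing_palign.ll attachment is not inlined here; a minimal sketch of the
kind of shuffle involved (names and mask are illustrative guesses, not the
actual test case):
;;;
define <4 x i32> @rotate(<4 x i32> %A, <4 x i32> %B) {
  ; A[1],A[2],A[3],B[0] is a rotation across the concatenation of the two
  ; inputs, which a single palignr $4 can produce.
  %1 = shufflevector <4 x i32> %A, <4 x i32> %B,
                     <4 x i32> <i32 1, i32 2, i32 3, i32 4>
  ret <4 x i32> %1
}
;;;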
Q.
On Sep 17, 2014, at 11:10 AM, Quentin Colombet <qcolombet at apple.com> wrote:
> Hi Chandler,
>
> Here is a new test case.
> With the new lowering, we fail to fold a load into the shuffle.
> To reproduce:
> llc -x86-experimental-vector-shuffle-lowering=true missing_folding.ll
> llc -x86-experimental-vector-shuffle-lowering=false missing_folding.ll
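> The attachment is not inlined here; a minimal sketch of the pattern, with
> illustrative names (the real missing_folding.ll may differ):
> ;;;
> define <4 x float> @fold(<4 x float> %A, <4 x float>* %p) {
>   ; The old lowering folds the load into the shuffle's memory operand
>   ; (e.g. movhps (%rdi), %xmm0); the new one emits a separate load first.
>   %B = load <4 x float>* %p
>   %1 = shufflevector <4 x float> %A, <4 x float> %B,
>                      <4 x i32> <i32 0, i32 1, i32 4, i32 5>
>   ret <4 x float> %1
> }
> ;;;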
>
> -Quentin
> <missing_folding.ll>
>
> On Sep 17, 2014, at 10:28 AM, Quentin Colombet <qcolombet at apple.com> wrote:
>
>> Hi Chandler,
>>
>> I saw regressions in our internal testing. Some of them are avx/avx2 specific.
>>
>> Should I send reduced test cases for those, or is it something you haven’t looked at yet and is thus expected?
>>
>> Anyway, here is the biggest offender. This is avx-specific.
>>
>> To reproduce:
>> llc -x86-experimental-vector-shuffle-lowering=true -mattr=+avx avx_test_case.ll
>> llc -x86-experimental-vector-shuffle-lowering=false -mattr=+avx avx_test_case.ll
>>
>> I’ll send more test cases (non-avx-specific ones first) as I reduce the regressions.
>>
>> Thanks,
>> -Quentin
>> <avx_test_case.ll>
>>
>> On Sep 15, 2014, at 9:03 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>
>>> On Mon, Sep 15, 2014 at 1:57 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>>> Andrea, Quentin:
>>>>
>>>> Ok, everything for blendps, insertps, movddup, movsldup, movshdup, unpcklps,
>>>> and unpckhps is committed and should generally be working. I've not tested
>>>> it *super* thoroughly (will do this ASAP) so if you run into something
>>>> fishy, don't burn lots of time on it.
>>>
>>> Ok.
>>>
>>>>
>>>> I've also fixed a number of issues I found in the nightly test suite and
>>>> things like gcc-loops. I think there are still a couple of regressions I
>>>> spotted in the nightly test suite, but haven't gotten to them yet.
>>>>
>>>> I've got very rudimentary support for pblendw finished and committed. There
>>>> is a much more fundamental change that is really needed for pblendw support
>>>> though -- currently, the blend lowering strategy assumes this instruction
>>>> doesn't exist and thus picks a deeply wrong strategy in some cases... Not
>>>> sure how much this is even relevant though.
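>>>> For concreteness, pblendw covers any per-lane select on <8 x i16>; an
>>>> illustrative sketch (not a test case from this thread):
>>>> ;;;
>>>> define <8 x i16> @blendw(<8 x i16> %A, <8 x i16> %B) {
>>>>   ; Every lane keeps its position and only the source alternates,
>>>>   ; so a single pblendw $0xAA suffices.
>>>>   %1 = shufflevector <8 x i16> %A, <8 x i16> %B,
>>>>        <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
>>>>   ret <8 x i16> %1
>>>> }
>>>> ;;;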
>>>>
>>>>
>>>> Anyways, it's almost certainly useful to look into any non-test-suite
>>>> benchmarks you have, or to run the benchmarks on non-intel hardware. Let me
>>>> know how it goes! So far, with the fixes I've landed recently, I'm seeing
>>>> more improvements than regressions on the nightly test suite. =]
>>>
>>> Cool!
>>> I'll have a look at it. I will let you know how it goes.
>>> Thanks for working on this :-).
>>>
>>> -Andrea
>>>
>>>>
>>>> -Chandler
>>>>
>>>> On Wed, Sep 10, 2014 at 3:36 AM, Andrea Di Biagio
>>>> <andrea.dibiagio at gmail.com> wrote:
>>>>>
>>>>> On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth <chandlerc at google.com>
>>>>> wrote:
>>>>>> Awesome, thanks for all the information!
>>>>>>
>>>>>> See below:
>>>>>>
>>>>>> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio
>>>>>> <andrea.dibiagio at gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> You have already mentioned that the new shuffle lowering is missing
>>>>>>> some features; for example, you explicitly said that we currently lack
>>>>>>> SSE4.1 blend support. Unfortunately, this seems to be one of the
>>>>>>> main reasons for the slowdown we are seeing.
>>>>>>>
>>>>>>> Here is a list of what we found so far that we think is causing most
>>>>>>> of the slowdown:
>>>>>>> 1) shufps is always emitted in cases where we could emit a single
>>>>>>> blendps; in these cases, blendps is preferable because it has better
>>>>>>> reciprocal throughput (this is true on all modern Intel and AMD CPUs).
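>>>>>>> A minimal sketch of such a case (illustrative, not taken from our
>>>>>>> actual benchmarks):
>>>>>>> ;;;
>>>>>>> define <4 x float> @blend(<4 x float> %A, <4 x float> %B) {
>>>>>>>   ; Lanes 2 and 3 come from %B, so a single blendps $0xC works and
>>>>>>>   ; has better reciprocal throughput than the equivalent shufps $0xE4.
>>>>>>>   %1 = shufflevector <4 x float> %A, <4 x float> %B,
>>>>>>>                      <4 x i32> <i32 0, i32 1, i32 6, i32 7>
>>>>>>>   ret <4 x float> %1
>>>>>>> }
>>>>>>> ;;;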
>>>>>>
>>>>>>
>>>>>> Yep. I think this is actually super easy. I'll add support for blendps
>>>>>> shortly.
>>>>>
>>>>> Thanks Chandler!
>>>>>
>>>>>>
>>>>>>> 3) When a shuffle performs an insert at index 0, we always generate
>>>>>>> an insertps, even though a movss would do a better job.
>>>>>>> ;;;
>>>>>>> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>>>>>>>   %1 = shufflevector <4 x float> %A, <4 x float> %B,
>>>>>>>                      <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>>>>>>>   ret <4 x float> %1
>>>>>>> }
>>>>>>> ;;;
>>>>>>>
>>>>>>> llc (-mcpu=corei7-avx):
>>>>>>> vmovss %xmm1, %xmm0, %xmm0
>>>>>>>
>>>>>>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>>>>>> vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>>>>>>
>>>>>>
>>>>>> So, this is hard. I think we should do this in MC after register
>>>>>> allocation, because movss is the worst instruction ever: it switches
>>>>>> from blending with the destination to zeroing the destination when
>>>>>> the source switches from a register to a memory operand. =[ I would
>>>>>> like to *never* emit movss in the DAG, and instead teach the MC
>>>>>> combine pass to run after register allocation (and thus after spills
>>>>>> have been emitted). This way we can match both patterns: insertps
>>>>>> zeroing the other lanes when the operand comes from memory, and
>>>>>> insertps blending into the other lanes when the operand is in a
>>>>>> register.
>>>>>>
>>>>>> Does that make sense? If so, would you be up for looking at this side
>>>>>> of things? It seems nicely separable.
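>>>>>> For concreteness, the two patterns in IR form (illustrative only):
>>>>>> ;;;
>>>>>> ; Register form: movss %xmm1, %xmm0 blends lane 0 of %B into %A.
>>>>>> define <4 x float> @movss_reg(<4 x float> %A, <4 x float> %B) {
>>>>>>   %1 = shufflevector <4 x float> %A, <4 x float> %B,
>>>>>>                      <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>>>>>>   ret <4 x float> %1
>>>>>> }
>>>>>> ; Memory form: movss (%rdi), %xmm0 zeroes lanes 1-3 instead.
>>>>>> define <4 x float> @movss_mem(float* %p) {
>>>>>>   %f = load float* %p
>>>>>>   %1 = insertelement <4 x float> zeroinitializer, float %f, i32 0
>>>>>>   ret <4 x float> %1
>>>>>> }
>>>>>> ;;;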
>>>>>
>>>>> I think it is a good idea and it makes sense to me.
>>>>> I will start investigating this and see what can be done.
>>>>>
>>>>> Cheers,
>>>>> Andrea
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: missing_palign.ll
Type: application/octet-stream
Size: 1430 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140917/190d3840/attachment.obj>