[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Quentin Colombet qcolombet at apple.com
Wed Sep 17 11:10:21 PDT 2014


Hi Chandler,

Here is a new test case.
With the new lowering, we fail to fold a load into the shuffle.
To reproduce:
llc -x86-experimental-vector-shuffle-lowering=true missing_folding.ll
llc -x86-experimental-vector-shuffle-lowering=false missing_folding.ll
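
Roughly, the pattern involved is a shuffle whose second operand comes from a load, where the load should be folded into the shuffle's memory operand. A hypothetical sketch of that shape (not the attached file):
;;;
define <4 x float> @fold(<4 x float> %A, <4 x float>* %P) {
  ; the aligned load can fold into the shuffle's memory operand,
  ; e.g. unpcklps (%rdi), %xmm0, instead of being a separate instruction
  %B = load <4 x float>* %P
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
  ret <4 x float> %1
}
;;;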

-Quentin

On Sep 17, 2014, at 10:28 AM, Quentin Colombet <qcolombet at apple.com> wrote:

> Hi Chandler,
> 
> I saw regressions in our internal testing. Some of them are avx/avx2 specific.
> 
> Should I send reduced test cases for those, or is this something you haven’t looked at yet and thus is expected?
> 
> Anyway, here is the biggest offender. This is avx-specific.
> 
> To reproduce:
> llc -x86-experimental-vector-shuffle-lowering=true -mattr=+avx avx_test_case.ll
> llc -x86-experimental-vector-shuffle-lowering=false -mattr=+avx avx_test_case.ll
> 
> I’ll send more test cases (starting with the non-avx-specific ones) as I reduce the regressions.
> 
> Thanks,
> -Quentin
> <avx_test_case.ll>
> 
> On Sep 15, 2014, at 9:03 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> 
>> On Mon, Sep 15, 2014 at 1:57 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>> Andrea, Quentin:
>>> 
>>> Ok, everything for blendps, insertps, movddup, movsldup, movshdup, unpcklps,
>>> and unpckhps is committed and should generally be working. I've not tested
>>> it *super* thoroughly (will do this ASAP) so if you run into something
>>> fishy, don't burn lots of time on it.
>> 
>> Ok.
>> 
>>> 
>>> I've also fixed a number of issues I found in the nightly test suite and
>>> things like gcc-loops. I think there are still a couple of regressions I
>>> spotted in the nightly test suite, but haven't gotten to them yet.
>>> 
>>> I've got very rudimentary support for pblendw finished and committed. There
>>> is a much more fundamental change that is really needed for pblendw support
>>> though -- currently, the blend lowering strategy assumes this instruction
>>> doesn't exist and thus picks a deeply wrong strategy in some cases... Not
>>> sure how much this is even relevant though.
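>>> 
>>> As a rough illustration of the kind of shuffle pblendw should catch (a
>>> hypothetical example, not from the test suite), a mask that keeps every
>>> 16-bit lane in place while alternating sources is a single pblendw:
>>> ;;;
>>> define <8 x i16> @blendw(<8 x i16> %a, <8 x i16> %b) {
>>>   ; lanes 1,3,5,7 from %b, the rest from %a: pblendw $0xAA, %xmm1, %xmm0
>>>   %1 = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
>>>   ret <8 x i16> %1
>>> }
>>> ;;;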
>>> 
>>> 
>>> Anyways, it's almost certainly useful to look into any non-test-suite
>>> benchmarks you have, or to run the benchmarks on non-Intel hardware. Let me
>>> know how it goes! So far, with the fixes I've landed recently, I'm seeing
>>> more improvements than regressions on the nightly test suite. =]
>> 
>> Cool!
>> I'll have a look at it. I will let you know how it goes.
>> Thanks for working on this :-).
>> 
>> -Andrea
>> 
>>> 
>>> -Chandler
>>> 
>>> On Wed, Sep 10, 2014 at 3:36 AM, Andrea Di Biagio
>>> <andrea.dibiagio at gmail.com> wrote:
>>>> 
>>>> On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth <chandlerc at google.com>
>>>> wrote:
>>>>> Awesome, thanks for all the information!
>>>>> 
>>>>> See below:
>>>>> 
>>>>> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio
>>>>> <andrea.dibiagio at gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>> You have already mentioned how the new shuffle lowering is missing
>>>>>> some features; for example, you explicitly said that we currently lack
>>>>>> SSE4.1 blend support. Unfortunately, this seems to be one of the main
>>>>>> reasons for the slowdown we are seeing.
>>>>>> 
>>>>>> Here is a list of what we found so far that we think is causing most
>>>>>> of the slowdown:
>>>>>> 1) shufps is always emitted in cases where we could emit a single
>>>>>> blendps; in these cases, blendps is preferable because it has better
>>>>>> reciprocal throughput (this is true on all modern Intel and AMD CPUs).
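>>>>>> 
>>>>>> As an illustration (a hypothetical case, not one of our benchmarks): any
>>>>>> mask where lane i picks either element i of %A or element i of %B is a
>>>>>> single blendps, e.g.:
>>>>>> ;;;
>>>>>> define <4 x float> @blend(<4 x float> %A, <4 x float> %B) {
>>>>>>   ; lanes 1 and 3 from %B, 0 and 2 from %A: blendps $0xA, %xmm1, %xmm0
>>>>>>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>>>>>>   ret <4 x float> %1
>>>>>> }
>>>>>> ;;;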
>>>>> 
>>>>> 
>>>>> Yep. I think this is actually super easy. I'll add support for blendps
>>>>> shortly.
>>>> 
>>>> Thanks Chandler!
>>>> 
>>>>> 
>>>>>> 3) When a shuffle performs an insert at index 0 we always generate an
>>>>>> insertps, while a movss would do a better job.
>>>>>> ;;;
>>>>>> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>>>>>>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>>>>>>  ret <4 x float> %1
>>>>>> }
>>>>>> ;;;
>>>>>> 
>>>>>> llc (-mcpu=corei7-avx):
>>>>>>  vmovss %xmm1, %xmm0, %xmm0
>>>>>> 
>>>>>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>>>>>  vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>>>>> 
>>>>> 
>>>>> So, this is hard. I think we should do this in MC after register allocation
>>>>> because movss is the worst instruction ever: it switches from blending with
>>>>> the destination to zeroing the destination when the source switches from a
>>>>> register to a memory operand. =[ I would like to not emit movss in the DAG
>>>>> *ever*, and teach the MC combine pass to run after register allocation (and
>>>>> thus spills) have been emitted. This way we can match both patterns: when
>>>>> insertps is zeroing the other lanes and the operand is from memory, and when
>>>>> insertps is blending into the other lanes and the operand is in a register.
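>>>>> 
>>>>> To spell out the two behaviors (hypothetical IR, with the asm each form
>>>>> corresponds to):
>>>>> ;;;
>>>>> ; movss %xmm1, %xmm0 -- lane 0 from %B, lanes 1-3 kept from %A (a blend)
>>>>> define <4 x float> @movss_reg(<4 x float> %A, <4 x float> %B) {
>>>>>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>>>>>   ret <4 x float> %1
>>>>> }
>>>>> 
>>>>> ; movss (%rdi), %xmm0 -- lane 0 loaded, lanes 1-3 zeroed (not a blend)
>>>>> define <4 x float> @movss_mem(float* %p) {
>>>>>   %s = load float* %p
>>>>>   %1 = insertelement <4 x float> zeroinitializer, float %s, i32 0
>>>>>   ret <4 x float> %1
>>>>> }
>>>>> ;;;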
>>>>> 
>>>>> Does that make sense? If so, would you be up for looking at this side of
>>>>> things? It seems nicely separable.
>>>> 
>>>> I think it is a good idea and it makes sense to me.
>>>> I will start investigating on this and see what can be done.
>>>> 
>>>> Cheers,
>>>> Andrea
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-------------- next part --------------
A non-text attachment was scrubbed...
Name: missing_folding.ll
Type: application/octet-stream
Size: 1132 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140917/63f774cd/attachment.obj>

