[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Quentin Colombet qcolombet at apple.com
Wed Sep 17 10:28:20 PDT 2014


Hi Chandler,

I saw regressions in our internal testing. Some of them are AVX/AVX2-specific.

Should I send reduced test cases for those, or is this something you haven’t looked at yet and is thus expected?

Anyway, here is the biggest offender. This one is AVX-specific.

To reproduce:
llc -x86-experimental-vector-shuffle-lowering=true -mattr=+avx avx_test_case.ll
llc -x86-experimental-vector-shuffle-lowering=false -mattr=+avx avx_test_case.ll

I’ll send more test cases (the non-AVX-specific ones first) as I reduce the regressions.

Thanks,
-Quentin

On Sep 15, 2014, at 9:03 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> On Mon, Sep 15, 2014 at 1:57 PM, Chandler Carruth <chandlerc at google.com> wrote:
>> Andrea, Quentin:
>> 
>> Ok, everything for blendps, insertps, movddup, movsldup, movshdup, unpcklps,
>> and unpckhps is committed and should generally be working. I've not tested
>> it *super* thoroughly (will do this ASAP) so if you run into something
>> fishy, don't burn lots of time on it.
> 
> Ok.
> 
>> 
>> I've also fixed a number of issues I found in the nightly test suite and
>> things like gcc-loops. I think there are still a couple of regressions I
>> spotted in the nightly test suite, but haven't gotten to them yet.
>> 
>> I've got very rudimentary support for pblendw finished and committed.
>> There is a much more fundamental change that is really needed for full
>> pblendw support though -- currently, the blend lowering strategy assumes
>> this instruction doesn't exist and thus picks a deeply wrong strategy in
>> some cases... Not sure how relevant this even is, though.
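>> 
>> For reference, a minimal case (a hypothetical example, not from the
>> test suite) that should map to a single pblendw once the lowering
>> knows about it:
>> ;;;
>> define <8 x i16> @word_blend(<8 x i16> %A, <8 x i16> %B) {
>>  %1 = shufflevector <8 x i16> %A, <8 x i16> %B, <8 x i32> <i32 0, i32 9,
>> i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
>>  ret <8 x i16> %1
>> }
>> ;;;
>> Every element stays in its 16-bit lane, so pblendw $0xaa would do the
>> whole job.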
>> 
>> 
>> Anyways, it's almost certainly useful to look into any non-test-suite
>> benchmarks you have, or to run the benchmarks on non-intel hardware. Let me
>> know how it goes! So far, with the fixes I've landed recently, I'm seeing
>> more improvements than regressions on the nightly test suite. =]
> 
> Cool!
> I'll have a look at it. I will let you know how it goes.
> Thanks for working on this :-).
> 
> -Andrea
> 
>> 
>> -Chandler
>> 
>> On Wed, Sep 10, 2014 at 3:36 AM, Andrea Di Biagio
>> <andrea.dibiagio at gmail.com> wrote:
>>> 
>>> On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth <chandlerc at google.com>
>>> wrote:
>>>> Awesome, thanks for all the information!
>>>> 
>>>> See below:
>>>> 
>>>> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio
>>>> <andrea.dibiagio at gmail.com>
>>>> wrote:
>>>>> 
>>>>> You have already mentioned how the new shuffle lowering is missing
>>>>> some features; for example, you explicitly said that we currently
>>>>> lack SSE4.1 blend support. Unfortunately, this seems to be one of
>>>>> the main reasons for the slowdown we are seeing.
>>>>> 
>>>>> Here is a list of what we found so far that we think is causing most
>>>>> of the slowdown:
>>>>> 1) shufps is always emitted in cases where we could emit a single
>>>>> blendps; in these cases, blendps is preferable because it has better
>>>>> reciprocal throughput (this is true on all modern Intel and AMD CPUs).
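>>>>> For illustration, here is a minimal case of this pattern (a
>>>>> hypothetical example, not one of the actual regressions):
>>>>> ;;;
>>>>> define <4 x float> @blend_hi(<4 x float> %A, <4 x float> %B) {
>>>>>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0,
>>>>> i32 1, i32 6, i32 7>
>>>>>  ret <4 x float> %1
>>>>> }
>>>>> ;;;
>>>>> Every element stays in its lane, so this can be lowered either to
>>>>> shufps $0xe4 or to blendps $0xc; both produce A[0],A[1],B[2],B[3],
>>>>> but the blend has the better reciprocal throughput.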
>>>> 
>>>> 
>>>> Yep. I think this is actually super easy. I'll add support for blendps
>>>> shortly.
>>> 
>>> Thanks Chandler!
>>> 
>>>> 
>>>>> 3) When a shuffle performs an insert at index 0, we always generate
>>>>> an insertps, while a movss would do a better job.
>>>>> ;;;
>>>>> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>>>>>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4,
>>>>> i32 1, i32 2, i32 3>
>>>>>  ret <4 x float> %1
>>>>> }
>>>>> ;;;
>>>>> 
>>>>> llc (-mcpu=corei7-avx):
>>>>>  vmovss %xmm1, %xmm0, %xmm0
>>>>> 
>>>>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>>>>  vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>>>> 
>>>> 
>>>> So, this is hard. I think we should do this in MC after register
>>>> allocation, because movss is the worst instruction ever: it switches
>>>> from blending with the destination to zeroing the destination when
>>>> the source switches from a register to a memory operand. =[ I would
>>>> like to not emit movss in the DAG *ever*, and to teach the MC combine
>>>> pass to run after register allocation (and thus spills) have been
>>>> emitted. This way we can match both patterns: when insertps is
>>>> zeroing the other lanes and the operand is from memory, and when
>>>> insertps is blending into the other lanes and the operand is in a
>>>> register.
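>>>> 
>>>> To spell out the two behaviors (AT&T syntax; the operands here are
>>>> only for illustration):
>>>> 
>>>>   movss %xmm1, %xmm0    # xmm0 = xmm1[0],xmm0[1,2,3]   (reg: blend)
>>>>   movss (%rdi), %xmm0   # xmm0 = mem[0],zero,zero,zero (mem: zeroing)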
>>>> 
>>>> Does that make sense? If so, would you be up for looking at this side of
>>>> things? It seems nicely separable.
>>> 
>>> I think it is a good idea and it makes sense to me.
>>> I will start investigating this and see what can be done.
>>> 
>>> Cheers,
>>> Andrea

-------------- next part --------------
A non-text attachment was scrubbed...
Name: avx_test_case.ll
Type: application/octet-stream
Size: 3249 bytes
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140917/b236616b/attachment.obj>

