[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Wed Sep 10 03:36:09 PDT 2014

On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth <chandlerc at google.com> wrote:
> Awesome, thanks for all the information!
>
> See below:
>
> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
> wrote:
>>
>> You have already mentioned how the new shuffle lowering is missing
>> some features; for example, you explicitly said that we currently lack
>> SSE4.1 blend support. Unfortunately, this seems to be one of the
>> main reasons for the slowdown we are seeing.
>>
>> Here is a list of what we found so far that we think is causing most
>> of the slowdown:
>> 1) shufps is always emitted in cases where we could emit a single
>> blendps; in these cases, blendps is preferable because it has better
>> reciprocal throughput (this is true on all modern Intel and AMD CPUs).
>
>
> Yep. I think this is actually super easy. I'll add support for blendps
> shortly.

Thanks Chandler!
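
To make the shufps/blendps case concrete, here is a minimal Python sketch
(not LLVM code; the helper name `blend_immediate` is made up for this
example). It assumes a 4-lane, two-input mask in shufflevector numbering,
where indices 0..3 select from the first input and 4..7 from the second.
A mask is a pure blend when every lane stays in place, and the immediate
then has bit i set whenever lane i comes from the second input:

```python
from typing import Optional, Sequence


def blend_immediate(mask: Sequence[int]) -> Optional[int]:
    """Return a blendps-style immediate for a 4-lane shuffle mask,
    or None if the mask is not a pure blend (hypothetical helper)."""
    if len(mask) != 4:
        raise ValueError("expected a 4-element mask")
    imm = 0
    for lane, idx in enumerate(mask):
        if idx == lane:
            continue              # lane is taken from the first input
        if idx == lane + 4:
            imm |= 1 << lane      # lane is taken from the second input
        else:
            return None           # lane is moved, so a blend cannot express it
    return imm


if __name__ == "__main__":
    print(blend_immediate([0, 5, 2, 7]))   # lanes 1 and 3 from the second input
    print(blend_immediate([1, 0, 2, 3]))   # lanes 0 and 1 swapped, not a blend
```

Under this check, the <4, 1, 2, 3> mask from the @baz example below also
counts as a blend (only lane 0 comes from the second input).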

>
>> 3) When a shuffle performs an insert at index 0 we always generate an
>> insertps, while a movss would do a better job.
>> ;;;
>> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>>   ret <4 x float> %1
>> }
>> ;;;
>>
>> llc (-mcpu=corei7-avx):
>>   vmovss %xmm1, %xmm0, %xmm0
>>
>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>   vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>
>
> So, this is hard. I think we should do this in MC after register allocation
> because movss is the worst instruction ever: it switches from blending with
> the destination to zeroing the destination when the source switches from a
> register to a memory operand. =[ I would like to not emit movss in the DAG
> *ever*, and teach the MC combine pass to run after register allocation (and
> thus after spills have been emitted). This way we can match both patterns: when
> insertps is zeroing the other lanes and the operand is from memory, and when
> insertps is blending into the other lanes and the operand is in a register.
>
> Does that make sense? If so, would you be up for looking at this side of
> things? It seems nicely separable.

I think it is a good idea and it makes sense to me.
I will start investigating this and see what can be done.
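
To check my understanding of the two patterns, here is a small Python
model of the lane semantics (a sketch only, not LLVM code; all function
names are made up for this example). It assumes the usual insertps
immediate layout: bits 7:6 select the source lane (register source only),
bits 5:4 the destination lane, and bits 3:0 are a zero mask applied after
the insert.

```python
from typing import List, Sequence

Vec = List[float]


def movss_reg(dst: Sequence[float], src: Sequence[float]) -> Vec:
    """Register form of movss: lane 0 from src, lanes 1..3 kept from dst."""
    return [src[0], dst[1], dst[2], dst[3]]


def movss_load(scalar: float) -> Vec:
    """Memory form of movss: lane 0 is loaded, lanes 1..3 are zeroed."""
    return [scalar, 0.0, 0.0, 0.0]


def insertps(dst: Sequence[float], src: Sequence[float], imm: int) -> Vec:
    """Register-source insertps, modelled lane by lane."""
    src_lane = (imm >> 6) & 3      # bits 7:6: which source lane to take
    dst_lane = (imm >> 4) & 3      # bits 5:4: where to insert it
    zmask = imm & 0xF              # bits 3:0: lanes to zero afterwards
    out = list(dst)
    out[dst_lane] = src[src_lane]
    return [0.0 if (zmask >> i) & 1 else v for i, v in enumerate(out)]


def insertps_load(dst: Sequence[float], scalar: float, imm: int) -> Vec:
    """Memory-source insertps: the loaded scalar is inserted and
    bits 7:6 of the immediate have no effect."""
    return insertps(dst, [scalar] * 4, imm)


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [5.0, 6.0, 7.0, 8.0]
    # Blending insert from a register matches the register form of movss.
    print(insertps(a, b, 0x00) == movss_reg(a, b))
    # Zeroing insert from memory matches the load form of movss.
    print(insertps_load(a, 9.0, 0x0E) == movss_load(9.0))
```

So, in this model, the post-RA combine would need to recognize an
immediate of 0x00 with a register operand, and 0x0E with a memory operand.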

Cheers,
Andrea