[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Chandler Carruth chandlerc at google.com
Mon Sep 15 05:57:06 PDT 2014


Andrea, Quentin:

Ok, everything for blendps, insertps, movddup, movsldup, movshdup,
unpcklps, and unpckhps is committed and should generally be working. I've
not tested it *super* thoroughly (will do this ASAP) so if you run into
something fishy, don't burn lots of time on it.
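
If you want a quick smoke test of the committed patterns, something along
these lines should exercise them (the function names are just illustrative,
and I haven't double-checked the exact asm we emit for every -mcpu yet):

;;;
; Lane-for-lane blend: lanes 0 and 3 from %A, lanes 1 and 2 from %B.
; No element changes lane, so this is a single-blendps candidate.
define <4 x float> @blend_test(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
  ret <4 x float> %1
}

; Interleave of the low halves -- should come out as a single unpcklps.
define <4 x float> @unpckl_test(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
  ret <4 x float> %1
}
;;;

Run these through llc with -x86-experimental-vector-shuffle-lowering (and
whatever -mcpu you care about) and compare against the default lowering.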

I've also fixed a number of issues I found in the nightly test suite and
things like gcc-loops. I think there are still a couple of regressions I
spotted in the nightly test suite, but haven't gotten to them yet.

I've got very rudimentary support for pblendw finished and committed.
There is a much more fundamental change that is really needed for pblendw
support though -- currently, the blend lowering strategy assumes this
instruction doesn't exist and thus picks a deeply wrong strategy in some
cases... Not sure how much this is even relevant though.
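
For concreteness, the sort of v8i16 blend I have in mind is (test name is
just illustrative):

;;;
; Even words from %a, odd words from %b. With SSE4.1 this can be a single
; pblendw (immediate 0xAA, selecting the odd words from %b); without it the
; lowering has to fall back to a longer sequence.
define <8 x i16> @pblendw_test(<8 x i16> %a, <8 x i16> %b) {
  %1 = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
  ret <8 x i16> %1
}
;;;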


Anyways, it's almost certainly useful to look into any non-test-suite
benchmarks you have, or to run the benchmarks on non-intel hardware. Let me
know how it goes! So far, with the fixes I've landed recently, I'm seeing
more improvements than regressions on the nightly test suite. =]
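
One more note on the insertps/movss issue discussed below: Andrea's @baz is
the blend-into-a-register form; the other pattern a post-RA combine would
need to match is the zeroing form, something like this (test name is just
illustrative):

;;;
; Lane 0 from %B with the upper lanes zeroed. movss only has this zeroing
; behavior when its source is a memory operand; from a register it blends
; into the destination instead, which is why matching this late (after
; register allocation and spills) is attractive.
define <4 x float> @insert_zero(<4 x float> %B) {
  %1 = shufflevector <4 x float> %B, <4 x float> zeroinitializer, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
  ret <4 x float> %1
}
;;;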

-Chandler

On Wed, Sep 10, 2014 at 3:36 AM, Andrea Di Biagio
<andrea.dibiagio at gmail.com> wrote:

> On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth <chandlerc at google.com>
> wrote:
> > Awesome, thanks for all the information!
> >
> > See below:
> >
> > On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio
> > <andrea.dibiagio at gmail.com> wrote:
> >>
> >> You have already mentioned how the new shuffle lowering is missing some
> >> features; for example, you explicitly said that we currently lack
> >> SSE4.1 blend support. Unfortunately, this seems to be one of the main
> >> reasons for the slowdown we are seeing.
> >>
> >> Here is a list of what we found so far that we think is causing most
> >> of the slowdown:
> >> 1) shufps is always emitted in cases where we could emit a single
> >> blendps; in these cases, blendps is preferable because it has better
> >> reciprocal throughput (this is true on all modern Intel and AMD CPUs).
> >
> >
> > Yep. I think this is actually super easy. I'll add support for blendps
> > shortly.
>
> Thanks Chandler!
>
> >
> >> 3) When a shuffle performs an insert at index 0 we always generate an
> >> insertps, while a movss would do a better job.
> >> ;;;
> >> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
> >>   %1 = shufflevector <4 x float> %A, <4 x float> %B,
> >>                      <4 x i32> <i32 4, i32 1, i32 2, i32 3>
> >>   ret <4 x float> %1
> >> }
> >> ;;;
> >>
> >> llc (-mcpu=corei7-avx):
> >>   vmovss %xmm1, %xmm0, %xmm0
> >>
> >> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
> >>   vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
> >
> >
> > So, this is hard. I think we should do this in MC after register
> > allocation because movss is the worst instruction ever: it switches from
> > blending with the destination to zeroing the destination when the source
> > switches from a register to a memory operand. =[ I would like to not
> > emit movss in the DAG *ever*, and teach the MC combine pass to run after
> > register allocation (and thus spills) have been emitted. This way we can
> > match both patterns: when insertps is zeroing the other lanes and the
> > operand is from memory, and when insertps is blending into the other
> > lanes and the operand is in a register.
> >
> > Does that make sense? If so, would you be up for looking at this side of
> > things? It seems nicely separable.
>
> I think it is a good idea and it makes sense to me.
> I will start investigating on this and see what can be done.
>
> Cheers,
> Andrea
>