[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Chandler Carruth chandlerc at google.com
Tue Sep 9 15:39:20 PDT 2014


Awesome, thanks for all the information!

See below:

On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
wrote:

> You have already mentioned how the new shuffle lowering is missing
> some features; for example, you explicitly said that we currently lack
> SSE4.1 blend support. Unfortunately, this seems to be one of the
> main reasons for the slowdown we are seeing.
>
> Here is a list of what we found so far that we think is causing most
> of the slowdown:
> 1) shufps is always emitted in cases where we could emit a single
> blendps; in these cases, blendps is preferable because it has better
> reciprocal throughput (this is true on all modern Intel and AMD CPUs).
>

Yep. I think this is actually super easy. I'll add support for blendps
shortly.
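
For illustration, here is a standalone sketch (a hypothetical helper, not the
actual LLVM code) of the check this needs: a 4-lane shuffle is a single
blendps whenever result lane i takes element i of one of the two sources, and
the immediate simply has bit i set for every lane taken from the second
source. The <0,5,2,7> mask quoted below maps to the immediate 10, which is
exactly the vblendps $10 in that example.

// Standalone sketch, not LLVM code: map a 4-lane shuffle mask to a
// blendps immediate, or report that the mask is not a pure blend.
#include <cstdio>

// Mask uses shufflevector numbering: 0-3 = first source, 4-7 = second
// source, -1 = undef. Returns the immediate, or -1 if not a blend.
int blendpsImmediate(const int Mask[4]) {
  int Imm = 0;
  for (int i = 0; i < 4; ++i) {
    if (Mask[i] < 0 || Mask[i] == i)
      continue;        // lane i stays in the first source (or is undef)
    if (Mask[i] == i + 4)
      Imm |= 1 << i;   // lane i comes from the second source
    else
      return -1;       // an element moves across lanes: not a pure blend
  }
  return Imm;
}

int main() {
  const int Mask[4] = {0, 5, 2, 7};
  printf("imm = %d\n", blendpsImmediate(Mask)); // prints 10
}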


>
> Things get worse when it comes to lowering shuffles where the shuffle
> mask indices refer to elements from both input vectors in each lane.
> For example, a shuffle mask of <0,5,2,7> could be easily lowered into
> a single blendps; instead it gets lowered into two shufps
> instructions.
>
> Example:
> ;;;
> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0,
> i32 5, i32 2, i32 7>
>   ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>   vblendps  $10, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vshufps $-40, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,2],xmm1[1,3]
>   vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0 = xmm0[0,2,1,3]
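
To make the comparison concrete, $-40 is llc's signed-byte printing of the
immediate 0xD8, whose 2-bit fields select lanes 0 and 2 from the first source
and lanes 1 and 3 from the second. A small standalone check (plain C++,
nothing LLVM-specific) confirms that the two shufps steps compose to the
requested <0,5,2,7> shuffle, so the pair really does spend two shuffles on
something a single blendps can do:

// Standalone check, not LLVM code: model shufps and compose the two steps.
#include <array>
#include <cassert>

using V4 = std::array<float, 4>;

// SHUFPS dst, src1, src2, imm: lanes 0-1 are selected from src1 and lanes
// 2-3 from src2, each by a 2-bit field of the immediate.
V4 shufps(V4 Src1, V4 Src2, unsigned Imm) {
  return {Src1[Imm & 3], Src1[(Imm >> 2) & 3],
          Src2[(Imm >> 4) & 3], Src2[(Imm >> 6) & 3]};
}

int main() {
  V4 A = {0, 1, 2, 3};      // %xmm0 holds %A
  V4 B = {10, 11, 12, 13};  // %xmm1 holds %B
  V4 T = shufps(A, B, 0xD8);   // xmm0 = xmm0[0,2],xmm1[1,3]
  V4 R = shufps(T, T, 0xD8);   // xmm0 = xmm0[0,2,1,3]
  V4 Expected = {A[0], B[1], A[2], B[3]};  // the <0,5,2,7> mask
  assert(R == Expected);
}
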
>
>
> 2) On SSE4.1, we should try not to emit an insertps if the shuffle
> mask identifies a blend. At the moment the new lowering logic is very
> aggressively emitting insertps instead of the cheaper blendps.
>
> Example:
> ;;;
> define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4,
> i32 5, i32 2, i32 7>
>   ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>   vblendps  $11, %xmm1, %xmm0, %xmm0   # xmm0 = xmm1[0,1],xmm0[2],xmm1[3]
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vinsertps $-96, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,1],xmm0[2],xmm1[3]
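
For reference, the insertps immediate is easy to decode (standalone
illustration, not LLVM code): bits 7:6 select the source element, bits 5:4
the destination lane, and bits 3:0 are a zero mask. The $-96 above is 0xA0,
meaning take element 2 of the second source, write it to lane 2, and zero
nothing, which is exactly the blend that blendps expresses more cheaply:

// Standalone illustration, not LLVM code: decode an INSERTPS immediate.
#include <cstdio>

void decodeInsertps(unsigned Imm) {
  unsigned CountS = (Imm >> 6) & 3;  // which element of the source to take
  unsigned CountD = (Imm >> 4) & 3;  // which destination lane to overwrite
  unsigned ZMask  = Imm & 0xF;       // result lanes forced to zero
  printf("src elt %u -> dst lane %u, zero mask 0x%X\n", CountS, CountD, ZMask);
}

int main() {
  decodeInsertps(0xA0);  // the $-96 above: src elt 2 -> dst lane 2, no zeroing
  decodeInsertps(0x00);  // the $0 in the @baz example below: insert at lane 0
}
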
>
>
> 3) When a shuffle performs an insert at index 0 we always generate an
> insertps, while a movss would do a better job.
> ;;;
> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4,
> i32 1, i32 2, i32 3>
>   ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>   vmovss %xmm1, %xmm0, %xmm0
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>

So, this is hard. I think we should do this in MC after register allocation
because movss is the worst instruction ever: it switches from blending with
the destination to zeroing the destination when the source switches from a
register to a memory operand. =[ I would like to not emit movss in the DAG
*ever*, and teach the MC combine pass to run after register allocation (and
thus spill code) has been emitted. This way we can match both patterns: when
insertps is zeroing the other lanes and the operand is from memory, and
when insertps is blending into the other lanes and the operand is in a
register.

Does that make sense? If so, would you be up for looking at this side of
things? It seems nicely separable.
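
For anyone who has not run into it before, the semantic split is easy to see
with plain SSE intrinsics (ordinary C++, nothing LLVM-specific): _mm_move_ss
is the register form of movss and blends into the destination, while
_mm_load_ss is the memory form and zeroes the upper lanes. Whether a given
insertps can be rewritten as a movss therefore depends on whether its operand
ends up in a register or in memory, which is only known after register
allocation and spilling:

// Standalone illustration, not LLVM code: the two behaviors of movss.
#include <xmmintrin.h>
#include <cstdio>

static void print4(const char *Name, __m128 V) {
  float F[4];
  _mm_storeu_ps(F, V);
  printf("%s = { %g, %g, %g, %g }\n", Name, F[0], F[1], F[2], F[3]);
}

int main() {
  __m128 A = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);
  __m128 B = _mm_setr_ps(4.0f, 5.0f, 6.0f, 7.0f);
  float Mem = 9.0f;

  // Register form: movss blends B's low element into A, keeping A[1..3].
  print4("reg", _mm_move_ss(A, B));  // { 4, 1, 2, 3 }

  // Memory form: movss loads one float and zeroes lanes 1..3.
  print4("mem", _mm_load_ss(&Mem));  // { 9, 0, 0, 0 }
}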