<div dir="ltr"><div class="gmail_extra">Awesome, thanks for all the information!</div><div class="gmail_extra"><br></div><div class="gmail_extra">See below:</div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio <span dir="ltr"><<a href="mailto:andrea.dibiagio@gmail.com" target="_blank">andrea.dibiagio@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":5bb" class="a3s" style="overflow:hidden">You have already mentioned how the new shuffle lowering is missing<br>

some features; for example, you explicitly said that we currently lack<br>

SSE4.1 blend support. Unfortunately, this seems to be one of the<br>

main reasons for the slowdown we are seeing.<br>

<br>

Here is a list of what we found so far that we think is causing most<br>

of the slowdown:<br>

1) shufps is always emitted in cases where we could emit a single<br>

blendps; in these cases, blendps is preferable because it has better<br>

reciprocal throughput (this is true on all modern Intel and AMD cpus).<br></div></blockquote><div><br></div><div>Yep. I think this is actually super easy. I'll add support for blendps shortly.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":5bb" class="a3s" style="overflow:hidden">

<br>

Things get worse when it comes to lowering shuffles where the shuffle<br>

mask indices refer to elements from both input vectors in each lane.<br>

For example, a shuffle mask of <0,5,2,7> could be easily lowered into<br>

a single blendps; instead it gets lowered into two shufps<br>

instructions.<br>

<br>

Example:<br>

;;;<br>

define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {<br>

  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0,<br>

i32 5, i32 2, i32 7><br>

  ret <4 x float> %1<br>

}<br>

;;;<br>

<br>

llc (-mcpu=corei7-avx):<br>

  vblendps  $10, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]<br>

<br>

llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):<br>

  vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]<br>

  vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0 = xmm0[0,2,1,3]<br>

<br>

<br>
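To illustrate the blend case in 1): a shuffle mask is a blend when every element stays in its own lane and only the choice of input vector varies. The sketch below is a small Python model of that check, not LLVM code; the helper name blend_immediate is hypothetical, and it assumes that a set immediate bit selects the lane from the second input and that the mask contains no undef indices.<br>

```python
def blend_immediate(mask):
    """Return the blend immediate for a shuffle mask, or None if not a blend.

    mask holds one index per lane: 0..n-1 pick from the first input,
    n..2n-1 pick from the second input (shufflevector numbering).
    Assumption: a set bit in the immediate selects the second input.
    """
    n = len(mask)
    imm = 0
    for lane, idx in enumerate(mask):
        if idx == lane:
            continue              # lane comes from the first input
        if idx == lane + n:
            imm |= 1 << lane      # lane comes from the second input
        else:
            return None           # element changes lanes: not a blend
    return imm


print(blend_immediate([0, 5, 2, 7]))   # mask from @foo
print(blend_immediate([1, 0, 2, 3]))   # swaps lanes, so not a blend
```

For the mask <0,5,2,7> this model yields 10, which agrees with the $10 immediate in the vblendps output above.<br>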

2) On SSE4.1, we should try not to emit an insertps if the shuffle<br>

mask identifies a blend. At the moment the new lowering logic is very<br>

aggressive about emitting insertps instead of the cheaper blendps.<br>

<br>

Example:<br>

;;;<br>

define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {<br>

  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4,<br>

i32 5, i32 2, i32 7><br>

  ret <4 x float> %1<br>

}<br>

;;;<br>

<br>

llc (-mcpu=corei7-avx):<br>

  vblendps  $11, %xmm0, %xmm1, %xmm0   # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]<br>

<br>

llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):<br>

  vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]<br>

<br>

<br>

3) When a shuffle performs an insert at index 0 we always generate an<br>

insertps, while a movss would do a better job.<br>

;;;<br>

define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {<br>

  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4,<br>

i32 1, i32 2, i32 3><br>

  ret <4 x float> %1<br>

}<br>

;;;<br>

<br>

llc (-mcpu=corei7-avx):<br>

  vmovss %xmm1, %xmm0, %xmm0<br>

<br>

llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):<br>

  vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]<br></div></blockquote><div><br></div><div>So, this is hard. I think we should do this in MC after register allocation because movss is the worst instruction ever: it switches from blending with the destination to zeroing the destination when the source switches from a register to a memory operand. =[ I would like to not emit movss in the DAG *ever*, and teach the MC combine pass to run after register allocation, once any spills have been emitted. This way we can match both patterns: when insertps is zeroing the other lanes and the operand is from memory, and when insertps is blending into the other lanes and the operand is in a register.</div><div><br></div><div>Does that make sense? If so, would you be up for looking at this side of things? It seems nicely separable.</div></div><br></div></div>
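<div>The two movss behaviours, and the two insertps patterns that correspond to them, can be modelled in a few lines. This is a rough Python sketch of the instruction semantics on 4-element lists, not compiler code: the function names are hypothetical, and the immediate layout assumed for insertps is bits 7:6 for the source lane, bits 5:4 for the destination lane, and bits 3:0 for the zero mask.</div>

```python
def movss_reg(dst, src):
    """Register form: replace lane 0 of dst, keep lanes 1..3 (a blend)."""
    return [src[0]] + list(dst[1:])


def movss_load(value):
    """Memory form: load one float into lane 0 and zero lanes 1..3."""
    return [value, 0.0, 0.0, 0.0]


def insertps(dst, src, imm):
    """Model of insertps with a register source (assumed imm8 layout above)."""
    imm &= 0xFF                       # accept signed immediates such as -96
    src_lane = (imm >> 6) & 3
    dst_lane = (imm >> 4) & 3
    out = list(dst)
    out[dst_lane] = src[src_lane]
    # Apply the zero mask after the insert.
    return [0.0 if (imm >> i) & 1 else out[i] for i in range(4)]


def insertps_load(dst, value, imm):
    """Model of insertps with a memory source: a single float is loaded,
    so the source-lane bits are ignored."""
    return insertps(dst, [value, 0.0, 0.0, 0.0], imm & 0x3F)


a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]

# Register operand, no zero mask: same result as the register form of movss.
print(insertps(a, b, 0x00) == movss_reg(a, b))

# Memory operand, lanes 1..3 zeroed: same result as the load form of movss.
print(insertps_load(a, 9.0, 0x0E) == movss_load(9.0))
```

<div>Under the same assumptions, the $-96 immediate quoted earlier decodes to source lane 2 and destination lane 2 with an empty zero mask, which matches the xmm0[0,1],xmm1[2],xmm0[3] comment.</div>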