<div dir="ltr">Andrea, Quentin:<div><br></div><div>Ok, everything for blendps, insertps, movddup, movsldup, movshdup, unpcklps, and unpckhps is committed and should generally be working. I've not tested it *super* thoroughly (will do this ASAP), so if you run into something fishy, don't burn lots of time on it.</div><div><br></div><div>I've also fixed a number of issues I found in the nightly test suite and things like gcc-loops. There are still a couple of regressions I spotted in the nightly test suite that I haven't gotten to yet.</div><div><br></div><div>I've got very rudimentary support for pblendw finished and committed. There is a much more fundamental change that is really needed for pblendw support though -- currently, the blend lowering strategy assumes this instruction doesn't exist and thus picks a deeply wrong strategy in some cases... Not sure how much this is even relevant though.</div><div><br></div><div><br></div><div>Anyways, it's almost certainly useful to look into any non-test-suite benchmarks you have, or to run the benchmarks on non-Intel hardware. Let me know how it goes! So far, with the fixes I've landed recently, I'm seeing more improvements than regressions on the nightly test suite. =]</div><div><br></div><div>-Chandler</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Sep 10, 2014 at 3:36 AM, Andrea Di Biagio <span dir="ltr"><<a href="mailto:andrea.dibiagio@gmail.com" target="_blank">andrea.dibiagio@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth <<a href="mailto:chandlerc@google.com">chandlerc@google.com</a>> wrote:<br>

> Awesome, thanks for all the information!<br>

><br>

> See below:<br>

><br>

> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio <<a href="mailto:andrea.dibiagio@gmail.com">andrea.dibiagio@gmail.com</a>><br>

> wrote:<br>

>><br>

>> You have already mentioned how the new shuffle lowering is missing<br>

>> some features; for example, you explicitly said that we currently lack<br>

>> SSE4.1 blend support. Unfortunately, this seems to be one of the<br>

>> main reasons for the slowdown we are seeing.<br>

>><br>

>> Here is a list of what we found so far that we think is causing most<br>

>> of the slowdown:<br>

>> 1) shufps is always emitted in cases where we could emit a single<br>

>> blendps; in these cases, blendps is preferable because it has better<br>

>> reciprocal throughput (this is true on all modern Intel and AMD CPUs).<br>

><br>

><br>

> Yep. I think this is actually super easy. I'll add support for blendps<br>

> shortly.<br>
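To make item 1 concrete, here is a minimal C sketch (a scalar emulation, not LLVM's actual lowering code) of when a two-input shuffle is really a blend: every result lane must come from the same lane of one of the two inputs, and the BLENDPS immediate just records which input supplied it. The helper names `blendps` and `blend_imm_for_mask` are hypothetical. The mask <0, 1, 6, 7> is an example a single shufps can also express (immediate 0xE4, low half from A, high half from B), which is exactly the situation where blendps is the better choice.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of SSE4.1 BLENDPS: lane i comes from b when bit i of imm
   is set, otherwise from a. */
static void blendps(const float a[4], const float b[4], uint8_t imm,
                    float out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = ((imm >> i) & 1) ? b[i] : a[i];
}

/* Hypothetical matcher: given a 4-lane shufflevector mask indexing into
   concat(A, B) (0-3 = A, 4-7 = B, -1 = undef), return the BLENDPS
   immediate if every lane stays in place, or -1 if the shuffle is not a
   pure blend. */
static int blend_imm_for_mask(const int mask[4]) {
    int imm = 0;
    for (int i = 0; i < 4; ++i) {
        assert(mask[i] >= -1 && mask[i] < 8);
        if (mask[i] == -1 || mask[i] == i)
            continue;          /* undef or same lane of A: bit stays clear */
        if (mask[i] == i + 4)
            imm |= 1 << i;     /* same lane of B: set bit i */
        else
            return -1;         /* element changes lanes: needs a real shuffle */
    }
    return imm;
}
```

For example, `blend_imm_for_mask` on <0, 1, 6, 7> yields 0xC, and on the movss-style mask <4, 1, 2, 3> it yields 0x1.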

<br>

</span>Thanks Chandler!<br>

<span class=""><br>

><br>

>> 3) When a shuffle performs an insert at index 0, we always generate an<br>

>> insertps, while a movss would do a better job.<br>

>> ;;;<br>

>> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {<br>

>>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3><br>

>>   ret <4 x float> %1<br>

>> }<br>

>> ;;;<br>

>><br>

>> llc (-mcpu=corei7-avx):<br>

>>   vmovss %xmm1, %xmm0, %xmm0<br>

>><br>

>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):<br>

>>   vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]<br>
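Both outputs above compute the same value. The minimal C sketch below (scalar emulation with hypothetical helper names, register-source forms only) models the IR shuffle with mask <4, 1, 2, 3>, INSERTPS with immediate 0, and register-to-register MOVSS so the equivalence can be checked lane by lane. It assumes Intel's documented INSERTPS immediate layout: imm[7:6] selects the source lane, imm[5:4] the destination lane, and imm[3:0] is a mask of lanes to zero.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of LLVM IR `shufflevector` on two <4 x float> inputs:
   mask values 0-3 pick from a, 4-7 pick from b (undef lanes not modeled). */
static void shufflevector4(const float a[4], const float b[4],
                           const int mask[4], float out[4]) {
    for (int i = 0; i < 4; ++i) {
        assert(mask[i] >= 0 && mask[i] < 8);
        out[i] = mask[i] < 4 ? a[mask[i]] : b[mask[i] - 4];
    }
}

/* Scalar model of INSERTPS with a register source: copy dst, overwrite
   lane imm[5:4] with src lane imm[7:6], then zero lanes set in imm[3:0]. */
static void insertps(const float dst[4], const float src[4], uint8_t imm,
                     float out[4]) {
    int count_s = (imm >> 6) & 3, count_d = (imm >> 4) & 3;
    for (int i = 0; i < 4; ++i)
        out[i] = dst[i];
    out[count_d] = src[count_s];
    for (int i = 0; i < 4; ++i)
        if ((imm >> i) & 1)
            out[i] = 0.0f;
}

/* Scalar model of register-to-register MOVSS: replace lane 0 with src[0]
   and keep lanes 1-3 of dst. */
static void movss_reg(const float dst[4], const float src[4], float out[4]) {
    out[0] = src[0];
    for (int i = 1; i < 4; ++i)
        out[i] = dst[i];
}
```

With A = {1, 2, 3, 4} and B = {5, 6, 7, 8}, all three produce {5, 2, 3, 4}, matching the `xmm1[0],xmm0[1,2,3]` comment on the vinsertps line.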

><br>

><br>

> So, this is hard. I think we should do this in MC after register allocation<br>

> because movss is the worst instruction ever: it switches from blending with<br>

> the destination to zeroing the destination when the source switches from a<br>

> register to a memory operand. =[ I would like to not emit movss in the DAG<br>

> *ever*, and teach the MC combine pass to run after register allocation (and<br>

> thus after spills have been emitted). This way we can match both patterns: when<br>

> insertps is zeroing the other lanes and the operand is from memory, and when<br>

> insertps is blending into the other lanes and the operand is in a register.<br>
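The two patterns described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not LLVM's combine: the helper names are hypothetical, and it relies on Intel's documented INSERTPS encoding, where imm[7:6] (source lane) is ignored for a memory operand, imm[5:4] is the destination lane, and imm[3:0] is the zero mask. It shows why no single "insertps to movss" rule works: the matching INSERTPS immediate differs depending on whether the operand is in a register or in memory, which may only be settled once register allocation and spilling are done.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Scalar model of MOVSS with a memory source: loads one float into lane 0
   and ZEROES lanes 1-3 (the register form keeps them instead). */
static void movss_load(const float *mem, float out[4]) {
    out[0] = *mem;
    out[1] = out[2] = out[3] = 0.0f;
}

/* Hypothetical post-RA check: can this INSERTPS be rewritten as MOVSS?
   - Memory operand: imm[7:6] is ignored; it matches the zeroing load form
     only if it writes lane 0 and zeroes exactly lanes 1-3.
   - Register operand: it matches the merging form only if it moves source
     lane 0 into lane 0 and zeroes nothing. */
static bool insertps_is_movss(uint8_t imm, bool operand_is_memory) {
    int count_d = (imm >> 4) & 3;
    int zmask = imm & 0xF;
    if (operand_is_memory)
        return count_d == 0 && zmask == 0xE;
    int count_s = (imm >> 6) & 3;
    return count_s == 0 && count_d == 0 && zmask == 0;
}
```

For instance, immediate 0x0E qualifies only with a memory operand, while immediate 0x00 qualifies only with a register operand.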

><br>

> Does that make sense? If so, would you be up for looking at this side of<br>

> things? It seems nicely separable.<br>

<br>

</span>I think it is a good idea and it makes sense to me.<br>

I will start investigating this and see what can be done.<br>

<br>

Cheers,<br>

Andrea<br>

</blockquote></div><br></div>