<div dir="ltr">Andrea, Quentin:<div><br></div><div>Ok, everything for blendps, insertps, movddup, movsldup, movshdup, unpcklps, and unpckhps is committed and should generally be working. I've not tested it *super* thoroughly (will do this ASAP) so if you run into something fishy, don't burn lots of time on it.</div><div><br></div><div>I've also fixed a number of issues I found in the nightly test suite and things like gcc-loops. I think there are still a couple of regressions I spotted in the nightly test suite, but haven't gotten to them yet.</div><div><br></div><div>I've got very rudimentary support for pblendw finished and committed. There is a much more fundamental change that is really needed for pblendw support though -- currently, the blend lowering strategy assumes this instruction doesn't exist and thus picks a deeply wrong strategy in some cases... Not sure how much this is even relevant though.</div><div><br></div><div><br></div><div>Anyways, it's almost certainly useful to look into any non-test-suite benchmarks you have, or to run the benchmarks on non-Intel hardware. Let me know how it goes! So far, with the fixes I've landed recently, I'm seeing more improvements than regressions on the nightly test suite. =]</div><div><br></div><div>-Chandler</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Sep 10, 2014 at 3:36 AM, Andrea Di Biagio <span dir="ltr">&lt;<a href="mailto:andrea.dibiagio@gmail.com" target="_blank">andrea.dibiagio@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth &lt;<a href="mailto:chandlerc@google.com">chandlerc@google.com</a>&gt; wrote:<br>
> Awesome, thanks for all the information!<br>
><br>
> See below:<br>
><br>
> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio &lt;<a href="mailto:andrea.dibiagio@gmail.com">andrea.dibiagio@gmail.com</a>&gt;<br>
> wrote:<br>
>><br>
>> You have already mentioned how the new shuffle lowering is missing<br>
>> some features; for example, you explicitly said that we currently lack<br>
>> SSE4.1 blend support. Unfortunately, this seems to be one of the<br>
>> main reasons for the slowdown we are seeing.<br>
>><br>
>> Here is a list of what we found so far that we think is causing most<br>
>> of the slowdown:<br>
>> 1) shufps is always emitted in cases where we could emit a single<br>
>> blendps; in these cases, blendps is preferable because it has better<br>
>> reciprocal throughput (this is true on all modern Intel and AMD cpus).<br>
><br>
><br>
> Yep. I think this is actually super easy. I'll add support for blendps<br>
> shortly.<br>
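As a rough illustration of the blendps case discussed above: a 4-lane shuffle is a pure blend when every lane either keeps its own lane from the first input or takes the same lane from the second input. The sketch below is not the LLVM code. The name blend_immediate is hypothetical, undef mask lanes are not handled, and the mapping of a set immediate bit to "take from the second input" is an assumption about operand order.

```python
def blend_immediate(mask):
    """Return a blendps-style immediate for a 4-lane shuffle mask, or None.

    Mask entries follow the shufflevector convention: 0-3 pick a lane of
    the first input, 4-7 pick a lane of the second input.
    """
    imm = 0
    for lane, src in enumerate(mask):
        if src == lane:
            continue               # lane stays from the first input
        if src == lane + 4:
            imm += 2 ** lane       # lane comes from the second input (assumed bit meaning)
        else:
            return None            # an element changes position: not a blend
    return imm
```

For the mask 4, 1, 2, 3 used later in this thread the function returns 1. A mask such as 1, 0, 2, 3 moves elements between lanes and returns None, so it would still need a real shuffle.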
<br>
</span>Thanks Chandler!<br>
<span class=""><br>
><br>
>> 3) When a shuffle performs an insert at index 0 we always generate an<br>
>> insertps, while a movss would do a better job.<br>
>> ;;;<br>
>> define &lt;4 x float&gt; @baz(&lt;4 x float&gt; %A, &lt;4 x float&gt; %B) {<br>
>> %1 = shufflevector &lt;4 x float&gt; %A, &lt;4 x float&gt; %B, &lt;4 x i32&gt; &lt;i32 4,<br>
>> i32 1, i32 2, i32 3&gt;<br>
>> ret &lt;4 x float&gt; %1<br>
>> }<br>
>> ;;;<br>
>><br>
>> llc (-mcpu=corei7-avx):<br>
>> vmovss %xmm1, %xmm0, %xmm0<br>
>><br>
>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):<br>
>> vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]<br>
><br>
><br>
> So, this is hard. I think we should do this in MC after register allocation<br>
> because movss is the worst instruction ever: it switches from blending with<br>
> the destination to zeroing the destination when the source switches from a<br>
> register to a memory operand. =[ I would like to not emit movss in the DAG<br>
> *ever*, and teach the MC combine pass to run after register allocation (and<br>
> thus spills) have been emitted. This way we can match both patterns: when<br>
> insertps is zeroing the other lanes and the operand is from memory, and when<br>
> insertps is blending into the other lanes and the operand is in a register.<br>
><br>
> Does that make sense? If so, would you be up for looking at this side of<br>
> things? It seems nicely separable.<br>
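To make the two patterns above concrete, here is a small lane-level model of the instructions. It is only a sketch: registers are plain 4-element lists, a memory operand is modeled as a 1-element list, and the function names are made up for illustration. The insertps immediate layout used here (bits 7:6 source lane, bits 5:4 destination lane, bits 3:0 zero mask) follows the instruction's documented encoding.

```python
def movss_reg(dst, src):
    # Register form: replace lane 0, keep lanes 1-3 of the destination.
    return [src[0], dst[1], dst[2], dst[3]]


def movss_mem(value):
    # Memory form: load one float into lane 0 and zero lanes 1-3.
    return [value, 0.0, 0.0, 0.0]


def insertps(dst, src, imm):
    # Bits 7:6 pick the source lane, bits 5:4 pick the destination lane,
    # and bits 3:0 are a zero mask applied after the insert.
    src_lane = (imm // 64) % 4
    dst_lane = (imm // 16) % 4
    out = list(dst)
    out[dst_lane] = src[src_lane]
    return [0.0 if (imm // 2 ** i) % 2 else x for i, x in enumerate(out)]


a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
blend_matches = insertps(a, b, 0x00) == movss_reg(a, b)    # register operand, blending
zero_matches = insertps(a, [9.0], 0x0E) == movss_mem(9.0)  # memory operand, zeroing
```

Both comparisons hold in this model, which is the pair of cases a post-register-allocation rewrite would need to recognize: immediate 0x00 with a register source, and immediate 0x0E with a memory source.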
<br>
</span>I think it is a good idea and it makes sense to me.<br>
I will start investigating this and see what can be done.<br>
<br>
Cheers,<br>
Andrea<br>
</blockquote></div><br></div>