<div dir="ltr"><div class="gmail_extra">I only have one concern here, and it is just a very general concern:</div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Nov 30, 2014 at 1:37 PM, Simon Pilgrim <span dir="ltr"><<a href="mailto:llvm-dev@redking.me.uk" target="_blank">llvm-dev@redking.me.uk</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":1m9" class="a3s" style="overflow:hidden">4i32 shuffles for single insertions into zero vectors lowers to X86vzmovl which was using (v)blendps - causing domain switch stalls. This patch fixes this by using (v)pblendw instead.<br>

<br>

The updated tests on test/CodeGen/X86/sse41.ll still contain a domain stall due to the use of insertps - I'm looking at fixing this in a future patch.<br></div></blockquote><div><br></div><div>Until this is fixed, the test cases have actually regressed because they're still using insertps. =/</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":1m9" class="a3s" style="overflow:hidden">

<br>

Pre-SSE4.1 targets are still affected by a similar domain stall using movss - we could fix this by using 2 x ( punpckldq XMM, zero ) in series - if people agree I'll make a patch for this as well.</div></blockquote></div><br></div><div class="gmail_extra">Yes, I think it's important to fix all of these together, so we don't see stray regressions where we improve the domain crossing situation but cause the domain crossings to be less easily hidden by the processor's inherent out-of-order execution, hyper-threading, etc.</div></div>