<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div><div></div><div><div>On Sep 17, 2014, at 11:10 AM, Quentin Colombet &lt;<a href="mailto:qcolombet@apple.com">qcolombet@apple.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><meta http-equiv="Content-Type" content="text/html charset=windows-1252"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">Hi Chandler,<div><br></div><div>Here is a new test case.</div><div>With the new lowering, we fail to fold a load into the shuffle.</div><div><div>To reproduce:</div><div>llc -x86-experimental-vector-shuffle-lowering=true missing_folding.ll</div><div>llc -x86-experimental-vector-shuffle-lowering=false missing_folding.ll</div></div><div><br></div><div>-Quentin</div><div></div></div>

<span>&lt;missing_folding.ll&gt;</span><meta http-equiv="Content-Type" content="text/html charset=windows-1252"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On Sep 17, 2014, at 10:28 AM, Quentin Colombet &lt;<a href="mailto:qcolombet@apple.com">qcolombet@apple.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><meta http-equiv="Content-Type" content="text/html charset=windows-1252"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">Hi Chandler,<div><br></div><div>I saw regressions in our internal testing. Some of them are avx/avx2 specific.</div><div><br></div><div>Should I send reduced test cases for those, or is this something you haven’t looked at yet and is thus expected?</div><div><br></div><div>Anyway, here is the biggest offender. This is avx-specific.</div><div><br></div><div>To reproduce:</div><div><div>llc -x86-experimental-vector-shuffle-lowering=true -mattr=+avx avx_test_case.ll</div><div>llc -x86-experimental-vector-shuffle-lowering=false -mattr=+avx avx_test_case.ll</div></div><div><br></div><div>I’ll send more test cases (starting with the non-avx-specific ones) as I reduce the regressions.</div><div><br></div><div>Thanks,</div><div>-Quentin</div><div></div></div><span>&lt;avx_test_case.ll&gt;</span><meta http-equiv="Content-Type" content="text/html charset=us-ascii"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On Sep 15, 2014, at 9:03 AM, Andrea Di Biagio &lt;<a href="mailto:andrea.dibiagio@gmail.com">andrea.dibiagio@gmail.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; 
-webkit-text-stroke-width: 0px;">On Mon, Sep 15, 2014 at 1:57 PM, Chandler Carruth &lt;<a href="mailto:chandlerc@google.com">chandlerc@google.com</a>&gt; wrote:<br><blockquote type="cite">Andrea, Quentin:<br><br>Ok, everything for blendps, insertps, movddup, movsldup, movshdup, unpcklps,<br>and unpckhps is committed and should generally be working. I've not tested<br>it *super* thoroughly (will do this ASAP) so if you run into something<br>fishy, don't burn lots of time on it.<br></blockquote><br>Ok.<br><br><blockquote type="cite"><br>I've also fixed a number of issues I found in the nightly test suite and<br>things like gcc-loops. I think there are still a couple of regressions I<br>spotted in the nightly test suite, but haven't gotten to them yet.<br><br>I've got very rudimentary support for pblendw finished and committed. There<br>is a much more fundamental change that is really needed for pblendw support<br>though -- currently, the blend lowering strategy assumes this instruction<br>doesn't exist and thus picks a deeply wrong strategy in some cases... Not<br>sure how much this is even relevant though.<br><br><br>Anyways, it's almost certainly useful to look into any non-test-suite<br>benchmarks you have, or to run the benchmarks on non-intel hardware. Let me<br>know how it goes! So far, with the fixes I've landed recently, I'm seeing<br>more improvements than regressions on the nightly test suite. =]<br></blockquote><br>Cool!<br>I'll have a look at it. 
I will let you know how it goes.<br>Thanks for working on this :-).<br><br>-Andrea<br><br><blockquote type="cite"><br>-Chandler<br><br>On Wed, Sep 10, 2014 at 3:36 AM, Andrea Di Biagio<br>&lt;<a href="mailto:andrea.dibiagio@gmail.com">andrea.dibiagio@gmail.com</a>&gt; wrote:<br><blockquote type="cite"><br>On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth &lt;<a href="mailto:chandlerc@google.com">chandlerc@google.com</a>&gt;<br>wrote:<br><blockquote type="cite">Awesome, thanks for all the information!<br><br>See below:<br><br>On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio<br>&lt;<a href="mailto:andrea.dibiagio@gmail.com">andrea.dibiagio@gmail.com</a>&gt;<br>wrote:<br><blockquote type="cite"><br>You have already mentioned how the new shuffle lowering is missing<br>some features; for example, you explicitly said that we currently lack<br>SSE4.1 blend support. Unfortunately, this seems to be one of the<br>main reasons for the slowdown we are seeing.<br><br>Here is a list of what we found so far that we think is causing most<br>of the slowdown:<br>1) shufps is always emitted in cases where we could emit a single<br>blendps; in these cases, blendps is preferable because it has better<br>reciprocal throughput (this is true on all modern Intel and AMD cpus).<br></blockquote><br><br>Yep. I think this is actually super easy. 
I'll add support for blendps<br>shortly.<br></blockquote><br>Thanks Chandler!<br><br><blockquote type="cite"><br><blockquote type="cite">3) When a shuffle performs an insert at index 0 we always generate an<br>insertps, while a movss would do a better job.<br>;;;<br>define &lt;4 x float&gt; @baz(&lt;4 x float&gt; %A, &lt;4 x float&gt; %B) {<br> %1 = shufflevector &lt;4 x float&gt; %A, &lt;4 x float&gt; %B, &lt;4 x i32&gt; &lt;i32 4,<br>i32 1, i32 2, i32 3&gt;<br> ret &lt;4 x float&gt; %1<br>}<br>;;;<br><br>llc (-mcpu=corei7-avx):<br> vmovss %xmm1, %xmm0, %xmm0<br><br>llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):<br> vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]<br></blockquote><br><br>So, this is hard. I think we should do this in MC after register<br>allocation<br>because movss is the worst instruction ever: it switches from blending<br>with<br>the destination to zeroing the destination when the source switches from<br>a<br>register to a memory operand. =[ I would like to not emit movss in the<br>DAG<br>*ever*, and teach the MC combine pass to run after register allocation<br>(and<br>thus spills) have been emitted. This way we can match both patterns:<br>when<br>insertps is zeroing the other lanes and the operand is from memory, and<br>when<br>insertps is blending into the other lanes and the operand is in a<br>register.<br><br>Does that make sense? If so, would you be up for looking at this side of<br>things? 
It seems nicely separable.<br></blockquote><br>I think it is a good idea and it makes sense to me.<br>I will start investigating this and see what can be done.<br><br>Cheers,<br>Andrea</blockquote></blockquote></div></blockquote></div><br></div>_______________________________________________<br>LLVM Developers mailing list<br><a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu/">http://llvm.cs.uiuc.edu</a><br><a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br></blockquote></div><br></div>_______________________________________________<br>LLVM Developers mailing list<br><a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu">http://llvm.cs.uiuc.edu</a><br><a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br></blockquote></div><br></div></body></html>