<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On May 8, 2014, at 11:25 AM, Nick Lewycky <<a href="mailto:nicholas@mxc.ca">nicholas@mxc.ca</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">Philip Reames wrote:<br><blockquote type="cite"><br>On 05/07/2014 12:14 AM, Nick Lewycky wrote:<br><blockquote type="cite">Philip Reames wrote:<br><blockquote type="cite">Nadav,<br><br>Could you spell out what you believe to be the right approach w.r.t.<br>this topic? I'd also appreciate any links to previous discussion on this<br>topic; a quick google search didn't find much.<br><br>Reading through Filipe's patch, it seems entirely reasonable to me. It's<br>a strict reduction in the number of shufflevectors in the IR. Even given<br>the fact we can't lower shuffle vector perfectly in all cases, I don't<br>see the harm in this patch. I'm not trying to say you're wrong; I'm<br>merely trying to understand what it is I'm not seeing here.<br></blockquote><br>I think the situation is that perfect lowering of arbitrary IR<br>shufflevectors is hard, because the instructions offered by hardware<br>are very particular, because the search space is very large, and<br>because we don't have a good algorithm yet.<br><br>One previous thread is here:<br><a href="http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130429/173544.html">http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130429/173544.html</a><br>and it gets into group theory here:<br>http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130429/173601.html<br>. Even this thread mention another thread the previous week.<br><br>The correct future design for llvm is to let the IR optimizer optimize<br>the shufflevectors however it wants, and leave it to the backend to<br>solve the problem.<br></blockquote>Thanks for the background on the general problem, that was quite helpful.<br><br>After thinking about it a bit, I don't have a good general solution, but<br>I find myself really bothered by the fact that we're avoiding<br>canonicalization at the IR level (which might expose other non-trivial<br>optimizations), because we're not sure how to lower something for some<br>targets. This seems less than desirable.<br><br>Do we have a means to determine whether a particular shufflevector is<br>"trivially" (for some definition thereof) lowerable? If so, we could use<br>that as a profitability check on optimizations like the one Filipe<br>originally proposed. (e.g. if the new sufflevector maps to a single<br>instruction, it's clearly better than two instructions -- hopefully)<br>Given this would be using target information, the optimization might<br>have to be moved later in the optimizer.<br></blockquote><br>One "rule of thumb" I proposed is that there are certain shuffles that we should be able to lower on any architecture:<br></div></blockquote><div><br></div>It make sense. I prefer that we *explicitly* lower shuffles to non-shuffle instructions when possible. For example, converting shuffles to selects, or deleting undef shuffles. We already do this today, but I suspect that we can handle a few more cases. </div><div><br></div><div><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><br>* the no-op, obviously</div></blockquote><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">* splats</div></blockquote><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">* reverse all elements<br></div></blockquote><div><br></div><div>Some targets can’t handle the reverse pattern very well. If I remember correctly reversing a vector on AVX1 is inefficient.</div><br><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">* any of the above with 'undef' slotted in for any of the elements/indices.<br></div></blockquote><div><br></div><div>Sounds good.</div><div><br></div><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><br>I forget whether I got this rule into the instcombine transform I made, but I think this is a decent compromise for now.<br><br>Also, frankly, we don't do perfect lowering of non-shuffle instructions either. If we can do a good enough job, then people who would die if they don't get the right machine instructions for vector shuffles can do what we tell all the other people who'd die if they don't get the right machine instructions: use assembly.<br></div></blockquote><div><br></div><div>I am not buying the argument that we need more InstCombine level optimization on shuffles to unblock IR-level optimizations. We already have lots of IR-level shuffle optimizations and the more aggressive optimizations go into SelectionDAG, where target info is available. From my experience in optimizing graphic-ish code, not-optimizing shuffles in InstCombine does not block any loop transformation or memory optimization. This is my own experience shuffles and If people have other experiences and can bring other examples then we should talk about the specifics of where we need to optimize shuffles. </div><div><br></div><div>Nadav</div><div><br></div><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><br>Nick<br><br><blockquote type="cite">I'm not sure this particular idea is the right approach, but my thought<br>process here is that maybe we can identify some subset of the original<br>optimization which clearly *is* profitable, even given our lowering<br>restrictions. That seems like a logical path forward if we can easily<br>achieve it.<br><br>Philip<br><br><br><br><blockquote type="cite"><br>However, this isn't what we do *today* and that's acceptable today<br>people mostly write vectors in tight kernel loops themselves, or else<br>the autovectorizer picked the exact right sequence for the hardware at<br>hand.<br><br>In the future, those excuses aren't going to work. Code tuned for one<br>microarch will find itself inlined all over a codebase where it will<br>intersect code that was autovectorized to a different vector width,<br>for instance. We need the optimizer to be able to optimize and let the<br>code generator do the instruction selection.<br><br>Nick<br><br><blockquote type="cite">On 05/06/2014 09:22 AM, Nadav Rotem wrote:<br><blockquote type="cite">Hi Filipe,<br><br>Thank you for working on this. I appreciate the fact that you are<br>eager to improve LLVM and that you went through the trouble of testing<br>your code on multiple targets and on considering different<br>alternatives to the problem. I think that the patch below is unrelated<br>to the blend->select lowering so I will only address this patch. We<br>discussed on the possibility of creating new shuffles on the llvm<br>mailing lists more than once. We decided not to create new shuffles<br>because we may create new shuffle patterns that we can’t lower<br>efficiently. I ran into this problem a few times myself in different<br>LLVM-based projects, mainly in the domain of graphics where people<br>write swizzles themselves. Please do not commit this patch because it<br>creates new shuffles. It is not the correct approach. You may get<br>lucky in you selection of testcase, but I am sure that there are lots<br>of other shuffle inputs on X86 and other targets that this patch makes<br>worse.<br><br>Thanks,<br>Nadav<br><br><br><blockquote type="cite">Hi Nadav,<br><br>I'd really like this patch to get in, since it improves code<br>generation on many targets (at the very least, I've tried arm and<br>arm64 (mcpu=cyclone), ppc32 and ppc64 (mcpu=g5), and x86 and x86-64<br>(mcpu=nehalem) and they all improve (they end up with no loads vs. 1<br>or 2 loads, and code size is about half of what it was).<br><br>Even if we'd like to make the select instruction be the proper<br>canonicalization of this operation, it seems that, for now, a<br>shufflevector is better when it comes to lowering it.<br><br>Several targets have the action for the vselect as Expand, which will<br>generate a bunch of code which would then have to be pattern-matched,<br>instead of lowering the vselect as custom and falling-through to<br>Expand if there's no special-case instruction/set of instructions<br>that we can emit.<br><br>I ran opt+llc in this way, for the variants I mentioned above<br>(./bin/opt is opt with this optimization):<br>./bin/opt -O3 select-from-shuffles.ll | ./bin/llc -march=arm64<br>-mcpu=cyclone > arm64-cylone-opt<br>opt -O3 select-from-shuffles.ll | ./bin/llc -march=arm64<br>-mcpu=cyclone > arm64-cylone-opt<br><br>I can understand if we eventually drop this after we make the backend<br>treat the select as the canonical form of this operation and have<br>llvm always use that form, but for now it seems that the<br>shufflevector is better lowered in some important targets.<br><br><br>Filipe<br><br><br><br><br>On Fri, May 2, 2014 at 11:55 PM, Nadav Rotem <<a href="mailto:nrotem@apple.com">nrotem@apple.com</a><br><<a href="mailto:nrotem@apple.com">mailto:nrotem@apple.com</a>>> wrote:<br><br>Hi Filipe,<br><br>We don’t generate new shufflevector instructions during<br>optimizations because the lowering of shuffle instructions is<br>really complicated and we don’t want to generate a shuffle that<br>we don’t lower by accident.<br><br>Thanks,<br>Nadav<br><br>On May 2, 2014, at 4:34 PM, Filipe Cabecinhas<br><<a href="mailto:filcab+llvm.phabricator@gmail.com">filcab+llvm.phabricator@gmail.com</a><br><mailto:filcab%<a href="mailto:2Bllvm.phabricator@gmail.com">2Bllvm.phabricator@gmail.com</a>>> wrote:<br><br>> Remove unused argument<br>><br>> <a href="http://reviews.llvm.org/D3561">http://reviews.llvm.org/D3561</a><br>><br>> Files:<br>> lib/Transforms/InstCombine/InstCombineSelect.cpp<br>> test/Transforms/InstCombine/select.ll<br>> <D3561.9050.patch>_______________________________________________<br>> llvm-commits mailing list<br>> <a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a> <<a href="mailto:llvm-commits@cs.uiuc.edu">mailto:llvm-commits@cs.uiuc.edu</a>><br>> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br><br><br></blockquote><br><br><br>_______________________________________________<br>llvm-commits mailing list<br><a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a> <<a href="mailto:llvm-commits@cs.uiuc.edu">mailto:llvm-commits@cs.uiuc.edu</a>><br><a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br></blockquote><br><br><br>_______________________________________________<br>llvm-commits mailing list<br><a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a><br>http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</blockquote></blockquote></blockquote></div></blockquote></div><br></body></html>