<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Mon, Jul 7, 2014 at 10:41 AM, Evan Cheng <span dir="ltr"><<a href="mailto:evan.cheng@apple.com" target="_blank">evan.cheng@apple.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Have you considered using / extending perfect shuffle to replace some of the logic?<br></blockquote><div><br></div><div>

Sorry for not replying promptly here.</div><div><br></div><div>I did think about perfect shuffles, but I don't think they're going to help much. Fundamentally, perfect shuffle tables don't scale well enough to be usable. For just 8 lanes, we're talking about nearly 7 billion entries. Even assuming a *bunch* of folding through symmetry and other tricks, we're not going to get it down to a reasonable size. So we can only use perfect shuffle tables for 4 lanes and smaller. But for 4 lanes and smaller, x86 has essentially perfect shuffle *instructions*, and the tricky parts are all about balancing blend vs. shuffle operations against the potential for domain crossing penalties. Those seem reasonably handled by basic code logic and DAG combines rather than table-driven approaches.</div>
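<div><br></div><div>As a rough sanity check on the table-size figure, here is a small sketch of the arithmetic. It assumes a two-input shuffle where each output lane independently selects any element of either input or is left undefined. That counting model and the name shuffle_table_entries are my assumptions for illustration, not anything taken from the backend.</div>
<pre>
```python
def shuffle_table_entries(lanes, inputs=2, allow_undef=True):
    """Count distinct shuffle masks for a vector with `lanes` lanes.

    Assumed model: every output lane picks one of `inputs * lanes`
    source elements, plus one extra "undef" choice when allowed.
    """
    choices_per_lane = inputs * lanes + (1 if allow_undef else 0)
    return choices_per_lane ** lanes


if __name__ == "__main__":
    for lanes in (4, 8):
        print(lanes, shuffle_table_entries(lanes))
```
</pre>
<div>Under that model this prints 6561 for 4 lanes and 6975757441 for 8 lanes, so each doubling of the lane count grows the table by roughly six orders of magnitude.</div>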

<div><br></div><div>The other realization that led me down this path is that there is a very fundamental logic to the shuffle instructions on the architecture, and the best way to lower shuffle operations is to follow that logic directly. That is what leads to the decomposition structure of the code here.</div>

<div><br></div><div>-Chandler</div></div></div></div>