<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Fri, May 2, 2014 at 12:00 PM, Arnold Schwaighofer <span dir="ltr"><<a href="mailto:aschwaighofer@apple.com" target="_blank">aschwaighofer@apple.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>To clarify, I agree with Andy that we can run into phase ordering problems if we implement this as a separate pass.</div>

</blockquote><div><br></div><div>Shouldn't that largely be handled by running this after the SLP vectorizer has had a chance to search for profitable things? My feeling was that if this is separate, it will always be strictly less powerful / interesting than whatever the SLP vectorizer can do. So we let the SLP vectorizer have the first shot. (This of course doesn't solve the phase ordering problem across inlining or other complex iterations, but those seem less worrisome.)</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><br></div><div>What I wanted to say above is that if we model this transformation in the SLP vectorizer then we should not have phase ordering problems. However, I don’t think modeling this in the SLP vectorizer (adding complexity) is justified just by swap (we should do this as a dagcombine). </div>

</blockquote><div><br></div><div>I don't think bswap justifies much of anything FWIW. We can fix bswap in a myriad of ways.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div> If on the other hand we expect longer chains leading to a load that could be vectorized then it might make sense thinking about adding complexity to the slp vectorizer.</div></blockquote></div><div class="gmail_extra">

<br></div>I see a lot of really horrible code where people manually load a 32-bit or 64-bit integer, and then extract bytes, bits, or other sub-regions of it. This code invariably has comments about how doing these contortions is essential to getting decent performance. My motivation is to ensure that the optimizer can and does handle these cases so that programmers can write the more boring code and stop worrying. I suspect there are quite a few encoding things that would benefit from *some* combining.</div>
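<div class="gmail_extra"><br></div><div class="gmail_extra">As a rough illustration of the kind of code I mean (a minimal sketch; the function names here are hypothetical and not taken from any real code base), the first function is the "boring" byte-at-a-time version, and the other two show the hand-tuned shape: one wide load, followed by extracting sub-fields with shifts and masks.</div>

```c
#include <stdint.h>
#include <string.h>

/* "Boring" version: assemble a little-endian 32-bit value one byte at a time.
 * This is four narrow loads plus shifts and ors, which a load-combining
 * transform could in principle turn into a single 32-bit load. */
static inline uint32_t load_le32_bytewise(const unsigned char *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Hand-tuned version: a single wide load in the host's byte order.
 * memcpy is used so the load is well-defined for unaligned pointers. */
static inline uint32_t load_native32(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Extract byte i (0 = least significant) from an already-loaded value. */
static inline unsigned byte_of(uint32_t v, unsigned i) {
    return (unsigned)((v >> (8u * i)) & 0xffu);
}
```

<div class="gmail_extra">On a little-endian target the two loaders return the same value; the point is that programmers should be able to write the first form and leave the rest to the optimizer.</div>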

<div class="gmail_extra"><br></div><div class="gmail_extra">That doesn't mean it is worth the complexity of teaching it to the SLP vectorizer; it just means that it seems worth *some* complexity beyond a point fix for bswap. Based on the challenges you and others have described, starting off with a simple and boring pass for this, one which still gets exposed to instcombine and friends, seems like a good initial point in the tradeoff space.<br>

<br></div></div>