<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Jan 16, 2015, at 3:04 PM, Chandler Carruth <<a href="mailto:chandlerc@google.com" class="">chandlerc@google.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class="gmail_extra"><br class=""><div class="gmail_quote">On Fri, Jan 16, 2015 at 2:40 PM, Quentin Colombet <span dir="ltr" class=""><<a href="mailto:qcolombet@apple.com" target="_blank" class="">qcolombet@apple.com</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">Well, that may be the conclusion: The performance impact may be within the noise.</div><div class="">Since these kinds of patterns are very specific, this is not surprising.</div><div class="">For the record, I tend to ignore tests that run for less than 1 second (too noisy). Then, the noise level is usually around 1% on a quiet computer with fixed frequency, which is not too bad.</div></blockquote></div><br class="">Numbers would mostly be nice because I don't know if other targets have the thing that makes this such a huge win on x86 -- implicit concat with undef to form 2x-wide vectors.</div><div class="gmail_extra"><br class=""></div><div class="gmail_extra">This may be an x86-specific win, in which case it should just be added as a target-specific combine.</div></div>

</div></blockquote></div><br class=""><div class="">Isn’t that typical of SIMD architectures in general? That is, if an architecture supports both N and 2N vector sizes, an operation on a size-N vector typically clears the top half of the 2N register, right? Or, on ARMv7-like architectures, you can modify d0 and then address q0, right? I’m not very familiar with architectures other than ARMv7 NEON and SSE/AVX that support multiple native sizes, though, so correct me if I’m wrong!</div><div class=""><br class=""></div><div class="">I guess the worst case would be something like this:</div><div class=""><br class=""></div><div class="">old pseudocode:</div><div class=""><br class=""></div><div class="">concat ymm2, xmm0, xmm1</div><div class="">shuffle ymm3, ymm2</div><div class=""><br class=""></div><div class="">new pseudocode:</div><div class=""><br class=""></div><div class="">shuffle xmm2, xmm0, xmm1</div><div class="">concat ymm3, xmm2</div><div class=""><br class=""></div><div class="">If the implicit concat isn’t there, the architecture gains nothing from using smaller shuffles, and it has no two-source shuffle for reasonable element sizes, I guess it could end up with an extra op?</div><div class=""><br class=""></div><div class="">Fiona</div></body></html>