<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Jan 26, 2015 at 9:38 AM, Quentin Colombet <span dir="ltr">&lt;<a href="mailto:qcolombet@apple.com" target="_blank">qcolombet@apple.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":cha" class="a3s" style="overflow:hidden">Hi Bruno,<br>

<br>

I am not sure this is the right thing to do.<br>

Do you see any performance improvement with the new sequence?<br>

<br>

My concern is that the new sequence is a completely serial sequence of instructions, whereas the old sequence can be partly parallelized. Running both the new and old sequences through IACA, I see the following throughputs:<br>

Old:<br>

<br>

- Sandy Bridge: 6.15 cycles.<br>

- Ivy Bridge: 6.15 cycles.<br>

- Haswell: 12 cycles.<br>

<br>

New:<br>

<br>

- Sandy Bridge: 13 cycles.<br>

- Ivy Bridge: 13 cycles.<br>

- Haswell: 13 cycles.<br>

<br>

This seems to confirm my hypothesis.</div></blockquote></div><br>FWIW, this matches my experience. I have seen pinsrw and pextrw chains be astonishingly slow to execute.</div></div>