<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jan 16, 2015 at 3:23 PM, Fiona Glaser <span dir="ltr"><<a href="mailto:fglaser@apple.com" target="_blank">fglaser@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">> On Jan 16, 2015, at 3:13 PM, Chandler Carruth <<a href="mailto:chandlerc@google.com">chandlerc@google.com</a>> wrote:<br>

><br>

> Cool, this looks good to me provided your data indicates this pattern works well on other targets as well. =]<br>

><br>

> Thanks for working on it! Any chance you can also look at the fact that we use vmovq here rather than vmovlpd?<br>

<br>

</span>I would think vmovq is more correct; it zeroes out the rest of the register while vmovlpd doesn’t, so vmovlpd would create a false dependency on the source register.<br>

<span class=""><br>

><br>

> We also at some point need to post-process the shuffles and replace the ones that use a packed double type when there is an equivalent for the packed single type that removes a bitcast... It would be really awesome to get the "obvious" code of vmovlps + vmovhps here (or some variant of vmovlps that still targeted the floating point vector unit and didn't have an input dependency... maybe vxorps + vmovlps + vmovhps would be best)<br>

<br>

</span>I’m really not certain saving one cycle of latency on a unit/unit forwarding delay would be worth an entire extra uop;</blockquote><div><br></div><div>All of my measurements indicate that it is actually more than one cycle in practice. =/ It is actually a huge hit on AMD chips, and even on Intel, I've seen code whose performance really fluctuated around this.</div><div><br></div><div>The other reason I'm not worried about it is that xorps X, X should only take up space in the decode buffer, etc.; the register renamer and such handle those, AFAICT, with essentially zero execution cost.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> that doesn’t really feel worth it at all. Plus I’m not even sure those particular instructions have that delay (it’s only specific combinations, I think…?)<br></blockquote><div><br></div><div>That may well be true. I would certainly hope that they get decoded to something less crazy.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

It’s not my fault x86 has weirdly non-orthogonal vector instructions ;-)</blockquote></div><br>;] But without them, the vector shuffle lowering wouldn't be *nearly* so much fun.</div></div>