[PATCH] Improve DAG combine pass on certain IR vector patterns

Fri Jan 16 15:23:58 PST 2015

> On Jan 16, 2015, at 3:13 PM, Chandler Carruth <chandlerc at google.com> wrote:
> 
> Cool, this looks good to me provided your data indicates this pattern works well on other targets too. =]
> 
> Thanks for working on it! Any chance you can also look at the fact that we use vmovq here rather than vmovlpd?

I would think vmovq is the better choice: it zeroes out the rest of the register, while vmovlpd merges the loaded value into the register’s existing contents, so vmovlpd would create a false dependency on the source register.
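
To make the difference concrete, here is a rough scalar model in C of what the two load forms do to the destination. The names (Vec128, load_movq, load_movlpd) are made up for illustration, and this only models the architectural result, not timing or anything the backend actually emits.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit XMM register as two 64-bit lanes. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} Vec128;

/* Models the load form of vmovq: the low lane comes from memory and the
   high lane is zeroed, so the result ignores any previous register value. */
static Vec128 load_movq(const uint64_t *mem) {
    Vec128 r = { *mem, 0 };
    return r;
}

/* Models the load form of vmovlpd: the low lane comes from memory and the
   high lane is copied from src, so the result cannot be produced until
   src is available. */
static Vec128 load_movlpd(Vec128 src, const uint64_t *mem) {
    Vec128 r = { *mem, src.hi };
    return r;
}
```

In this model load_movq takes no register input at all, while load_movlpd has to read src even when the caller never looks at the high lane afterwards. That extra input is the dependency I mean.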

> 
> We also at some point need to do a post-processing of the shuffles and replace ones that use packed double type when there is an equivalent for packed single type and it removes a bitcast. It would be really awesome to get the "obvious" code of vmovlps + vmovhps here (or some variant of vmovlps that still targeted the floating point vector unit and didn't have an input dependency... maybe vxorps + vmovlps + vmovhps would be best)

I’m really not certain that saving one cycle of latency on a unit-to-unit forwarding delay is worth an entire extra uop. Plus, I’m not even sure those particular instructions incur that delay (I think it only applies to specific combinations…?)

It’s not my fault x86 has weirdly non-orthogonal vector instructions ;-)

Fiona