[PATCH] D27692: [x86] use a single shufps when it can save instructions

Tue Dec 13 11:02:50 PST 2016

sroland added a comment.

I think this should go in, just forget the domain penalties, as it shouldn't be an issue with most cpus. When I looked at this in Agner Fog's guides, my conclusion was that it is probably only really an issue with Nehalem and Via Nanos. If a cpu has just one clock additional latency back and forth it's still worth it replacing 3 shuffle instructions with one from the wrong domain (albeit the latency chain will be the same then) - and if it manages to only replace 2 shuffle instructions from the right domain it might be worse or better in such a case. (If it is actually worse with Nehalem with its 2 clock penalty back and forth would of course depend if some instruction mix is latency bound or throughput bound.)
Also, plenty of the more odd cpus either place all shuffles in int domain anyway or even do something more odd (like the original core2 merom). I suppose ideally the shuffle lowering code would take into account such hw cost differences, but the truth is right now it doesn't really model any of this (e.g. on some cpus unpacks might not have the same cost as pshufd neither and so on).
So, I'm all for it (and suggested fixing it the same in https://llvm.org/bugs/show_bug.cgi?id=27885).

https://reviews.llvm.org/D27692