[PATCH] D27692: [x86] use a single shufps when it can save instructions

Tue Dec 13 13:44:08 PST 2016

sroland added a comment.

> I'm not sure I completely agree on the details (In particular, I'm not certain the older non-AVX arches where the mov is needed have zero-latency movs), but I mostly agree. Especially since it looks like the same platforms that need the mov are the ones that *do* have transition penalty. So using a shufps may be worth it only with something like -march=nehalem -mtune=ivb, where we generate SSE code, but expect to run it on a newer platform.

You are right, I shouldn't have said "pre-avx targets", there are no cpus which can't do avx but can do move elimination. But code compiled with just sse flags but running on newer cps will still benefit from move elimination.
But using pshufd instead of shufps would actually be beneficial on some cpus which do have transition penalties (I neglected that previously) - e.g. core2 wolfdale, because shufps (along with all other float, but not double shuffles) is in the int domain anyway, just like pshufd, so the penalties are all the same (all the bulldozers also fall into that category but they of course can do avx, and they also support mov elimination). But expressing these things correctly would really require very model specific shuffle lowering. Which right now just isn't there - the code simply assumes that the int/float domains apply to shuffles as well as arithmetic instructions (hence should avoid using shuffles from wrong domain), which is woefully inaccurate for quite some cpus.

> So this is definitely lower priority, just wanted to point out it exists.

https://reviews.llvm.org/D27692