[PATCH] [X86][SSE] Keep 4i32 vector insertions in integer domain on pre-SSE4.1 targets

Thu Dec 4 12:24:16 PST 2014

I'm rather dubious of this.  Do you have timing info to support this?

Even in the best-case scenario, you're trading a domain-crossing penalty (which is 1 or 2 cycles of latency with no additional µops) for an additional real instruction (shuffle), increasing execution pressure.  e.g.:

xorps %xmm1, %xmm1  // handled in rename, does not consume execute resources on recent HW
movss %xmm0, %xmm1 // single µop, 1 cycle latency
// 1 (2 on Nehalem, IIRC) cycle latency hit if next instruction is not in FP domain

vs.

movq %xmm0, %xmm0 // 1 µop, 1 cycle latency
pshufd $blah, %xmm0, %xmm0 // 1 µop, 1 cycle latency
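
To make the comparison concrete, here is a rough, portable C model of what the two sequences compute for the simplest case (element 0 kept, other lanes zeroed). It is only a sketch: `vec4i32` and the helper names are made up for illustration, and the immediate 0xA8 is my assumption for what "blah" would be in that case, not something taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 4 x i32 XMM register; not an LLVM or intrinsic type. */
typedef struct { uint32_t lane[4]; } vec4i32;

/* xorps %xmm1, %xmm1 ; movss %xmm0, %xmm1
 * Register-to-register movss copies lane 0 and leaves the destination's
 * upper lanes alone, so the xorps is what supplies the zeros. */
static vec4i32 insert_fp_domain(vec4i32 src) {
    vec4i32 dst = {{0, 0, 0, 0}};   /* xorps */
    dst.lane[0] = src.lane[0];      /* movss */
    return dst;
}

/* movq %xmm0, %xmm0: keep the low 64 bits, zero the upper 64 bits. */
static vec4i32 model_movq(vec4i32 src) {
    vec4i32 dst = {{src.lane[0], src.lane[1], 0, 0}};
    return dst;
}

/* pshufd: each 2-bit field of the immediate picks a source lane. */
static vec4i32 model_pshufd(vec4i32 src, uint8_t imm) {
    vec4i32 dst;
    for (int i = 0; i < 4; ++i)
        dst.lane[i] = src.lane[(imm >> (2 * i)) & 3];
    return dst;
}

/* movq + pshufd. 0xA8 selects lanes [0, 2, 2, 2]: lane 0, then the
 * already-zeroed lane 2 three times (assumed immediate, see above). */
static vec4i32 insert_int_domain(vec4i32 src) {
    return model_pshufd(model_movq(src), 0xA8);
}

/* Both sequences should yield the same bits; only the cost differs. */
static void check_sequences_agree(vec4i32 v) {
    vec4i32 a = insert_fp_domain(v);
    vec4i32 b = insert_int_domain(v);
    assert(memcmp(&a, &b, sizeof a) == 0);
}
```

The point is that the result is bit-identical either way, so the only question is which sequence is cheaper to execute.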

You have *maybe* shaved a latency cycle (but probably not) at the cost of an extra µop.  Domain-crossing penalties are worth avoiding, but usually not at the cost of additional µops.  They aren't *that* painful.
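
As a back-of-the-envelope tally of the numbers quoted above (a sketch, not a measurement): the struct and function names below are hypothetical, the per-instruction figures are the ones I gave in the comments on the two sequences, and the model assumes the integer sequence pays no crossing penalty of its own.

```c
#include <assert.h>

/* Hypothetical cost summary for a short instruction sequence. */
typedef struct {
    int executed_uops;   /* µops that consume execution resources */
    int latency_cycles;  /* end-to-end latency, including any bypass delay */
} seq_cost;

/* xorps (handled in rename, not counted) + movss (1 µop, 1 cycle),
 * plus the domain-crossing penalty if the consumer is not in the FP domain. */
static seq_cost cost_fp_domain(int crossing_penalty_cycles) {
    seq_cost c = {1, 1 + crossing_penalty_cycles};
    return c;
}

/* movq (1 µop, 1 cycle) + pshufd (1 µop, 1 cycle), assuming no crossing penalty. */
static seq_cost cost_int_domain(void) {
    seq_cost c = {2, 2};
    return c;
}

/* Positive result: the integer-domain sequence has lower latency. */
static int latency_saved_by_int_domain(int crossing_penalty_cycles) {
    return cost_fp_domain(crossing_penalty_cycles).latency_cycles
         - cost_int_domain().latency_cycles;
}

/* The integer-domain sequence always executes exactly one more µop. */
static void check_extra_uop(int crossing_penalty_cycles) {
    assert(cost_int_domain().executed_uops
           - cost_fp_domain(crossing_penalty_cycles).executed_uops == 1);
}
```

With a 2-cycle penalty the integer sequence saves one cycle, with a 1-cycle penalty it breaks even, and with no penalty it is one cycle slower; in every case it executes one more µop.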

In the worst cases (uint to fp), this creates a domain-crossing penalty where there wasn't one before, and also adds execute pressure.

For the z4zz, zz4z, zzz4 cases, I expect we should usually be generating XORPS + SHUFPS.  1 µop executed (possibly with a domain-crossing penalty) vs. 3 µops.

I'm not saying that this can't possibly be a win; I'm saying that I'd like to see numbers to back it up before we do this.

http://reviews.llvm.org/D6526