[PATCH] [X86][SSE] Keep 4i32 vector insertions in integer domain on pre-SSE4.1 targets

Chandler Carruth chandlerc at gmail.com
Thu Dec 4 12:45:47 PST 2014


On Thu, Dec 4, 2014 at 12:24 PM, Steve Canon <scanon at apple.com> wrote:

> I'm rather dubious of this.  Do you have timing info to support this?
>
> Even in the best-case scenario, you're trading a domain-crossing penalty
> (which is 1 or 2 cycles of latency with no additional µops) for an
> additional real instruction (shuffle), increasing execution pressure.  e.g.:
>
> xorps %xmm1, %xmm1  // handled in rename, does not consume execute
> resources on recent HW
> movss %xmm0, %xmm1 // single µop, 1 cycle latency
> // 1 (2 on Nehalem, IIRC) cycle latency hit if next instruction is not in
> FP domain
>
> vs.
>
> movq %xmm0, %xmm0 // 1 µop, 1 cycle latency
> pshufd %xmm0, %xmm0, blah // 1 µop, 1 cycle latency
>
> You have *maybe* shaved a latency cycle (but probably not) at the cost of
> an extra µop.  Domain-crossing penalties are worth avoiding, but usually
> not at the cost of additional µops.  They aren't *that* painful.
>
> In the worst cases (uint to fp), this creates a domain-crossing penalty
> where there wasn't one before, and also adds execute pressure.
>
> For the z4zz, zz4z, zzz4 cases, I expect we should usually be generating
> XORPS + SHUFPS.  1 µop executed (possibly with a domain-crossing penalty)
> vs. 3 µops.
>
> I'm not saying that this can't possibly be a win; I'm saying that I'd like
> to see numbers to back it up before we do this.
>

I don't have numbers for this specific change, but I do have numbers for many
of the changes brought about by my rewrite of the vector shuffle lowering
code.

The second most dramatic improvement was due to reduced domain crossing. I
think we have all been underestimating the cost of these stalls; I certainly
have. I think the reason they cost more than they appear to is two-fold:

1) On Sandy Bridge and Ivy Bridge there is more (2x) ILP available for
integer shuffles than for FP shuffles. As a result, in throughput terms, we
actually seem to be getting a win from this. There seems to be enough
independence between the instructions to exploit it (much to my surprise).

2) All of the measurements indicate more cost from the actual domain
crossing than I, at least, expected. I think the primary cause is that the
crossing latency is paid both going into and coming out of the "wrong"
domain. I've seen this even when the instructions aren't actually dependent
(see the sketch after this list). I'd really love to have a chip architect
from Intel explain this in detail, but I'm not holding out any hope of that
happening.
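
To make the double crossing concrete, here is that sketch in C with SSE2
intrinsics -- my own illustration, not code from the patch. The FP-domain
routine typically lowers to the xorps + movss sequence Steve quoted; the
paddd that follows then pays the bypass delay going into the integer domain,
and any later FP consumer of the result pays it again coming back out:

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Zero every lane but lane 0 of a <4 x i32>, then feed the result to an
 * integer add.  Routing the insertion through the FP domain means the
 * paddd crosses domains on the way in, and a later FP use of the result
 * would cross again on the way out. */
static __m128i zero_upper_lanes_fp(__m128i a, __m128i bias)
{
    __m128 v = _mm_move_ss(_mm_setzero_ps(),          /* xorps + movss */
                           _mm_castsi128_ps(a));
    return _mm_add_epi32(_mm_castps_si128(v), bias);  /* paddd: crossing */
}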

As a consequence, I'm increasingly advocating for in-domain lowerings that
are cycle neutral (or close to it when the cycle counts are large), and so
far it is paying off. Vector code is (in my benchmarks) running much faster
since we moved LLVM in this direction, despite using more instructions and
µops in some cases.
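
Concretely, the in-domain alternative for the same insertion looks roughly
like this (again just an illustrative sketch using the same <emmintrin.h>
intrinsics as above, not the patch itself). It spends the extra shuffle µop
from Steve's movq + pshufd example, but the whole chain stays in the integer
domain:

/* Same operation kept in the integer domain: one more uop, no crossings. */
static __m128i zero_upper_lanes_int(__m128i a, __m128i bias)
{
    __m128i v = _mm_move_epi64(a);                      /* movq: zero upper 64 bits */
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 2, 2, 0));  /* pshufd: keep only lane 0 */
    return _mm_add_epi32(v, bias);                      /* paddd: same domain */
}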

---

Regarding the z4zz and zz4z patterns, I had already commented that the
lowering is bogus, but it should probably be fixed independently of this
patch. It's pretty bad right now regardless of the domain...