[PATCH] [X86][SSE] Keep 4i32 vector insertions in integer domain on pre-SSE4.1 targets

Sun Dec 7 08:50:24 PST 2014

On Sun, Dec 7, 2014 at 5:35 PM, Simon Pilgrim <llvm-dev at redking.me.uk>
wrote:

> Added X86vzmovl folded loads tests.
>
> I looked at using a pand with a constant mask as an alternative and saw a
> minimal regression (nearly in the noise) compared to the movq/movss
> versions I was already testing against. I'm worried about pursuing that
> route though - it adds additional memory access
>

I would be *really* surprised if pand is actually slower, especially if we
get a chance to hoist the memory access into a variable within a loop.

pand should be 2 uops and, according to Agner's tables, has a throughput of
2/cycle. The pshufd alone is 1 uop at 2/cycle, and the movq is also 1 uop at
2/cycle. But because those two are in a dependency chain, the critical path
is 2 cycles, versus 1 cycle for the pand (assuming the load doesn't stall,
which I think is usually a safe assumption in real-world code).
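
To make the comparison concrete, here is a rough sketch in C with SSE2
intrinsics of the two ways to keep lane 0 of a 4 x i32 vector and zero the
rest. The function names are made up for illustration, the exact movq/pshufd
sequence is my assumption about what the lowering looks like, and a compiler
is free to pick different instructions than the intrinsic names suggest. The
last function shows the mask being materialized once outside a loop, which is
the hoisting case mentioned above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Option A: a single pand against a constant mask. The mask is normally
 * loaded from memory, so this is one load plus one ALU op. */
static __m128i keep_low_lane_pand(__m128i v) {
    const __m128i mask = _mm_set_epi32(0, 0, 0, -1); /* lanes 3,2,1,0 */
    return _mm_and_si128(v, mask);
}

/* Option B (assumed sequence): movq zeroes the upper 64 bits, then pshufd
 * copies the zeroed lane 2 into lanes 1..3. No memory operand is needed,
 * but the two instructions form a dependency chain. */
static __m128i keep_low_lane_movq_pshufd(__m128i v) {
    __m128i lo64 = _mm_move_epi64(v);                         /* [v0, v1, 0, 0] */
    return _mm_shuffle_epi32(lo64, _MM_SHUFFLE(2, 2, 2, 0));  /* [v0, 0, 0, 0]  */
}

/* Option A inside a loop: the mask is set up once before the loop, so each
 * iteration only pays for the pand itself. nvec counts 4 x i32 vectors. */
static void keep_low_lane_array(int32_t *data, size_t nvec) {
    const __m128i mask = _mm_set_epi32(0, 0, 0, -1);
    for (size_t i = 0; i < nvec; ++i) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + 4 * i));
        _mm_storeu_si128((__m128i *)(data + 4 * i), _mm_and_si128(v, mask));
    }
}

/* Helper for checking results: returns 1 if all four lanes match. */
static int lanes_equal(__m128i a, __m128i b) {
    return _mm_movemask_epi8(_mm_cmpeq_epi32(a, b)) == 0xFFFF;
}
```

Both options should produce the same vector; the difference is only in uop
count, memory traffic, and the length of the dependency chain.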

On x86, loads (especially of constants) are *crazy* fast in my experience.

> and the mask approach might make it more difficult for future
> optimizations of the multiple pshufd ops that are still in the
> vector-shuffle-128-v4.ll tests.
>

? If we want to fold things, it should happen before we're doing ISel
pattern expansion...