[PATCH] [X86][SSE] Keep 4i32 vector insertions in integer domain on pre-SSE4.1 targets

Sun Dec 7 09:46:38 PST 2014

On Sun, Dec 7, 2014 at 6:34 PM, Simon Pilgrim <llvm-dev at redking.me.uk>
wrote:

> On 7 Dec 2014, at 16:50, Chandler Carruth <chandlerc at gmail.com> wrote:
>
> On Sun, Dec 7, 2014 at 5:35 PM, Simon Pilgrim <llvm-dev at redking.me.uk>
> wrote:
>
>> Added X86vzmovl folded loads tests.
>>
>> I looked at using a pand with a constant mask as an alternative and saw a
>> minimal regression (nearly in the noise) compared to the movq/movss
>> versions I was already testing against. I'm worried about pursuing that
>> route though - it adds additional memory access
>>
>
> I would be *really* surprised if pand is actually slower, especially if we
> get a chance to hoist the memory access into a variable within a loop.
>
> pand should be 2 uops, and according to Agner's tables has a throughput of
> 2/cycle. The pshufd alone is 1 uop and 2/cycle. The movq is also 1 uop with
> 2/cycle. But because they're in a chain, the critical path is 2 cycles
> instead of 1 cycle here (assuming the load doesn't stall, but I think
> that's usually a safe assumption in real world code).
>
> On x86, loads (especially of constants) are *crazy* fast in my experience.
>
>
> Yes, if I hoist the mask load it's definitely faster - but if I leave it
> folded into the pand I don’t see any difference.
>

Sure. But lowering with pand should allow coalescing the constant
load and hoisting it out of a loop, no? And if it doesn't happen to get
hoisted then, as you say, it's no different. That's why I would prefer the
pand lowering, I think.
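
As a rough illustration of the two lowerings being compared (not the code
from the patch), here is a minimal sketch in C using SSE2 intrinsics. The
function names are hypothetical. Both functions keep element 0 of each
4 x i32 vector and zero the other three: the first with a pand against a
constant mask built once outside the loop, the second with a movss-style
merge into a zero register, which is a floating-point-domain operation. A
compiler is free to pick different instructions for either one, so this
only shows the intent.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */
#include <stddef.h>
#include <stdint.h>

/* pand variant: AND each vector with a constant mask that keeps lane 0.
   The mask is created before the loop, modelling a hoisted constant load. */
void keep_lane0_pand(const int32_t *src, int32_t *dst, size_t nvec) {
    /* _mm_set_epi32 takes lanes from highest to lowest: lane 0 = all ones. */
    const __m128i mask = _mm_set_epi32(0, 0, 0, -1);
    for (size_t i = 0; i < nvec; ++i) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 4 * i));
        _mm_storeu_si128((__m128i *)(dst + 4 * i), _mm_and_si128(v, mask));
    }
}

/* movss variant: merge lane 0 of the input into a zero vector.
   No constant is needed, but the merge is a float-domain instruction. */
void keep_lane0_movss(const int32_t *src, int32_t *dst, size_t nvec) {
    for (size_t i = 0; i < nvec; ++i) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 4 * i));
        __m128 merged = _mm_move_ss(_mm_setzero_ps(), _mm_castsi128_ps(v));
        _mm_storeu_si128((__m128i *)(dst + 4 * i), _mm_castps_si128(merged));
    }
}
```

In the pand variant the mask lives in a register for the whole loop, which
is the hoisting case discussed above. If the compiler instead folds the
constant load into each pand, the per-iteration cost should be comparable
to the second variant.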

>
>> and the mask approach might make it more difficult for future
>> optimizations of the multiple pshufd ops that are still in the
>> vector-shuffle-128-v4.ll tests.
>>
>
> ? If we want to fold things, it should happen before we're doing ISel
> pattern expansion…
>
>
> Yes, you’re right - it’s just that the last shuffle/byteshift
> in lowerVectorShuffleAsElementInsertion isn’t doing us any favours, as we’re
> making no attempt to fold it with the VZEXT_MOVL that we’ve just generated.
> I could just modify lowerVectorShuffleAsElementInsertion to try and have it
> create something more suitable - overriding most of the patterns (doing
> much of the VZEXT_MOVL work in code). Any better ideas? I wasn’t intending
> to spend so long on this pre-SSE4.1 code...

Sure, I'm not asking you to fix this in this patch. I'm just saying I don't
think the pand lowering really makes this better or worse -- if we want to
improve it, it'll have to happen in the lowering code.