[llvm] r217744 - [x86] Teach the new vector shuffle lowering to use BLENDPS and BLENDPD.

Mon Sep 15 04:22:32 PDT 2014

On Mon, Sep 15, 2014 at 3:43 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> I guess you are referring to this test in particular:
>
>  ; SSE41-LABEL: @shuffle_v4f32_4zzz
> -; SSE41:         insertps {{.*}} # xmm0 = xmm0[0],zero,zero,zero
> +; SSE41:         xorps %[[X:xmm[0-9]+]], %[[X]]
> +; SSE41-NEXT:    blendps {{.*}} # [[X]] = xmm0[0],[[X]][1,2,3]
> +; SSE41-NEXT:    movaps %[[X]], %xmm0
>  ; SSE41-NEXT:    retq
>
> If we commute the blendps then we can get rid of the extra movaps.
>

Exactly.

>
> Also, in this test, I am not sure if a xorps+blendps would do better
> than a single insertps. In case, we might want to check if one of the
> operands to the blend is a vector of all-zeros and prefer a single
> insertps instead of a xorps+blendps combo.
>

I thought a lot about this. Agner's tables seem to indicate that 'xorps' of
a register with itself is somehow crazy fast. This to a certain extent
makes sense. Plus, blendps can be executed on either of 2 ports while
insertps requires both ports to be occupied (for Sandy Bridge, similarly on
other architectures from what I can see). So I think we'll see higher
throughput with the xorps+blendps sequence than the insertps sequence. But
all this is based on Agner's tables. If you have direct measurements that
indicate the reverse, it's easy to fix.
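
To make the comparison concrete, here is a rough scalar model of the lane
semantics of the sequences discussed above. It is only a sketch: the function
names are hypothetical, and the immediates (0x0E for insertps, 0x1 and 0xE for
blendps) are my own reading of the shuffle, since the test matches them with
{{.*}}. It models which lanes end up where, not latency or port usage.

```python
def insertps(dst, src, imm8):
    """Lane model of SSE4.1 INSERTPS (register form) on 4-element lists."""
    src_lane = (imm8 >> 6) & 0x3   # bits 7:6 pick the source lane
    dst_lane = (imm8 >> 4) & 0x3   # bits 5:4 pick the destination lane
    zero_mask = imm8 & 0xF         # bits 3:0 zero the matching result lanes
    out = list(dst)
    out[dst_lane] = src[src_lane]
    return [0.0 if (zero_mask >> i) & 1 else out[i] for i in range(4)]


def blendps(dst, src, imm8):
    """Lane model of SSE4.1 BLENDPS: bit i set takes lane i from src, else dst."""
    return [src[i] if (imm8 >> i) & 1 else dst[i] for i in range(4)]


def zero_upper_insertps(x):
    # Single instruction: insert x[0] into lane 0 and zero lanes 1..3.
    return insertps(x, x, 0x0E)


def zero_upper_blendps(x):
    # xorps + blendps with the zero register as destination; in real code the
    # result then has to be copied back into xmm0 with movaps.
    zero = [0.0] * 4
    return blendps(zero, x, 0x1)


def zero_upper_blendps_commuted(x):
    # Commuted form: x's register is the destination and the mask is inverted,
    # so the result is already in place and no movaps is needed.
    zero = [0.0] * 4
    return blendps(x, zero, 0xE)
```

All three helpers produce the same lanes for any input; the commuted blend
differs only in which operand plays the destination role.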

I'll look at your example separately.