[PATCH] D38472: [X86][SSE] Add support for lowering shuffles to PACKSS/PACKUS
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Oct 2 17:58:14 PDT 2017
pcordes accepted this revision.
pcordes added a comment.
This revision is now accepted and ready to land.
Looks like many significant improvements, but there are a couple of possible regressions where we now get a shift+pack instead of a single pshufb, e.g. in trunc16i32_16i8_lshr, trunc8i32_8i16_lshr, and a couple of other cases.
================
Comment at: test/CodeGen/X86/shuffle-strided-with-offset-256.ll:92
+; AVX2-NEXT: vpsrld $16, %ymm0, %ymm0
+; AVX2-NEXT: vpackusdw %ymm0, %ymm0, %ymm0
; AVX2-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
----------------
Possible regression here if this happens in a loop: saving the load of a pshufb vector constant may be worth it for a one-off, but vpsrld + vpackusdw is pretty much always worse for throughput than a single vpshufb.
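Roughly the shape I mean, written as C intrinsics rather than the test's asm; the shuffle control here is my own guess, not the old codegen's constant:
```c
#include <immintrin.h>

/* Sketch only: extract the high 16 bits of each 32-bit element with one
 * in-lane byte shuffle plus the usual cross-lane fixup, instead of
 * vpsrld + vpackusdw.  The shuffle mask is a guess. */
static __m128i high16_of_each_dword(__m256i v) {
    const __m256i ctrl = _mm256_setr_epi8(
        2, 3, 6, 7, 10, 11, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1,
        2, 3, 6, 7, 10, 11, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1);
    __m256i t = _mm256_shuffle_epi8(v, ctrl);   /* vpshufb */
    t = _mm256_permute4x64_epi64(t, 0x08);      /* vpermq: gather the two lanes */
    return _mm256_castsi256_si128(t);
}
```
In a loop the mask load gets hoisted, so this is just two shuffle uops versus shift + pack + vpermq for the new code.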
================
Comment at: test/CodeGen/X86/sse2-intrinsics-x86.ll:687
; SSE: ## BB#0:
-; SSE-NEXT: pxor %xmm0, %xmm0 ## encoding: [0x66,0x0f,0xef,0xc0]
-; SSE-NEXT: packssdw LCPI32_0, %xmm0 ## encoding: [0x66,0x0f,0x6b,0x05,A,A,A,A]
-; SSE-NEXT: ## fixup A - offset: 4, value: LCPI32_0, kind: FK_Data_4
+; SSE-NEXT: movaps {{.*#+}} xmm0 = [0,0,0,0,32767,32767,65535,32768]
+; SSE-NEXT: ## encoding: [0x0f,0x28,0x05,A,A,A,A]
----------------
Apparently constant propagation through packssdw-with-zero wasn't working before, but this fixes it.
================
Comment at: test/CodeGen/X86/vector-compare-results.ll:3553
+; SSE42-NEXT: punpcklqdq {{.*#+}} xmm4 = xmm4[0],xmm5[0]
+; SSE42-NEXT: packsswb %xmm6, %xmm4
+; SSE42-NEXT: pextrb $15, %xmm4, %eax
----------------
Packing into a single vector is a waste if we're still going to pextrb each element separately, and do a bunch of dead stores to `2(%rdi)`... what the heck is going on here? Surely the pextr/and/mov asm is total garbage that we really don't want, right?
BTW, `psrlw $15, %xmm6` before packing from words to bytes turns the 0/-1 compare words into 0/1, which avoids the need for `and $1`, so you could pextrb directly to memory.
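A hypothetical intrinsics sketch of that idea (the function name, register choice, and which bytes get stored are made up; only the psrlw-before-pack trick is the point):
```c
#include <immintrin.h>
#include <stdint.h>

/* Shift the 0/-1 compare words down to 0/1 *before* packing to bytes,
 * so the packed bytes are already 0/1 and pextrb can store them
 * directly without an `and $1`. */
static void store_cmp_bytes(__m128i cmp_lo, __m128i cmp_hi, uint8_t *out) {
    cmp_lo = _mm_srli_epi16(cmp_lo, 15);           /* psrlw $15: 0/-1 -> 0/1 */
    cmp_hi = _mm_srli_epi16(cmp_hi, 15);
    __m128i b = _mm_packs_epi16(cmp_lo, cmp_hi);   /* packsswb: now 0/1 bytes */
    out[0]  = (uint8_t)_mm_extract_epi8(b, 0);     /* pextrb straight to memory */
    out[15] = (uint8_t)_mm_extract_epi8(b, 15);
}
```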
================
Comment at: test/CodeGen/X86/vector-trunc.ll:409
; SSE41-NEXT: psrad $16, %xmm0
; SSE41-NEXT: psrad $16, %xmm1
+; SSE41-NEXT: packssdw %xmm0, %xmm1
----------------
If I'm understanding this function right, there's still a big missed optimization:
```
psrad $16, %xmm0 # get the words we want aligned with the garbage in xmm1
pblendw $alternating, %xmm1, %xmm0
pshufb (fix the order), %xmm0
ret
```
But this patch isn't trying to fix that. TODO: report this separately.
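For what it's worth, here's roughly the same sequence as intrinsics; the blend immediate and the pshufb constant are my guesses and assume the first register holds elements 0-3 and the second elements 4-7, which I haven't checked against the test's actual register assignment:
```c
#include <immintrin.h>

/* Rough intrinsics version of the 3-instruction asm sketch above. */
static __m128i trunc8i32_hi16_blend(__m128i a, __m128i b) {
    __m128i a_shifted = _mm_srai_epi32(a, 16);           /* psrad $16 */
    /* odd word slots from b (its high halves are already there),
       even word slots from the shifted a */
    __m128i mixed = _mm_blend_epi16(a_shifted, b, 0xAA); /* pblendw */
    const __m128i fix = _mm_setr_epi8(                   /* deinterleave the words */
        0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15);
    return _mm_shuffle_epi8(mixed, fix);                 /* pshufb */
}
```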
================
Comment at: test/CodeGen/X86/vector-trunc.ll:495
+; SSE41-NEXT: packusdw %xmm0, %xmm1
+; SSE41-NEXT: packusdw %xmm0, %xmm0
; SSE41-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
----------------
This is questionable: 2x shift + 2x pack + punpcklqdq is probably worse than 2x pshufb + punpcklqdq.
Even better (if register pressure allows the extra constant) would be 2x pshufb + por, with two different shuffle masks that leave the wanted data in the high or low half and zero the other half, so the merge is a cheap bitwise OR instead of another shuffle.
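Something like this, as an intrinsics sketch (the shuffle constants are guesses for the take-the-high-16-bits-of-each-dword case; the point is just that por replaces the unpack):
```c
#include <immintrin.h>

/* One mask drops a's wanted words into the low half and zeroes the rest,
 * the other drops b's wanted words into the high half, so a single por
 * merges them with no shifts, packs, or unpacks. */
static __m128i trunc8i32_hi16_por(__m128i a, __m128i b) {
    const __m128i lo_mask = _mm_setr_epi8(
        2, 3, 6, 7, 10, 11, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1);
    const __m128i hi_mask = _mm_setr_epi8(
        -1, -1, -1, -1, -1, -1, -1, -1, 2, 3, 6, 7, 10, 11, 14, 15);
    __m128i lo = _mm_shuffle_epi8(a, lo_mask);  /* pshufb: zero the high half */
    __m128i hi = _mm_shuffle_epi8(b, hi_mask);  /* pshufb: zero the low half */
    return _mm_or_si128(lo, hi);                /* por */
}
```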
Repository:
rL LLVM
https://reviews.llvm.org/D38472