[PATCH] D38472: [X86][SSE] Add support for lowering shuffles to PACKSS/PACKUS
Simon Pilgrim via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Oct 3 03:35:46 PDT 2017
RKSimon added a comment.
Added comments before I commit. The remaining regressions should be handled once we enable shuffle combining to create PACKSS/PACKUS nodes (as well as combining from them), but that can only be done after this lowering has landed.
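Worth spelling out for anyone skimming the thread: both pack instructions saturate, so (roughly) this lowering is only valid when we can prove saturation cannot change any lane. A scalar C sketch of the per-lane semantics, for illustration only (this is not the DAG matching code):
```
#include <stdint.h>

/* PACKSSDW per-lane: clamp an i32 to the signed i16 range. */
static int16_t packss_lane(int32_t x) {
  return (int16_t)(x > INT16_MAX ? INT16_MAX : x < INT16_MIN ? INT16_MIN : x);
}

/* PACKUSDW per-lane: clamp an i32 to the unsigned u16 range. */
static uint16_t packus_lane(int32_t x) {
  return (uint16_t)(x > (int32_t)UINT16_MAX ? UINT16_MAX : x < 0 ? 0 : x);
}
```
When the inputs are already known to be in range (e.g. after an arithmetic or logical shift right by 16), these clamps degenerate to a plain truncation, which is what the shuffle lowering exploits.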
================
Comment at: test/CodeGen/X86/shuffle-strided-with-offset-256.ll:92
+; AVX2-NEXT: vpsrld $16, %ymm0, %ymm0
+; AVX2-NEXT: vpackusdw %ymm0, %ymm0, %ymm0
; AVX2-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
----------------
pcordes wrote:
> Possible regression here, if this happens in a loop. Saving a pshufb vector constant may be worth it for a one-off, but vpsrld + vpackusdw is pretty much always worse for throughput than vpshufb.
This is a typical example of separate shuffle (and shuffle-like) instructions now falling below the "3 ops" limit at which we combine them into a single variable-mask shuffle. That threshold really needs to be driven by the scheduler model, but we're nowhere close to supporting that yet.
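For reference, the two competing sequences in AVX2 intrinsics form (an illustrative sketch; the pshufb mask below is my reconstruction of the odd-word extract, not copied from the test file):
```
#include <immintrin.h>

/* New codegen: shift + pack + cross-lane fixup - no mask constant, but
 * two ops where vpshufb needs one. */
static __m256i odd_words_shift_pack(__m256i v) {
  v = _mm256_srli_epi32(v, 16);             /* vpsrld $16           */
  v = _mm256_packus_epi32(v, v);            /* vpackusdw            */
  return _mm256_permute4x64_epi64(v, 0xE8); /* vpermq ymm0[0,2,2,3] */
}

/* pshufb alternative: one in-lane byte shuffle with a loaded mask
 * (hoistable out of a loop), then the same vpermq. */
static __m256i odd_words_pshufb(__m256i v) {
  const __m256i m = _mm256_setr_epi8(
      2, 3, 6, 7, 10, 11, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1,
      2, 3, 6, 7, 10, 11, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1);
  v = _mm256_shuffle_epi8(v, m);            /* vpshufb, per 128-bit lane */
  return _mm256_permute4x64_epi64(v, 0xE8); /* vpermq to join the halves */
}
```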
================
Comment at: test/CodeGen/X86/sse2-intrinsics-x86.ll:687
; SSE: ## BB#0:
-; SSE-NEXT: pxor %xmm0, %xmm0 ## encoding: [0x66,0x0f,0xef,0xc0]
-; SSE-NEXT: packssdw LCPI32_0, %xmm0 ## encoding: [0x66,0x0f,0x6b,0x05,A,A,A,A]
-; SSE-NEXT: ## fixup A - offset: 4, value: LCPI32_0, kind: FK_Data_4
+; SSE-NEXT: movaps {{.*#+}} xmm0 = [0,0,0,0,32767,32767,65535,32768]
+; SSE-NEXT: ## encoding: [0x0f,0x28,0x05,A,A,A,A]
----------------
pcordes wrote:
> Apparently constant propagation through packssdw-with-zero wasn't working before, but this fixes it.
I pushed this as a separate commit at rL314776.
================
Comment at: test/CodeGen/X86/vector-compare-results.ll:3553
+; SSE42-NEXT: punpcklqdq {{.*#+}} xmm4 = xmm4[0],xmm5[0]
+; SSE42-NEXT: packsswb %xmm6, %xmm4
+; SSE42-NEXT: pextrb $15, %xmm4, %eax
----------------
pcordes wrote:
> packing into a single vector is a waste if we're still going to pextrb each element separately, and do a bunch of dead stores to `2(%rdi)`... what the heck is going on here? Surely the pextr/and/mov asm is total garbage that we really don't want, right?
>
> BTW, `psrlw $15, %xmm6` before packing from words to bytes will avoid the need for `and $1`, so you could extract directly to memory.
>
This codegen is still a joke - it's doing nothing but demonstrating how badly we handle boolean vectors. If you look below you'll see that every single extracted value is stored to the same byte of memory.
See https://bugs.llvm.org/show_bug.cgi?id=31265
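For the record, Peter's `psrlw $15` suggestion in SSE2 intrinsics (a sketch; the helper name is made up): compare results are 0x0000/0xFFFF per word, and a logical right shift by 15 turns those into 0/1, so after packing to bytes each element could be extracted straight to memory with no `and $1`:
```
#include <emmintrin.h>

/* 0xFFFF >> 15 == 1, 0x0000 >> 15 == 0; the 0/1 values fit in i8, so
 * packsswb's signed saturation is a no-op and the packed bytes are
 * already the final boolean values. */
static __m128i bool_words_to_bytes(__m128i cmp_lo, __m128i cmp_hi) {
  cmp_lo = _mm_srli_epi16(cmp_lo, 15);    /* psrlw $15 */
  cmp_hi = _mm_srli_epi16(cmp_hi, 15);
  return _mm_packs_epi16(cmp_lo, cmp_hi); /* packsswb  */
}
```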
================
Comment at: test/CodeGen/X86/vector-trunc.ll:409
; SSE41-NEXT: psrad $16, %xmm0
; SSE41-NEXT: psrad $16, %xmm1
+; SSE41-NEXT: packssdw %xmm0, %xmm1
----------------
pcordes wrote:
> If I'm understanding this function right, there's still a big missed optimization:
>
> ```
> psrad $16, %xmm0 # get the words we want aligned with the garbage in xmm1
> pblendw $alternating, %xmm1, %xmm0
> pshufb (fix the order), %xmm0
> ret
> ```
>
> But this patch isn't trying to fix that. TODO: report this separately.
Even better, it should just combine to:
```
psrad $16, %xmm0
psrad $16, %xmm1
packssdw %xmm1, %xmm0
```
That should be handled when we enable shuffle combining to create PACKSS/PACKUS nodes (and not just lowering to them).
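Roughly, in SSE2 intrinsics (a sketch, assuming the test truncates the high words of eight i32 lanes): after `psrad $16` every lane holds a sign-extended i16, so PACKSSDW's signed saturation can never fire and the pack is a pure truncation:
```
#include <emmintrin.h>

static __m128i trunc8i32_high(__m128i x0, __m128i x1) {
  x0 = _mm_srai_epi32(x0, 16);    /* psrad $16, %xmm0       */
  x1 = _mm_srai_epi32(x1, 16);    /* psrad $16, %xmm1       */
  return _mm_packs_epi32(x0, x1); /* packssdw %xmm1, %xmm0  */
}
```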
================
Comment at: test/CodeGen/X86/vector-trunc.ll:495
+; SSE41-NEXT: packusdw %xmm0, %xmm1
+; SSE41-NEXT: packusdw %xmm0, %xmm0
; SSE41-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
----------------
pcordes wrote:
> This is questionable. 2x shift + 2x pack + punpck is probably worse than 2x pshufb / punpck.
>
> Even better (if register pressure allows) would be 2x pshufb / POR, with 2 different shuffle-masks that leave the data high or low and zero the other half.
I reckon we should be able to combine this to:
```
psrld $16, %xmm0
psrld $16, %xmm1
packusdw %xmm1, %xmm0
```
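Same idiom with the unsigned pack, as an SSE4.1 intrinsics sketch (PACKUSDW requires SSE4.1, which matches the SSE41 check prefix here): `psrld $16` leaves zero-extended words, so PACKUSDW's unsigned saturation is a no-op:
```
#include <smmintrin.h>

static __m128i trunc8i32_high_zext(__m128i x0, __m128i x1) {
  x0 = _mm_srli_epi32(x0, 16);     /* psrld $16, %xmm0       */
  x1 = _mm_srli_epi32(x1, 16);     /* psrld $16, %xmm1       */
  return _mm_packus_epi32(x0, x1); /* packusdw %xmm1, %xmm0  */
}
```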
Repository:
rL LLVM
https://reviews.llvm.org/D38472