[PATCH] D38472: [X86][SSE] Add support for lowering shuffles to PACKSS/PACKUS
Simon Pilgrim via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Oct 3 03:35:46 PDT 2017
RKSimon added a comment.
Added comments before I commit. The remaining regressions should be handled once we enable shuffle combining to create PACKSS/PACKUS nodes (as well as combining from them), but that can only be done after this lowering has landed.
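Worth spelling out for anyone skimming the thread: both pack instructions saturate, so (roughly) this lowering is only valid when we can prove saturation cannot change any lane. A scalar C sketch of the per-lane semantics, for illustration only (this is not the DAG matching code):
```
#include <stdint.h>

/* PACKSSDW per-lane: clamp an i32 to the signed i16 range. */
static int16_t packss_lane(int32_t x) {
  return (int16_t)(x > INT16_MAX ? INT16_MAX : x < INT16_MIN ? INT16_MIN : x);
}

/* PACKUSDW per-lane: clamp an i32 to the unsigned u16 range. */
static uint16_t packus_lane(int32_t x) {
  return (uint16_t)(x > (int32_t)UINT16_MAX ? UINT16_MAX : x < 0 ? 0 : x);
}
```
When the inputs are already known to be in range (e.g. after an arithmetic or logical shift right by 16), these clamps degenerate to a plain truncation, which is what the shuffle lowering exploits.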
================
Comment at: test/CodeGen/X86/shuffle-strided-with-offset-256.ll:92
+; AVX2-NEXT: vpsrld $16, %ymm0, %ymm0
+; AVX2-NEXT: vpackusdw %ymm0, %ymm0, %ymm0
; AVX2-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
----------------
pcordes wrote:
> Possible regression here, if this happens in a loop. Saving a pshufb vector constant may be worth it for a one-off, but vpsrld + vpackusdw is pretty much always worse for throughput than vpshufb.
This is a typical example of separate shuffle (and shuffle-like) instructions now falling below the "3 ops" limit at which we combine them into a single variable-mask shuffle. That threshold really needs to be driven by the scheduler model, but we're nowhere close to supporting that yet.
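For reference, the two competing sequences in AVX2 intrinsics form (an illustrative sketch; the pshufb mask below is my reconstruction of the odd-word extract, not copied from the test file):
```
#include <immintrin.h>

/* New codegen: shift + pack + cross-lane fixup - no mask constant, but
 * two ops where vpshufb needs one. */
static __m256i odd_words_shift_pack(__m256i v) {
  v = _mm256_srli_epi32(v, 16);             /* vpsrld $16           */
  v = _mm256_packus_epi32(v, v);            /* vpackusdw            */
  return _mm256_permute4x64_epi64(v, 0xE8); /* vpermq ymm0[0,2,2,3] */
}

/* pshufb alternative: one in-lane byte shuffle with a loaded mask
 * (hoistable out of a loop), then the same vpermq. */
static __m256i odd_words_pshufb(__m256i v) {
  const __m256i m = _mm256_setr_epi8(
      2, 3, 6, 7, 10, 11, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1,
      2, 3, 6, 7, 10, 11, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1);
  v = _mm256_shuffle_epi8(v, m);            /* vpshufb, per 128-bit lane */
  return _mm256_permute4x64_epi64(v, 0xE8); /* vpermq to join the halves */
}
```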
================
Comment at: test/CodeGen/X86/sse2-intrinsics-x86.ll:687
; SSE: ## BB#0:
-; SSE-NEXT: pxor %xmm0, %xmm0 ## encoding: [0x66,0x0f,0xef,0xc0]
-; SSE-NEXT: packssdw LCPI32_0, %xmm0 ## encoding: [0x66,0x0f,0x6b,0x05,A,A,A,A]
-; SSE-NEXT: ## fixup A - offset: 4, value: LCPI32_0, kind: FK_Data_4
+; SSE-NEXT: movaps {{.*#+}} xmm0 = [0,0,0,0,32767,32767,65535,32768]
+; SSE-NEXT: ## encoding: [0x0f,0x28,0x05,A,A,A,A]
----------------
pcordes wrote:
> Apparently constant propagation through packssdw-with-zero wasn't working before, but this fixes it.
I pushed this as a separate commit at rL314776.
================
Comment at: test/CodeGen/X86/vector-compare-results.ll:3553
+; SSE42-NEXT: punpcklqdq {{.*#+}} xmm4 = xmm4[0],xmm5[0]
+; SSE42-NEXT: packsswb %xmm6, %xmm4
+; SSE42-NEXT: pextrb $15, %xmm4, %eax
----------------
pcordes wrote:
> packing into a single vector is a waste if we're still going to pextrb each element separately, and do a bunch of dead stores to `2(%rdi)`... what the heck is going on here? Surely the pextr/and/mov asm is total garbage that we really don't want, right?
>
> BTW, `psrlw $15, %xmm6` before packing from words to bytes will avoid the need for `and $1`, so you could extract directly to memory.
>
This codegen is still a joke - it's doing nothing but demonstrating how badly we handle boolean vectors. If you look below you'll see that every single extracted value is stored to the same byte of memory.
See https://bugs.llvm.org/show_bug.cgi?id=31265
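For the record, Peter's `psrlw $15` suggestion in SSE2 intrinsics (a sketch; the helper name is made up): compare results are 0x0000/0xFFFF per word, and a logical right shift by 15 turns those into 0/1, so after packing to bytes each element could be extracted straight to memory with no `and $1`:
```
#include <emmintrin.h>

/* 0xFFFF >> 15 == 1, 0x0000 >> 15 == 0; the 0/1 values fit in i8, so
 * packsswb's signed saturation is a no-op and the packed bytes are
 * already the final boolean values. */
static __m128i bool_words_to_bytes(__m128i cmp_lo, __m128i cmp_hi) {
  cmp_lo = _mm_srli_epi16(cmp_lo, 15);    /* psrlw $15 */
  cmp_hi = _mm_srli_epi16(cmp_hi, 15);
  return _mm_packs_epi16(cmp_lo, cmp_hi); /* packsswb  */
}
```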
================
Comment at: test/CodeGen/X86/vector-trunc.ll:409
; SSE41-NEXT: psrad $16, %xmm0
; SSE41-NEXT: psrad $16, %xmm1
+; SSE41-NEXT: packssdw %xmm0, %xmm1
----------------
pcordes wrote:
> If I'm understanding this function right, there's still a big missed optimization:
>
> ```
> psrad $16, %xmm0 # get the words we want aligned with the garbage in xmm1
> pblendw $alternating, %xmm1, %xmm0
> pshufb (fix the order), %xmm0
> ret
> ```
>
> But this patch isn't trying to fix that. TODO: report this separately.
Even better, it should just combine to:
```
psrad $16, %xmm0
psrad $16, %xmm1
packssdw %xmm1, %xmm0
```
That should be handled when we enable shuffle combining to create PACKSS/PACKUS nodes (and not just lowering to them).
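Roughly, in SSE2 intrinsics (a sketch, assuming the test truncates the high words of eight i32 lanes): after `psrad $16` every lane holds a sign-extended i16, so PACKSSDW's signed saturation can never fire and the pack is a pure truncation:
```
#include <emmintrin.h>

static __m128i trunc8i32_high(__m128i x0, __m128i x1) {
  x0 = _mm_srai_epi32(x0, 16);    /* psrad $16, %xmm0       */
  x1 = _mm_srai_epi32(x1, 16);    /* psrad $16, %xmm1       */
  return _mm_packs_epi32(x0, x1); /* packssdw %xmm1, %xmm0  */
}
```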
================
Comment at: test/CodeGen/X86/vector-trunc.ll:495
+; SSE41-NEXT: packusdw %xmm0, %xmm1
+; SSE41-NEXT: packusdw %xmm0, %xmm0
; SSE41-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
----------------
pcordes wrote:
> This is questionable. 2x shift + 2x pack + punpck is probably worse than 2x pshufb / punpck.
>
> Even better (if register pressure allows) would be 2x pshufb / POR, with 2 different shuffle-masks that leave the data high or low and zero the other half.
I reckon we should be able to combine this to:
```
psrld $16, %xmm0
psrld $16, %xmm1
packusdw %xmm1, %xmm0
```
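Same idiom with the unsigned pack, as an SSE4.1 intrinsics sketch (PACKUSDW requires SSE4.1, which matches the SSE41 check prefix here): `psrld $16` leaves zero-extended words, so PACKUSDW's unsigned saturation is a no-op:
```
#include <smmintrin.h>

static __m128i trunc8i32_high_zext(__m128i x0, __m128i x1) {
  x0 = _mm_srli_epi32(x0, 16);     /* psrld $16, %xmm0       */
  x1 = _mm_srli_epi32(x1, 16);     /* psrld $16, %xmm1       */
  return _mm_packus_epi32(x0, x1); /* packusdw %xmm1, %xmm0  */
}
```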
Repository:
rL LLVM
https://reviews.llvm.org/D38472