[PATCH] D38472: [X86][SSE] Add support for lowering shuffles to PACKSS/PACKUS
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Oct 2 17:58:14 PDT 2017
pcordes accepted this revision.
pcordes added a comment.
This revision is now accepted and ready to land.
Looks like many significant improvements, but there are a couple of possible regressions where we now get a shift+pack instead of a single pshufb, e.g. in trunc16i32_16i8_lshr, trunc8i32_8i16_lshr, and a couple of other cases.
================
Comment at: test/CodeGen/X86/shuffle-strided-with-offset-256.ll:92
+; AVX2-NEXT: vpsrld $16, %ymm0, %ymm0
+; AVX2-NEXT: vpackusdw %ymm0, %ymm0, %ymm0
; AVX2-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,2,2,3]
----------------
Possible regression here if this happens in a loop: saving the load of a pshufb vector constant may be worth it for a one-off, but vpsrld + vpackusdw is pretty much always worse for throughput than a single vpshufb.
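Roughly the shape I mean, written as C intrinsics rather than the test's asm; the shuffle control here is my own guess, not the old codegen's constant:
```c
#include <immintrin.h>

/* Sketch only: extract the high 16 bits of each 32-bit element with one
 * in-lane byte shuffle plus the usual cross-lane fixup, instead of
 * vpsrld + vpackusdw.  The shuffle mask is a guess. */
static __m128i high16_of_each_dword(__m256i v) {
    const __m256i ctrl = _mm256_setr_epi8(
        2, 3, 6, 7, 10, 11, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1,
        2, 3, 6, 7, 10, 11, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1);
    __m256i t = _mm256_shuffle_epi8(v, ctrl);   /* vpshufb */
    t = _mm256_permute4x64_epi64(t, 0x08);      /* vpermq: gather the two lanes */
    return _mm256_castsi256_si128(t);
}
```
In a loop the mask load gets hoisted, so this is just two shuffle uops versus shift + pack + vpermq for the new code.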
================
Comment at: test/CodeGen/X86/sse2-intrinsics-x86.ll:687
; SSE: ## BB#0:
-; SSE-NEXT: pxor %xmm0, %xmm0 ## encoding: [0x66,0x0f,0xef,0xc0]
-; SSE-NEXT: packssdw LCPI32_0, %xmm0 ## encoding: [0x66,0x0f,0x6b,0x05,A,A,A,A]
-; SSE-NEXT: ## fixup A - offset: 4, value: LCPI32_0, kind: FK_Data_4
+; SSE-NEXT: movaps {{.*#+}} xmm0 = [0,0,0,0,32767,32767,65535,32768]
+; SSE-NEXT: ## encoding: [0x0f,0x28,0x05,A,A,A,A]
----------------
Apparently constant propagation through packssdw-with-zero wasn't working before, but this fixes it.
================
Comment at: test/CodeGen/X86/vector-compare-results.ll:3553
+; SSE42-NEXT: punpcklqdq {{.*#+}} xmm4 = xmm4[0],xmm5[0]
+; SSE42-NEXT: packsswb %xmm6, %xmm4
+; SSE42-NEXT: pextrb $15, %xmm4, %eax
----------------
Packing into a single vector is a waste if we're still going to pextrb each element separately, and do a bunch of dead stores to `2(%rdi)`... what the heck is going on here? Surely the pextr/and/mov asm is total garbage that we really don't want, right?
BTW, `psrlw $15, %xmm6` before packing from words to bytes turns the 0/-1 compare words into 0/1, which avoids the need for `and $1`, so you could pextrb directly to memory.
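A hypothetical intrinsics sketch of that idea (the function name, register choice, and which bytes get stored are made up; only the psrlw-before-pack trick is the point):
```c
#include <immintrin.h>
#include <stdint.h>

/* Shift the 0/-1 compare words down to 0/1 *before* packing to bytes,
 * so the packed bytes are already 0/1 and pextrb can store them
 * directly without an `and $1`. */
static void store_cmp_bytes(__m128i cmp_lo, __m128i cmp_hi, uint8_t *out) {
    cmp_lo = _mm_srli_epi16(cmp_lo, 15);           /* psrlw $15: 0/-1 -> 0/1 */
    cmp_hi = _mm_srli_epi16(cmp_hi, 15);
    __m128i b = _mm_packs_epi16(cmp_lo, cmp_hi);   /* packsswb: now 0/1 bytes */
    out[0]  = (uint8_t)_mm_extract_epi8(b, 0);     /* pextrb straight to memory */
    out[15] = (uint8_t)_mm_extract_epi8(b, 15);
}
```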
================
Comment at: test/CodeGen/X86/vector-trunc.ll:409
; SSE41-NEXT: psrad $16, %xmm0
; SSE41-NEXT: psrad $16, %xmm1
+; SSE41-NEXT: packssdw %xmm0, %xmm1
----------------
If I'm understanding this function right, there's still a big missed optimization:
```
psrad $16, %xmm0 # get the words we want aligned with the garbage in xmm1
pblendw $alternating, %xmm1, %xmm0
pshufb (fix the order), %xmm0
ret
```
But this patch isn't trying to fix that. TODO: report this separately.
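For what it's worth, here's roughly the same sequence as intrinsics; the blend immediate and the pshufb constant are my guesses and assume the first register holds elements 0-3 and the second elements 4-7, which I haven't checked against the test's actual register assignment:
```c
#include <immintrin.h>

/* Rough intrinsics version of the 3-instruction asm sketch above. */
static __m128i trunc8i32_hi16_blend(__m128i a, __m128i b) {
    __m128i a_shifted = _mm_srai_epi32(a, 16);           /* psrad $16 */
    /* odd word slots from b (its high halves are already there),
       even word slots from the shifted a */
    __m128i mixed = _mm_blend_epi16(a_shifted, b, 0xAA); /* pblendw */
    const __m128i fix = _mm_setr_epi8(                   /* deinterleave the words */
        0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15);
    return _mm_shuffle_epi8(mixed, fix);                 /* pshufb */
}
```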
================
Comment at: test/CodeGen/X86/vector-trunc.ll:495
+; SSE41-NEXT: packusdw %xmm0, %xmm1
+; SSE41-NEXT: packusdw %xmm0, %xmm0
; SSE41-NEXT: punpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
----------------
This is questionable: 2x shift + 2x pack + punpcklqdq is probably worse than 2x pshufb + punpcklqdq.
Even better (if register pressure allows the extra constant) would be 2x pshufb + por, with two different shuffle masks that leave the wanted data in the high or low half and zero the other half, so the merge is a cheap bitwise OR instead of another shuffle.
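Something like this, as an intrinsics sketch (the shuffle constants are guesses for the take-the-high-16-bits-of-each-dword case; the point is just that por replaces the unpack):
```c
#include <immintrin.h>

/* One mask drops a's wanted words into the low half and zeroes the rest,
 * the other drops b's wanted words into the high half, so a single por
 * merges them with no shifts, packs, or unpacks. */
static __m128i trunc8i32_hi16_por(__m128i a, __m128i b) {
    const __m128i lo_mask = _mm_setr_epi8(
        2, 3, 6, 7, 10, 11, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1);
    const __m128i hi_mask = _mm_setr_epi8(
        -1, -1, -1, -1, -1, -1, -1, -1, 2, 3, 6, 7, 10, 11, 14, 15);
    __m128i lo = _mm_shuffle_epi8(a, lo_mask);  /* pshufb: zero the high half */
    __m128i hi = _mm_shuffle_epi8(b, hi_mask);  /* pshufb: zero the low half */
    return _mm_or_si128(lo, hi);                /* por */
}
```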
Repository:
rL LLVM
https://reviews.llvm.org/D38472