[PATCH] D51263: [X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694)
Simon Pilgrim via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Aug 27 11:25:38 PDT 2018
RKSimon added a comment.
This patch is entirely in the DAG, so we have no way to recognise when we're inside a loop that the masks could be hoisted out of.
================
Comment at: test/CodeGen/X86/vector-shift-shl-128.ll:673
+; X32-SSE-NEXT: movd %xmm1, %eax
+; X32-SSE-NEXT: movzbl %al, %eax
+; X32-SSE-NEXT: movd %eax, %xmm1
----------------
pcordes wrote:
> `movzbl` within the same register defeats mov-elimination on Intel. Prefer `movzbl %al, %edx` or any destination register other than `%eax`. (`movzwl` never benefits from mov-elimination; the same-register limitation also applies to `mov %eax,%eax` used to zero-extend into RAX.)
>
> Other options:
>
> * vector qword bit-shift left / right
> * vector byte-shift left (`pslldq $7, %xmm0`), bit-shift right (`psrlq $56, %xmm0`). This distributes the work across different execution units.
>
> Both of those are 2c latency, vs. a round-trip to integer and back, which costs more on Skylake (`movd` r,x and x,r are 2c each, up from 1c in Broadwell). It's also a big deal on Bulldozer-family, where a `movd` round-trip to integer regs is about 14 cycles on Piledriver (better on Steamroller).
>
> The uops for SSE2 shuffle+shift need the same ports on Intel as movd to/from integer, so we basically just save the `movzbl`, and win everywhere.
>
> Of course if we can prove that the byte is already isolated at the bottom of an XMM reg, we don't need that extra work.
>
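For reference, the `movzbl` fix above is just a change of destination register (with `%edx` as an arbitrary choice - any register other than `%eax` works, per the comment above):

  # current: same-register zero-extend, always costs a real uop
  movd    %xmm1, %eax
  movzbl  %al, %eax
  movd    %eax, %xmm1

  # suggested: different destination, eligible for mov-elimination on Intel
  movd    %xmm1, %eax
  movzbl  %al, %edx
  movd    %edx, %xmm1
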
I'll look at zero-extending with PSLLQ+PSRLQ - this is only necessary on pre-SSE41 targets (X86-SSE is only testing SSE2 cases).
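
Assuming the shift amount is already sitting in the low byte of `%xmm1` (as in the test above), the in-register zero-extend would look something along these lines:

  # qword bit-shift variant
  psllq   $56, %xmm1    # move the low byte up to bits [63:56]
  psrlq   $56, %xmm1    # logical shift back down, zeroing bits [63:8]

  # or the byte-shift variant suggested above
  pslldq  $7, %xmm1     # byte 0 -> byte 7 of the register
  psrlq   $56, %xmm1    # zero-extend into the low qword

Either form keeps the amount in the vector register and avoids the `movd` round-trip entirely.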
Repository:
rL LLVM
https://reviews.llvm.org/D51263