[PATCH] D51263: [X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694)

Peter Cordes via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Aug 27 10:40:41 PDT 2018


pcordes added a comment.

In https://reviews.llvm.org/D51263#1214314, @RKSimon wrote:

> Avoid vector-scalar-vector to zero extend the bottom v16i8 element


`pmovzxbq` is perfect for SSE4.1 and higher; well spotted.

In a loop, we might still consider `vmovdqu8 xmm0{k1}{z}, xmm0` to avoid port5 pressure, if we can hoist a mask constant.  But only if there *is* shuffle port pressure, and AVX512BW is available.  (It should be 1c latency for p015 on SKX).
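
A minimal sketch of what that could look like, assuming SKX, AVX512BW, and that the k-register setup can be hoisted out of the loop (register choices are illustrative):

  # once, outside the loop: k1 = 0x0001 selects only byte 0
  movl      $1, %eax
  kmovd     %eax, %k1

  # inside the loop: zero every byte except the low one,
  # runs on any of p015 on SKX, vs. vpmovzxbq needing the shuffle port (p5)
  vmovdqu8  %xmm0, %xmm0 {%k1} {z}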



================
Comment at: test/CodeGen/X86/vector-shift-shl-128.ll:673
+; X32-SSE-NEXT:    movd %xmm1, %eax
+; X32-SSE-NEXT:    movzbl %al, %eax
+; X32-SSE-NEXT:    movd %eax, %xmm1
----------------
`movzbl` within the same register defeats mov-elimination on Intel.  Prefer `movzbl %al, %edx` or any destination register other than `%eax`.  (`movzwl` never benefits from mov-elimination regardless of registers, and the same-register limitation also applies to `mov %eax,%eax` to zero-extend into RAX.)
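
To illustrate (a sketch; the destination register is arbitrary):

  movzbl  %al, %eax    # same register: needs an ALU uop, 1c latency
  movzbl  %al, %edx    # different register: can be handled at register rename on Intel CPUs with mov-elimination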

Other options:

* vector qword bit-shift left then right (`psllq $56, %xmm0` / `psrlq $56, %xmm0`) to zero-extend the low byte within the qword
* vector byte-shift left (`pslldq $7, %xmm0`), then bit-shift right (`psrlq $56, %xmm0`).  Distributes the work to different execution units (shuffle + shift).  Both sequences are sketched below.
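
Roughly (a sketch only, using `%xmm0` as in the bullets above, and assuming only the low 64 bits of the result are consumed as the shift count):

  # option 1: stay on the vector shift unit
  psllq   $56, %xmm0    # move the low byte to the top of each qword
  psrlq   $56, %xmm0    # bring it back down, zero-extended within the qword

  # option 2: split the work across shuffle + shift units
  pslldq  $7, %xmm0     # byte-shift: the low byte ends up in byte 7
  psrlq   $56, %xmm0    # low qword is now the zero-extended original low byte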

Both of those are 2c latency, vs. a round trip to an integer register and back costing more on Skylake (movd r,x and x,r are each 2c, up from 1c in Broadwell).  It's also a big deal on Bulldozer-family, where a movd round trip to integer regs is about 14 cycles on Piledriver (better on Steamroller).

The uops for SSE2 shuffle+shift need the same ports on Intel as movd to/from integer, so we basically just save the `movzbl`, and win everywhere.

Of course if we can prove that the byte is already isolated at the bottom of an XMM reg, we don't need that extra work.



Repository:
  rL LLVM

https://reviews.llvm.org/D51263




