[PATCH] D51263: [X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694)
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Aug 27 10:40:41 PDT 2018
pcordes added a comment.
In https://reviews.llvm.org/D51263#1214314, @RKSimon wrote:
> Avoid vector-scalar-vector to zero extend the bottom v16i8 element
pmovzxbq is perfect for SSE4.1 and higher, well spotted.
In a loop, we might still consider `vmovdqu8 xmm0{k1}{z}, xmm0` to avoid port5 pressure, if we can hoist a mask constant. But only if there *is* shuffle port pressure, and AVX512BW is available. (It should be 1c latency for p015 on SKX).
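For example (just a sketch; the mask setup is illustrative and assumes AVX512VL+BW):

  movl     $1, %eax                  # hoisted out of the loop
  kmovd    %eax, %k1                 # k1 = 0b1: keep only byte 0
  ...
  vmovdqu8 %xmm0, %xmm0 {%k1} {z}    # zero-masked move: zero-extends the low byte in place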
================
Comment at: test/CodeGen/X86/vector-shift-shl-128.ll:673
+; X32-SSE-NEXT: movd %xmm1, %eax
+; X32-SSE-NEXT: movzbl %al, %eax
+; X32-SSE-NEXT: movd %eax, %xmm1
----------------
`movzbl` within the same register defeats mov-elimination on Intel. Prefer `movzbl %al, %edx` or any destination register other than `%eax`. (`movzwl` never benefits from mov-elimination; the same-register limitation also applies to `mov %eax,%eax` to zero-extend into RAX.)
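For example (the register choice here is just illustrative), the quoted sequence could keep mov-elimination possible like this:

  movd    %xmm1, %eax
  movzbl  %al, %ecx        # dst != src, so this can be mov-eliminated on IvyBridge and later
  movd    %ecx, %xmm1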
Other options:
* vector qword bit-shift left then right (`psllq $56` / `psrlq $56`)
* vector byte-shift left (`pslldq $7, %xmm0`) then bit-shift right (`psrlq $56, %xmm0`), which distributes the work across different execution units (sketches of both below).
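Rough sketches of those two (AT&T syntax; assuming we only care about the low element, so clobbering the upper elements is fine):

  # qword bit-shift left then right
  psllq   $56, %xmm0       # low byte of each qword -> bits [63:56]
  psrlq   $56, %xmm0       # shift back down, zero-extended within the qword

  # byte-shift left, then bit-shift right
  pslldq  $7, %xmm0        # whole-register byte shift: byte 0 -> byte 7
  psrlq   $56, %xmm0       # low qword = zero-extended original low byte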
Both of those are 2c latency, vs. a longer round-trip through integer registers and back on Skylake (`movd` r,x and x,r are each 2c, up from 1c on Broadwell). It matters even more on Bulldozer-family, where a `movd` round-trip to integer regs is about 14 cycles on Piledriver (better on Steamroller).
The uops for an SSE2 shuffle+shift need the same ports on Intel as `movd` to/from integer, so we basically just save the `movzbl`, and it's a win everywhere.
Of course if we can prove that the byte is already isolated at the bottom of an XMM reg, we don't need that extra work.
Repository:
rL LLVM
https://reviews.llvm.org/D51263