[PATCH] D51263: [X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694)
Simon Pilgrim via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Aug 27 11:25:38 PDT 2018
RKSimon added a comment.
This patch is entirely in the DAG, so we have no way to recognise when we're inside a loop that the masks could be hoisted out of.
================
Comment at: test/CodeGen/X86/vector-shift-shl-128.ll:673
+; X32-SSE-NEXT: movd %xmm1, %eax
+; X32-SSE-NEXT: movzbl %al, %eax
+; X32-SSE-NEXT: movd %eax, %xmm1
----------------
pcordes wrote:
> `movzbl` within the same register defeats mov-elimination on Intel. Prefer `movzbl %al, %edx` or any destination register other than `%eax`. (`movzwl` never benefits from mov-elimination; the same-register limitation also applies to `mov %eax,%eax` used to zero-extend into RAX.)
>
> Other options:
>
> * vector qword bit-shift left / right
> * vector byte-shift left (`pslldq $7, %xmm0`), bit-shift right (`psrlq $56, %xmm0`). This distributes the work across different execution units.
>
> Both of those are 2c latency, vs. a round-trip to integer and back, which costs more on Skylake (`movd` r,x and x,r are 2c each, up from 1c in Broadwell). It's also a big deal on Bulldozer-family, where a `movd` round-trip to integer regs is about 14 cycles on Piledriver (better on Steamroller).
>
> The uops for SSE2 shuffle+shift need the same ports on Intel as movd to/from integer, so we basically just save the `movzbl`, and win everywhere.
>
> Of course if we can prove that the byte is already isolated at the bottom of an XMM reg, we don't need that extra work.
>
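For reference, the `movzbl` fix above is just a change of destination register (with `%edx` as an arbitrary choice - any register other than `%eax` works, per the comment above):

  # current: same-register zero-extend, always costs a real uop
  movd    %xmm1, %eax
  movzbl  %al, %eax
  movd    %eax, %xmm1

  # suggested: different destination, eligible for mov-elimination on Intel
  movd    %xmm1, %eax
  movzbl  %al, %edx
  movd    %edx, %xmm1
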
I'll look at zero-extending with PSLLQ+PSRLQ - this is only necessary on pre-SSE41 targets (X86-SSE is only testing SSE2 cases).
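
Assuming the shift amount is already sitting in the low byte of `%xmm1` (as in the test above), the in-register zero-extend would look something along these lines:

  # qword bit-shift variant
  psllq   $56, %xmm1    # move the low byte up to bits [63:56]
  psrlq   $56, %xmm1    # logical shift back down, zeroing bits [63:8]

  # or the byte-shift variant suggested above
  pslldq  $7, %xmm1     # byte 0 -> byte 7 of the register
  psrlq   $56, %xmm1    # zero-extend into the low qword

Either form keeps the amount in the vector register and avoids the `movd` round-trip entirely.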
Repository:
rL LLVM
https://reviews.llvm.org/D51263