[PATCH] D51263: [X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694)
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sun Aug 26 16:30:54 PDT 2018
pcordes added a comment.
Looks like an improvement, but I haven't fully looked at it yet.
================
Comment at: test/CodeGen/X86/vector-rotate-512.ll:474
+; AVX512BW-NEXT: vpextrb $0, %xmm1, %eax
+; AVX512BW-NEXT: vmovd %eax, %xmm1
+; AVX512BW-NEXT: vpsllw %xmm1, %zmm0, %zmm3
----------------
This is a pretty questionable way to isolate the low byte of a vector. A vector AND would be 1 uop but needs a constant.
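Something like this would do it (sketch only; the constant-pool label is made up, and the mask just needs `0xFF` in byte 0 with zeros elsewhere):
```
vpand .LCPI0_0(%rip), %xmm1, %xmm1    # keep only the low byte of the count
```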
A `vmovdqu8` with zero-masking would also work, at the cost of a mov-immediate + `kmov` to create a mask register. (But all of that overhead is off the critical path from rotate count to final result.)
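Roughly this (sketch only, not from the patch; using the ZMM form so it only needs AVX512BW, while the XMM form would also need AVX512VL):
```
movl     $1, %eax                  # mask with only bit 0 set
kmovd    %eax, %k1
vmovdqu8 %zmm1, %zmm1 {%k1} {z}    # zero-masking: keep byte 0 of the count, zero the rest
```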
A left/right vector shift pair (immediate) would also work: 2 uops / 2c latency on the critical path, instead of 3 uops + 5c latency (3+2) for `pextrb` + `movd`. More shifts aren't great when we're already using a lot of shift uops, though. For example:
```
vpsllq $56, %xmm1, %xmm1
vpsrlq $56, %xmm1, %xmm1
```
If we are extracting to scalar (or coming from there), it's definitely worth considering BMI2 scalar shifts, which are a single uop and mask the shift count, unlike vector shifts, which saturate.
So `vmovd %xmm2, %ecx` / `shrx %ecx, %eax, %eax` or something is only 2 uops. (And AVX512BW can vpbroadcastb from a GP register, but that probably costs 2 uops. Agner Fog doesn't seem to have tested `VPBROADCASTB zmm1 {k1}{z}, reg` for his SKX instruction tables.)
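Something along these lines (rough sketch; register choices are placeholders, not taken from the patch):
```
vmovd        %xmm2, %ecx          # move the shift count to a GPR
shrx         %ecx, %eax, %eax     # BMI2: 1 uop, count is masked rather than saturated
vpbroadcastb %eax, %zmm1          # AVX512BW GPR->byte broadcast (probably 2 uops)
```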
Repository:
rL LLVM
https://reviews.llvm.org/D51263