[PATCH] D51263: [X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694)
Simon Pilgrim via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Aug 27 02:18:36 PDT 2018
RKSimon added inline comments.
================
Comment at: test/CodeGen/X86/vector-rotate-512.ll:474
+; AVX512BW-NEXT: vpextrb $0, %xmm1, %eax
+; AVX512BW-NEXT: vmovd %eax, %xmm1
+; AVX512BW-NEXT: vpsllw %xmm1, %zmm0, %zmm3
----------------
pcordes wrote:
> This is a pretty questionable way to isolate the low byte of a vector. A vector AND would be 1 uop but needs a constant.
>
> A `vmovdqu8` with zero-masking would also work, costing a mov-immediate + kmov to create a mask register. (But all of that overhead is off the critical path from rotate count to final result).
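>
> A sketch of that zero-masking form, assuming AVX512BW+VL (a mask of 1 keeps only byte 0):
>
> ```
> movl     $1, %eax
> kmovd    %eax, %k1
> vmovdqu8 %xmm1, %xmm1 {%k1}{z}
> ```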
>
> A left/right vector shift (immediate) would also work, 2 uops / 2c latency on the critical path (instead of 3 uops + 5c latency (3+2) for pextrb + movd). More shifts aren't great when we're already using a lot of shift uops, though.
>
> ```
> vpsllq $56, %xmm1, %xmm0
> vpsrlq $56, %xmm0, %xmm0
> ```
>
> If we are extracting to scalar (or coming from there), it's definitely worth considering BMI2 scalar shifts, which are a single uop and mask the shift count, unlike vector shifts, which saturate.
>
> So `vmovd %xmm2, %ecx` / `shrx %ecx, %eax, %eax` or something is only 2 uops. (And AVX512BW can vpbroadcastb from a GP register, but that probably costs 2 uops. Agner Fog doesn't seem to have tested `VPBROADCASTB zmm1 {k1}{z}, reg` for his SKX instruction tables.)
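>
> A minimal sketch of that scalar round-trip (register choices are illustrative):
>
> ```
> vmovd %xmm2, %ecx         # move the count to a GPR
> shrx  %ecx, %eax, %eax    # BMI2: count is masked, not saturated
> ```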
Come to think of it, I should be able to use PMOVZXBQ - I'll just need to make getTargetVShiftNode aware of v16i8 (it ignores this case at the moment as that type isn't a supported shift type).
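For reference, a minimal sketch of that PMOVZXBQ form (register choices are illustrative): zero-extending the low byte to a qword yields exactly the count format the shift-by-xmm instructions consume.

```
vpmovzxbq %xmm1, %xmm1          # zero-extend byte 0 into a clean 64-bit count
vpsllw    %xmm1, %zmm0, %zmm3   # the shift reads only the low 64 bits of %xmm1
```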
Repository:
rL LLVM
https://reviews.llvm.org/D51263