[PATCH] D51263: [X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694)
Simon Pilgrim via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Aug 27 02:18:36 PDT 2018
RKSimon added inline comments.
================
Comment at: test/CodeGen/X86/vector-rotate-512.ll:474
+; AVX512BW-NEXT: vpextrb $0, %xmm1, %eax
+; AVX512BW-NEXT: vmovd %eax, %xmm1
+; AVX512BW-NEXT: vpsllw %xmm1, %zmm0, %zmm3
----------------
pcordes wrote:
> This is a pretty questionable way to isolate the low byte of a vector. A vector AND would be 1 uop but needs a constant.
>
> A `vmovdqu8` with zero-masking would also work, costing a mov-immediate + kmov to create a mask register. (But all of that overhead is off the critical path from rotate count to final result).
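>
> A sketch of that zero-masking form, assuming AVX512BW+VL (a mask of 1 keeps only byte 0):
>
> ```
> movl     $1, %eax
> kmovd    %eax, %k1
> vmovdqu8 %xmm1, %xmm1 {%k1}{z}
> ```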
>
> A left/right vector shift (immediate) would also work, 2 uops / 2c latency on the critical path (instead of 3 uops + 5c latency (3+2) for pextrb + movd). More shifts aren't great when we're already using a lot of shift uops, though.
>
> ```
> vpsllq $56, %xmm1, %xmm0
> vpsrlq $56, %xmm0, %xmm0
> ```
>
> If we are extracting to scalar (or coming from there), it's definitely worth considering BMI2 scalar shifts, which are a single uop and mask the shift count, unlike vector shifts, which saturate.
>
> So `vmovd %xmm2, %ecx` / `shrx %ecx, %eax, %eax` or something is only 2 uops. (And AVX512BW can vpbroadcastb from a GP register, but that probably costs 2 uops. Agner Fog doesn't seem to have tested `VPBROADCASTB zmm1 {k1}{z}, reg` for his SKX instruction tables.)
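>
> A minimal sketch of that scalar round-trip (register choices are illustrative):
>
> ```
> vmovd %xmm2, %ecx         # move the count to a GPR
> shrx  %ecx, %eax, %eax    # BMI2: count is masked, not saturated
> ```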
Come to think of it, I should be able to use PMOVZXBQ - I'll just need to make getTargetVShiftNode aware of v16i8 (it ignores this case at the moment as that type isn't a supported shift type).
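For reference, a minimal sketch of that PMOVZXBQ form (register choices are illustrative): zero-extending the low byte to a qword yields exactly the count format the shift-by-xmm instructions consume.

```
vpmovzxbq %xmm1, %xmm1          # zero-extend byte 0 into a clean 64-bit count
vpsllw    %xmm1, %zmm0, %zmm3   # the shift reads only the low 64 bits of %xmm1
```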
Repository:
rL LLVM
https://reviews.llvm.org/D51263