[PATCH] D51263: [X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694)
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sun Aug 26 16:30:54 PDT 2018
pcordes added a comment.
Looks like an improvement, but I haven't fully looked at it yet.
================
Comment at: test/CodeGen/X86/vector-rotate-512.ll:474
+; AVX512BW-NEXT: vpextrb $0, %xmm1, %eax
+; AVX512BW-NEXT: vmovd %eax, %xmm1
+; AVX512BW-NEXT: vpsllw %xmm1, %zmm0, %zmm3
----------------
This is a pretty questionable way to isolate the low byte of a vector. A vector AND would be 1 uop but needs a constant.
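Something like this would do it (sketch only; the constant-pool label is made up, and the mask just needs `0xFF` in byte 0 with zeros elsewhere):
```
vpand .LCPI0_0(%rip), %xmm1, %xmm1    # keep only the low byte of the count
```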
A `vmovdqu8` with zero-masking would also work, at the cost of a mov-immediate + `kmov` to create a mask register. (But all of that overhead is off the critical path from rotate count to final result.)
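Roughly this (sketch only, not from the patch; using the ZMM form so it only needs AVX512BW, while the XMM form would also need AVX512VL):
```
movl     $1, %eax                  # mask with only bit 0 set
kmovd    %eax, %k1
vmovdqu8 %zmm1, %zmm1 {%k1} {z}    # zero-masking: keep byte 0 of the count, zero the rest
```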
A left/right vector shift pair (immediate) would also work: 2 uops / 2c latency on the critical path, instead of 3 uops + 5c latency (3+2) for `pextrb` + `movd`. More shifts aren't great when we're already using a lot of shift uops, though. For example:
```
vpsllq $56, %xmm1, %xmm1
vpsrlq $56, %xmm1, %xmm1
```
If we are extracting to scalar (or coming from there), it's definitely worth considering BMI2 scalar shifts, which are a single uop and mask the shift count, unlike vector shifts, which saturate.
So `vmovd %xmm2, %ecx` / `shrx %ecx, %eax, %eax` or something is only 2 uops. (And AVX512BW can vpbroadcastb from a GP register, but that probably costs 2 uops. Agner Fog doesn't seem to have tested `VPBROADCASTB zmm1 {k1}{z}, reg` for his SKX instruction tables.)
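Something along these lines (rough sketch; register choices are placeholders, not taken from the patch):
```
vmovd        %xmm2, %ecx          # move the shift count to a GPR
shrx         %ecx, %eax, %eax     # BMI2: 1 uop, count is masked rather than saturated
vpbroadcastb %eax, %zmm1          # AVX512BW GPR->byte broadcast (probably 2 uops)
```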
Repository:
rL LLVM
https://reviews.llvm.org/D51263