[PATCH] D50074: [X86][AVX2] Prefer VPBLENDW+VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles

Sun Aug 5 18:22:58 PDT 2018

pcordes added a comment.

With AVX512BW, we should *definitely* be using `vpblendmb zmm0{k1}, zmm1, zmm2` (merge-masking; with `{z}` the unselected bytes would be zeroed instead of taken from the other source).  According to IACA for SKX, it's single-uop, 1c latency, and runs on any vector ALU port (i.e. port 0 or port 5, since port 1 is shut down while 512-bit uops are in flight).

To get a 64-bit constant into a `k` register, we need a `movabs rcx, 0x12346...` / `kmovq k1, rcx`, or a load from memory.  But I think `k` registers aren't under a lot of pressure in most functions.
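As a rough illustration of what that 64-bit constant encodes, here is a minimal scalar C sketch.  The helper names `pack_byte_mask` and `blend_bytes_model` are made up for this example (they are not LLVM or intrinsic APIs), and the model only mirrors the per-byte selection, not the actual instruction encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack 64 per-byte selectors (nonzero = take the byte
   from src2) into the 64-bit constant that would end up in a k register. */
uint64_t pack_byte_mask(const uint8_t take_src2[64]) {
    uint64_t k = 0;
    for (size_t i = 0; i < 64; i++) {
        if (take_src2[i]) {
            k |= (uint64_t)1 << i;
        }
    }
    return k;
}

/* Hypothetical scalar model of a 64-byte mask blend:
   bit i of k picks src2[i], otherwise src1[i]. */
void blend_bytes_model(uint8_t dst[64], const uint8_t src1[64],
                       const uint8_t src2[64], uint64_t k) {
    for (size_t i = 0; i < 64; i++) {
        dst[i] = ((k >> i) & 1u) ? src2[i] : src1[i];
    }
}
```

The value returned by `pack_byte_mask` is what the `movabs` / `kmovq` pair above would materialize.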

Loading a `k` register from memory costs 2 uops according to IACA, one of them being a micro-fused load+ALU.  That sounds weird; IDK why you'd need an ALU uop other than the integer->k port-5-only uop.  It might be correct, though; k-register store+load has 0.67c or 0.5c throughput (http://users.atw.hu/instlatx64/GenuineIntel0050654_SkylakeX_InstLatX64.txt).  Anyway, I think probably a mov-immediate is a good choice even for 64-bit integers.

----

**With only AVX512F, we can do bit/byte blends using `vpternlogd`, using a vector control mask** (in a zmm reg, not a k reg).  Given the right truth table, one source can select the corresponding bit from either of the other two operands, so we can replace one of the inputs or replace the selector.
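To make the truth-table idea concrete, here is a scalar C sketch of one 32-bit lane.  The function names are hypothetical, and the model assumes the usual `vpternlogd` indexing (destination bit is index bit 2, second source is bit 1, third source is bit 0).  It builds the imm8 for the two cases mentioned: the destination acting as the selector, and the destination being one of the blended inputs.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of vpternlogd.  For each bit position the
   three input bits form an index (a = bit 2, b = bit 1, c = bit 0) into imm8. */
uint32_t ternlog32_model(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8) {
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned idx = (((a >> i) & 1u) << 2) | (((b >> i) & 1u) << 1)
                     | ((c >> i) & 1u);
        r |= (uint32_t)((imm8 >> idx) & 1u) << i;
    }
    return r;
}

/* imm8 for "a ? b : c": the destination is the selector and gets replaced. */
uint8_t imm8_a_selects_b_or_c(void) {
    uint8_t imm = 0;
    for (unsigned idx = 0; idx < 8; idx++) {
        unsigned a = (idx >> 2) & 1u, b = (idx >> 1) & 1u, c = idx & 1u;
        imm |= (uint8_t)((a ? b : c) << idx);
    }
    return imm;
}

/* imm8 for "c ? b : a": the third source is the selector, and the
   destination is one of the blended inputs. */
uint8_t imm8_c_selects_b_or_a(void) {
    uint8_t imm = 0;
    for (unsigned idx = 0; idx < 8; idx++) {
        unsigned a = (idx >> 2) & 1u, b = (idx >> 1) & 1u, c = idx & 1u;
        imm |= (uint8_t)((c ? b : a) << idx);
    }
    return imm;
}
```

Under those assumptions the two builders produce 0xCA and 0xD8 respectively, so the choice of which operand gets overwritten only changes the immediate, not the instruction count.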

`vpternlogd` is single-uop on AVX512 CPUs, including KNL.

This could be an interesting option for byte blends of 256-bit vectors when used with AVX2 compare results (which put the result in a vector instead of a mask reg), e.g. when building manually-vectorized code that uses 256-bit vectors with `-march=knl`, where we have AVX512F but not BW.  (And not VL, so we'd actually have to use a ZMM instruction.  That's fine on KNL, but very bad on SKX if no other 512-bit instructions were in flight.  We'd like to avoid `-mtune=generic -mavx512f` being a pitfall of nasty code-gen compared to `-march=skylake-avx512`.)

----

I don't think AVX512 has any immediate blends.  Even `vblendpd` doesn't have an EVEX encoding; it's VEX-only, and the 256-bit form uses just the low 4 bits of the imm8.  If AVX512 does have immediate blends, they don't have `blend` or `select` in the mnemonic or short description.

I guess you're meant to use `k` registers, even though it's a 2-step, 2-uop process to get an immediate into a `k` reg.  (But one of those uops, the mov-immediate, can run on any integer ALU port, including port 6.)  Both of those extra uops are off the critical path from vector inputs to vector output, unlike with multi-uop `vpblendvb`.

Of course, VEX `vpblendd` is still excellent, and should be used on 256-bit vectors whenever possible.  e.g. for `_mm256_mask_blend_epi32` with a compile-time constant mask, if register allocation has the operands in the low 16 registers.
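For the v16i16 case in this patch, a word blend can only become a single `vpblendd` when both 16-bit halves of every dword come from the same source.  The sketch below shows that check in scalar C; `word_mask_to_dword_imm8` is a hypothetical helper for illustration, not the code in the patch, and it assumes bit i of the mask means "element i comes from the second source".

```c
#include <stdbool.h>
#include <stdint.h>

/* word_mask: bit i set means 16-bit element i (0..15) comes from the second
   source.  Returns true and writes the 8-bit dword blend mask if the two word
   bits of every dword agree; returns false if any dword mixes both sources. */
bool word_mask_to_dword_imm8(uint16_t word_mask, uint8_t *imm8_out) {
    uint8_t imm = 0;
    for (int d = 0; d < 8; d++) {
        unsigned pair = (word_mask >> (2 * d)) & 3u;
        if (pair == 3u) {
            imm |= (uint8_t)(1u << d);   /* whole dword from second source */
        } else if (pair != 0u) {
            return false;                /* dword split between sources */
        }
    }
    *imm8_out = imm;
    return true;
}
```

When the check fails, the blend needs some other strategy, such as the multi-instruction sequences discussed in this review.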

(Fun fact: using only ymm16..31 avoids the need for vzeroupper, because their low lanes aren't accessible with legacy SSE instructions.  But missing out on VEX instructions / short-encodings when doing 256-bit vectorization with AVX512 available is a downside to that.)

Repository:
  rL LLVM

https://reviews.llvm.org/D50074