[PATCH] D50074: [X86][AVX2] Prefer VPBLENDW+VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles

Sun Aug 5 17:38:46 PDT 2018

pcordes added a comment.

Nice idea to chain `vpblendw` + `vpblendd`: both are single-uop instructions on AVX2 CPUs, and `vpblendd` can run on any vector ALU port.
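To make the chaining concrete, here is a scalar C model of how I understand the general 3-instruction case for a v16i16 blend. It is a sketch, not the code from the patch: the type `v16i16` and the `model_*` / `blend_*` helpers are names made up for illustration. It relies on the architectural behaviour that `vpblendw ymm` applies its 8-bit immediate to both 128-bit lanes, while `vpblendd ymm` has one immediate bit per dword.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for a 256-bit register holding 16 x i16 (hypothetical type). */
typedef struct { uint16_t w[16]; } v16i16;

/* Reference blend: bit i of mask set => take word i from b, else from a. */
static v16i16 blend_ref(v16i16 a, v16i16 b, uint16_t mask) {
    v16i16 r;
    for (int i = 0; i < 16; i++)
        r.w[i] = ((mask >> i) & 1) ? b.w[i] : a.w[i];
    return r;
}

/* Model of VPBLENDW ymm: the same 8-bit immediate is used for each 128-bit lane. */
static v16i16 model_vpblendw(v16i16 a, v16i16 b, uint8_t imm) {
    v16i16 r;
    for (int i = 0; i < 16; i++)
        r.w[i] = ((imm >> (i % 8)) & 1) ? b.w[i] : a.w[i];
    return r;
}

/* Model of VPBLENDD ymm: one immediate bit per 32-bit element (pair of words). */
static v16i16 model_vpblendd(v16i16 a, v16i16 b, uint8_t imm) {
    v16i16 r;
    for (int i = 0; i < 16; i++)
        r.w[i] = ((imm >> (i / 2)) & 1) ? b.w[i] : a.w[i];
    return r;
}

/* General case: one VPBLENDW per lane's mask, then VPBLENDD picks
 * the low lane from the first result and the high lane from the second. */
static v16i16 blend_3insn(v16i16 a, v16i16 b, uint16_t mask) {
    v16i16 lo = model_vpblendw(a, b, (uint8_t)(mask & 0xFF));
    v16i16 hi = model_vpblendw(a, b, (uint8_t)(mask >> 8));
    return model_vpblendd(lo, hi, 0xF0);
}

/* Returns 1 if the 3-instruction sequence matches the reference for every mask. */
static int check_all_masks(void) {
    v16i16 a, b;
    for (int i = 0; i < 16; i++) {
        a.w[i] = (uint16_t)(0x1000 + i);
        b.w[i] = (uint16_t)(0x2000 + i);
    }
    for (uint32_t m = 0; m <= 0xFFFF; m++) {
        v16i16 r = blend_ref(a, b, (uint16_t)m);
        v16i16 s = blend_3insn(a, b, (uint16_t)m);
        if (memcmp(&r, &s, sizeof r) != 0)
            return 0;
    }
    return 1;
}
```

When the low and high bytes of the mask are equal, a single `vpblendw` already suffices, and when every pair of adjacent mask bits matches, a single `vpblendd` does.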

Skylake: `vpblendvb` is 2 uops for any of p015, `vpblendw` is 1 uop for p5.  If you can hoist the vector constant, `vpblendvb` costs the same number of uops as 2 immediate blends, and fewer than 3.

Haswell's `vpblendvb` is 2 uops for p5 only, so it and shuffles can easily bottleneck on port 5.  `vpblendw` is also port 5 only.  `vpblendw`+`vpblendd` is better, but depending on port pressure, 2x `vpblendw`+`vpblendd` is worse (again assuming you can hoist the vector constant).
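For reference, the uop counts quoted above can be tabulated with a few lines of C. This is only a bookkeeping sketch using the figures from this comment (mask assumed hoisted, so no load uop); the `blend_cost` struct, the `uarch` enum and the function names are invented for illustration and are not a real scheduling model.

```c
#include <assert.h>

/* Hypothetical cost record: total uops, and how many can only run on port 5. */
typedef struct { int total_uops; int p5_only_uops; } blend_cost;

enum uarch { HASWELL, SKYLAKE };

/* vpblendvb ymm with the mask already in a register:
 * 2 uops, port-5-only on Haswell, any of p015 on Skylake. */
static blend_cost cost_vpblendvb(enum uarch u) {
    blend_cost c = { 2, (u == HASWELL) ? 2 : 0 };
    return c;
}

/* Immediate blends: each vpblendw is 1 uop on p5 only,
 * each vpblendd is 1 uop that is not restricted to p5. */
static blend_cost cost_imm_blends(int n_vpblendw, int n_vpblendd) {
    blend_cost c = { n_vpblendw + n_vpblendd, n_vpblendw };
    return c;
}
```

With these numbers, `cost_imm_blends(1, 1)` is 2 uops with 1 on p5, and `cost_imm_blends(2, 1)` is 3 uops with 2 on p5, against 2 uops for `cost_vpblendvb` on either uarch.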

Agner Fog strangely doesn't have numbers for `pblendvb` on Piledriver or Ryzen, not even the SSE4 version.  http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt says `vpblendvb xmm` is single-cycle latency on Ryzen, but `vpblendvb ymm` is 2c latency.  (With throughput = latency, so I guess only one port.)  `vpblendw ymm` is single-cycle latency, with 0.67c throughput, so I guess it's 1 uop per lane, on 3 ports.  So Ryzen's `vpblendw` is better than Intel's at avoiding port bottlenecks.  But `vpblendvb` is also only 1 uop per lane, so it's definitely efficient when we can hoist the mask out of the loop and register pressure is low enough that the mask is a good thing to spend a register on.

----

My instinct here is that for Intel tunings and probably also generic, we should replace `vpblendvb` with up to 2 uops of `vpblendw` + `vpblendd`, but not 3.

If we can analyze the situation and figure out that `vpblendvb` will definitely have to reload the mask every time, we should consider replacing it even if it takes 3 immediate blends.  Ideally we can check the loop for port 5 pressure.
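Spelled out as code, the rule of thumb I'm suggesting looks roughly like this. It is a sketch of the heuristic only, not LLVM code: `should_replace_vpblendvb` and its parameters are hypothetical, and it leaves out the port 5 pressure check entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* n_imm_blends: how many vpblendw/vpblendd instructions the replacement needs.
 * mask_reloaded_each_time: true if analysis shows the vpblendvb mask cannot be
 * hoisted and would be reloaded on every use.
 * Both inputs are assumed to come from some earlier analysis (hypothetical). */
static bool should_replace_vpblendvb(int n_imm_blends, bool mask_reloaded_each_time) {
    assert(n_imm_blends >= 1);

    /* Up to 2 immediate blends: no more uops than vpblendvb, so replace. */
    if (n_imm_blends <= 2)
        return true;

    /* 3 immediate blends: only worth considering when the mask load
     * cannot be amortized across iterations. */
    return n_imm_blends == 3 && mask_reloaded_each_time;
}
```

A real implementation would presumably also want a per-subtarget tuning flag, since the trade-off is different on Ryzen.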

Stuff like this makes clang hard to use when hand-tuning a loop, though.  I know I'd be very annoyed if I were using a `vpblendvb` intrinsic, and clang replaced it with 2x `vpblendw` + `vpblendd`, creating a port 5 bottleneck on Skylake while also costing more uops.  So we should be very cautious about 3-instruction replacements.

Replacing it with 2 uops can obviously be harmful too in some cases, because `vpblendw` only runs on port 5.  It would be great if there was an option that asked clang to use instructions more closely matching the intrinsics for hand-tuned loops, but we can always write asm by hand to tune for a specific uarch.

Repository:
  rL LLVM

https://reviews.llvm.org/D50074