[PATCH] D50074: [X86][AVX2] Prefer VPBLENDW+VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sun Aug 5 17:38:46 PDT 2018
pcordes added a comment.
Nice idea to chain `vpblendw` + `vpblendd`: both are single-uop instructions on Intel AVX2 CPUs, and `vpblendd` can run on any vector ALU port.
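To make the chaining concrete, here is a minimal Python model of the 3-instruction sequence named in the patch title. It is only a sketch of my reading of the idea, not code from the patch: the helper names are made up, and vectors are plain lists of 16 words. It relies on two facts about the ymm forms: `vpblendw` applies the same imm8 to both 128-bit lanes, and `vpblendd` has one immediate bit per dword.

```python
def vpblendw_ymm(a, b, imm8):
    # Bit (i % 8) of imm8 selects b for word i; the same imm8 is
    # reused for the low and high 128-bit lanes.
    return [b[i] if (imm8 >> (i % 8)) & 1 else a[i] for i in range(16)]

def vpblendd_ymm(a, b, imm8):
    # Bit j of imm8 selects b for dword j, i.e. words 2j and 2j+1.
    return [b[i] if (imm8 >> (i // 2)) & 1 else a[i] for i in range(16)]

def blend_v16i16(a, b, mask16):
    # Reference result (what vpblendvb with a suitable byte mask would give):
    # bit i of mask16 selects b for word i.
    return [b[i] if (mask16 >> i) & 1 else a[i] for i in range(16)]

def three_blend(a, b, mask16):
    # Hypothetical helper: vpblendw + vpblendw + vpblendd.
    lo = mask16 & 0xFF          # per-word selection for the low lane
    hi = (mask16 >> 8) & 0xFF   # per-word selection for the high lane
    t_lo = vpblendw_ymm(a, b, lo)  # low lane is correct, high lane is not
    t_hi = vpblendw_ymm(a, b, hi)  # high lane is correct, low lane is not
    # Take dwords 0-3 (low lane) from t_lo and dwords 4-7 (high lane) from t_hi.
    return vpblendd_ymm(t_lo, t_hi, 0xF0)
```

When the two lane masks are identical, a single `vpblendw` is enough, and when every dword takes both of its words from the same source, a single `vpblendd` is enough; the model above only covers the general case.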
Skylake: `vpblendvb` is 2 uops for any of p015, `vpblendw` is 1 uop for p5. If you can hoist the vector constant, `vpblendvb` costs the same number of uops as 2 immediate blends, and fewer than 3.
Haswell's `vpblendvb` is 2 uops for p5 only, so it and shuffles can easily bottleneck on port 5. `vpblendw` is also port 5 only. `vpblendw`+`vpblendd` is better, but depending on port pressure, 2x `vpblendw`+`vpblendd` can be worse (again assuming you can hoist the vector constant).
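The arithmetic behind that comparison can be tallied in a few lines of Python. This is just a sketch using the Skylake and Haswell figures quoted above; the table layout and the `cost` helper are made up for illustration, and it assumes the `vpblendvb` mask is already in a register, so no load is counted.

```python
# (total uops, uops that can only run on port 5), per instruction.
# Figures are the ones quoted in this comment, not measured here.
UOPS = {
    "skylake": {"vpblendvb": (2, 0), "vpblendw": (1, 1), "vpblendd": (1, 0)},
    "haswell": {"vpblendvb": (2, 2), "vpblendw": (1, 1), "vpblendd": (1, 0)},
}

def cost(uarch, seq):
    # Sum total uops and port-5-only uops over an instruction sequence.
    total = sum(UOPS[uarch][insn][0] for insn in seq)
    p5_only = sum(UOPS[uarch][insn][1] for insn in seq)
    return total, p5_only

for uarch in ("skylake", "haswell"):
    for seq in (["vpblendvb"],
                ["vpblendw", "vpblendd"],
                ["vpblendw", "vpblendw", "vpblendd"]):
        print(uarch, "+".join(seq), cost(uarch, seq))
```

On Haswell the 2-instruction form halves the port 5 load and the 3-instruction form only matches it at the cost of an extra uop; on Skylake every `vpblendw` adds port-5-only work that `vpblendvb` does not have.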
Agner Fog strangely doesn't have numbers for `pblendvb` on Piledriver or Ryzen, not even the SSE4 version. http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt says `vpblendvb xmm` is single-cycle latency on Ryzen, but `vpblendvb ymm` is 2c latency. (With throughput = latency, so I guess only one port.) `vpblendw ymm` is single-cycle latency, with 0.67c throughput, so I guess it's 1 uop per lane, on 3 ports. So Ryzen's `vpblendw` is better than Intel's for avoiding port bottlenecks. But `vpblendvb` is also only 1 uop per lane, so it's definitely efficient when we can hoist the mask out of the loop and register pressure is low enough that the mask is a good thing to spend a register on.
----
My instinct here is that for Intel tunings and probably also generic, we should replace `vpblendvb` with up to 2 uops of `vpblendw` + `vpblendd`, but not 3.
If we can analyze the situation and figure out that `vpblendvb` will definitely have to reload the mask every time, we should consider replacing it even if it takes 3 immediate blends. Ideally we can check the loop for port 5 pressure.
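As a rough sketch of that rule of thumb (not the patch's actual cost logic; the function and parameter names are hypothetical), the decision could look like this in Python:

```python
def prefer_immediate_blends(num_imm_blends, mask_hoistable):
    # num_imm_blends: how many vpblendw/vpblendd instructions the
    #   replacement sequence needs (1, 2 or 3).
    # mask_hoistable: True if the vpblendvb mask can stay in a register
    #   across iterations instead of being reloaded every time.
    if num_imm_blends <= 2:
        # No more uops than vpblendvb on Intel: replace.
        return True
    if num_imm_blends == 3 and not mask_hoistable:
        # vpblendvb would have to reload its mask each time, so the
        # extra uop is worth considering.
        return True
    # 3 blends against a hoisted mask: keep vpblendvb.
    return False
```

This leaves out port 5 pressure entirely, which is the part that would actually need analysis of the surrounding loop.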
Stuff like this makes clang hard to use when hand-tuning a loop, though. I know I'd be very annoyed if I was using a `vpblendvb` intrinsic, and clang replaced it with 2x `vpblendw` + `vpblendd` and created a port 5 bottleneck on Skylake, plus costing more uops. So we should be very cautious about 3-instruction replacements.
Replacing it with 2 uops can obviously be harmful too in some cases, because `vpblendw` only runs on port 5. It would be great if there was an option that asked clang to use instructions more closely matching the intrinsics for hand-tuned loops, but we can always write asm by hand to tune for a specific uarch.
Repository:
rL LLVM
https://reviews.llvm.org/D50074