[PATCH] D50074: [X86][AVX2] Prefer VPBLENDW+VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Aug 7 17:40:56 PDT 2018
pcordes added a comment.
In https://reviews.llvm.org/D50074#1189210, @RKSimon wrote:
> Cheers Peter, I'm going to look at adding combining shuffles to VPBLENDVB/VPBLENDMB in the target shuffle combiner.
Don't forget that a 32-bit mask is cheaper to create with a `mov r32, imm32`, so look for the chance to use `vpblendmw`.
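To illustrate why the k-register form is cheaper to materialize, here is a rough Python sketch of the two encodings of the same v16i16 blend selector. The helper names and the example `select` pattern are hypothetical, made up for illustration; this only models the constants, not the instructions themselves.

```python
# Sketch: two encodings of one v16i16 blend selector.
# select[i] is True when 16-bit element i should come from the second source.

def kmask_for_vpblendmw(select):
    """Pack one bit per 16-bit element: the integer a `mov r32, imm32` would hold."""
    mask = 0
    for i, take_b in enumerate(select):
        if take_b:
            mask |= 1 << i
    return mask

def vector_mask_for_vpblendvb(select):
    """Expand to one mask byte per data byte; vpblendvb tests each byte's top bit."""
    out = bytearray()
    for take_b in select:
        out += b"\xff\xff" if take_b else b"\x00\x00"
    return bytes(out)

# Hypothetical pattern: odd-numbered words come from the second source.
select = [i % 2 == 1 for i in range(16)]
print(hex(kmask_for_vpblendmw(select)))        # 0xaaaa
print(len(vector_mask_for_vpblendvb(select)))  # 32
```

The integer fits in an immediate, whereas the 32-byte vector has to come from memory (or be built with extra instructions).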
> We already have a 'variable mask' threshold mechanism that allows recent Intel CPUs to merge >2 shuffles to a single variable mask shuffle so the 2*VPBLENDW+VPBLENDD regression case can be avoided on those targets (see the 'SLOW' vs 'FAST' codegen checks above).
From that code-gen, I hope KNL is the only CPU in the "AVX2-SLOW" category; its `vpblendw/d` are efficient but `vpblendvb` is 4 uops (and thus a front-end bottleneck). But KNL's `vpshuflw/hw ymm` are horrible too, 8c throughput vs. 12c for `vpshufb ymm`, so I'm not convinced that multiple `vpshuflw/hw` + combine is the way to go vs. just using `vpshufb ymm`. Given the way its front-end works, one huge instruction that gets a lot of uops from microcode ROM in one go is probably better than multiple multi-uop instructions that stall the decoders multiple times. (But this is just based on Agner Fog's guide, not any real testing. Still, the per-instruction throughput numbers can be misleading because nearly every multi-uop instruction's throughput is based on the resulting front-end bottleneck. IDK if microcode can be read fast enough to fill that bubble for later insns...)
If KNL can load a mask for `vpternlogd`, that's probably your best bet for efficient byte blends if AVX512F isn't disabled. But maybe not a high priority to implement because AVX2 byte-manipulation code is generally going to suck on KNL anyway.
----
On anything other than KNL that supports AVX2, it comes down to whether the blend mask can be hoisted out of a loop.
Haswell is almost always better off with 1 `vpblendvb` (2p5) than 3 separate instructions (2p5 + p015), when we have the mask in a reg already.
Ryzen is much better off with `vpblendvb ymm` (2 uops) than 6 uops.
(IDK about Excavator).
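For reference, the 3-instruction sequence being weighed against a single `vpblendvb` can be modelled roughly like this. It is a Python sketch of the instruction semantics only (not the LLVM lowering code), and the function names are hypothetical. It relies on `vpblendw ymm` applying the same imm8 to both 128-bit lanes, so a general v16i16 blend needs one `vpblendw` per lane plus a `vpblendd` to stitch the lanes together.

```python
def vpblendw(a, b, imm8):
    """Model vpblendw ymm: imm8 bit (i % 8) selects word i, same in both lanes."""
    return [b[i] if (imm8 >> (i % 8)) & 1 else a[i] for i in range(16)]

def vpblendd(a, b, imm8):
    """Model vpblendd ymm over 16 words: imm8 bit (i // 2) selects dword i // 2."""
    return [b[i] if (imm8 >> (i // 2)) & 1 else a[i] for i in range(16)]

def blend_v16i16(a, b, select):
    """Blend 16 words using 2x vpblendw + 1x vpblendd (hypothetical helper)."""
    lo_imm = sum(1 << i for i in range(8) if select[i])       # low-lane pattern
    hi_imm = sum(1 << i for i in range(8) if select[8 + i])   # high-lane pattern
    lo = vpblendw(a, b, lo_imm)    # correct in the low 128-bit lane
    hi = vpblendw(a, b, hi_imm)    # correct in the high 128-bit lane
    return vpblendd(lo, hi, 0xF0)  # dwords 4..7 (high lane) taken from hi
```

When the low-lane and high-lane patterns happen to match, a single `vpblendw` already gives the right result and the other two instructions are unnecessary.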
Repository:
rL LLVM
https://reviews.llvm.org/D50074