[PATCH] D50074: [X86][AVX2] Prefer VPBLENDW+VPBLENDW+VPBLENDD to VPBLENDVB for v16i16 blend shuffles

Peter Cordes via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 7 17:40:56 PDT 2018


pcordes added a comment.

In https://reviews.llvm.org/D50074#1189210, @RKSimon wrote:

> Cheers Peter, I'm going to look at adding combining shuffles to VPBLENDVB/VPBLENDMB in the target shuffle combiner.


Don't forget that a 32-bit mask is cheaper to create with a `mov r32, imm32`, so look for the chance to use `vpblendmw`.
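For example, something like this (hypothetical registers and mask bits, and it assumes AVX512BW+VL for the `ymm` form) builds the whole mask from an immediate plus a `kmov`, with no constant-pool load at all:

    mov       eax, 0xF0F0                ; 16 blend bits for a v16i16 blend, as a cheap immediate
    kmovw     k1, eax                    ; move them into a mask register
    vpblendmw ymm0 {k1}, ymm1, ymm2      ; word i = k1[i] ? ymm2[i] : ymm1[i]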

> We already have a 'variable mask' threshold mechanism that allows recent Intel CPUs to merge >2 shuffles to a single variable mask shuffle so the 2*VPBLENDW+VPBLENDD regression case can be avoided on those targets (see the 'SLOW' vs 'FAST' codegen checks above).

From that code-gen, I hope KNL is the only CPU in the "AVX2-SLOW" category; its `vpblendw/d` are efficient but `vpblendvb` is 4 uops (and thus a front-end bottleneck).  But KNL's `vpshuflw/hw ymm` are horrible too, 8c throughput vs. 12c for `vpshufb ymm`, so I'm not convinced that multiple `vpshufl/hw` + combine is the way to go vs. just using `vpshufb ymm`.  Given the way its front-end works, one huge instruction that gets a lot of uops from microcode ROM in one go is probably better than multiple multi-uop instructions that stall the decoders multiple times.  (But this is just based on Agner Fog's guide, not any real testing.  Still, the per-instruction throughput numbers can be misleading because nearly every multi-uop instruction's throughput is based on the resulting front-end bottleneck.  IDK if microcode can be read fast enough to fill that bubble for later insns...)

If KNL can load a mask for `vpternlogd`, that's probably your best bet for efficient byte blends if AVX512F isn't disabled.  But maybe not a high priority to implement because AVX2 byte-manipulation code is generally going to suck on KNL anyway.
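If someone does go that way, this is roughly the shape I have in mind (hypothetical constant label and registers; KNL has no AVX512VL, so the select has to run on the full zmm and just ignore the upper 256 bits):

    vmovdqu64  zmm0, [byte_mask]         ; 64-byte constant: all-ones bytes where ymm1 should win
    vpternlogd zmm0, zmm1, zmm2, 0xCA    ; bitwise select: zmm0 = (zmm0 & zmm1) | (~zmm0 & zmm2)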

----

On anything other than KNL that supports AVX2, it comes down to whether the blend mask can be hoisted out of a loop.

Haswell (and later) is almost always better off with 1 `vpblendvb` (2p5) than with 3 separate instructions (2p5 + p015), when we have the mask in a reg already.
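Concretely, the two candidates look something like this (hypothetical mask constant and blend patterns; the uop/port notes are for Haswell):

    ; (a) one variable blend: 2 uops for p5, plus a 32-byte mask load that can be hoisted
    vmovdqa   ymm3, [blend_mask]         ; top bit of each byte selects ymm2 over ymm1
    vpblendvb ymm0, ymm1, ymm2, ymm3

    ; (b) three immediate blends: 2 uops for p5 + 1 for p015, no mask constant at all
    vpblendw  ymm4, ymm1, ymm2, 0x35     ; low-lane word pattern, applied to both 128-bit lanes
    vpblendw  ymm5, ymm1, ymm2, 0xCA     ; high-lane word pattern, also applied to both lanes
    vpblendd  ymm0, ymm4, ymm5, 0xF0     ; keep lane 0 from the first result, lane 1 from the second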

Ryzen is much better off with `vpblendvb ymm` (2 uops) than with the 3-instruction sequence (6 uops, since each 256-bit instruction splits into two 128-bit uops there).

(IDK about Excavator).


Repository:
  rL LLVM

https://reviews.llvm.org/D50074




