[PATCH] D41484: [X86][SSE] Use PMADDWD for v4i32 multiplies with 17 or more leading zeros

Sat Dec 30 18:56:39 PST 2017

pcordes added a comment.

In https://reviews.llvm.org/D41484#964762, @RKSimon wrote:

> I'm wondering whether we should try to use MADD for SSE41+ targets as well

Yes, absolutely.  Look for alternatives to PMULLD whenever possible except with `-march=sandybridge` / `-march=ivybridge`, or KNL.

PMADDWD has twice the throughput (and half the latency) of PMULLD on Haswell and Skylake.  (Skylake does have vector-integer multiply on two ports, though, so PMULLD there is 10c latency, 1c throughput.)  PMULLD also has half the throughput of PMADDWD on Core2 (4 uops) and Nehalem (2 uops).

On Jaguar, PMULLD is half-throughput, like on Haswell.  On Silvermont, PMULLD is 7 uops with 11c throughput (11x worse than PMADDWD).

On Ryzen, they're both single-uop, but PMADDWD has 3c instead of 4c latency, and 1c instead of 2c throughput.  Same thing on Bulldozer-family: 4c vs. 5c latency, and 1c vs. 2c throughput.

PMULUDQ (widening multiply of the even elements) is usually as fast as PMADDWD, but **32-bit low-half PMULLD multiply is slow on everything except Intel Sandybridge / Ivybridge, and KNL**.  The throughput penalty is at least a factor of 2 on CPUs other than those.
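
For anyone following along, here is a minimal scalar sketch in C of why the substitution in the patch title is safe when each 32-bit element has 17 or more leading zeros.  It is only a model of one lane, not the code in the patch; the helper names (`pmaddwd_lane`, `pmulld_lane`, `has_17_leading_zeros`, `check_small_operands`) are made up for illustration.  With the upper 17 bits clear, the high 16-bit half of each element is zero and the low half is non-negative as a signed 16-bit value, so the PMADDWD sum collapses to the single low-half product.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of PMADDWD: treat each operand as two
   signed 16-bit halves, multiply matching halves, and add the products.
   The sum is formed in 64 bits and truncated so the model itself has no
   signed-overflow issue. */
static uint32_t pmaddwd_lane(uint32_t a, uint32_t b)
{
    int32_t a_lo = (int16_t)(a & 0xFFFF);
    int32_t a_hi = (int16_t)(a >> 16);
    int32_t b_lo = (int16_t)(b & 0xFFFF);
    int32_t b_hi = (int16_t)(b >> 16);
    int64_t sum = (int64_t)a_lo * b_lo + (int64_t)a_hi * b_hi;
    return (uint32_t)sum;
}

/* Scalar model of one lane of PMULLD: low 32 bits of the 32x32 product. */
static uint32_t pmulld_lane(uint32_t a, uint32_t b)
{
    return a * b;
}

/* Nonzero when the upper 17 bits of x are zero, i.e. x <= 0x7FFF. */
static int has_17_leading_zeros(uint32_t x)
{
    return (x >> 15) == 0;
}

/* Spot check: the two models agree on a sample of operands that both
   satisfy the 17-leading-zeros precondition. */
static void check_small_operands(void)
{
    for (uint32_t a = 0; a <= 0x7FFF; a += 0x111)
        for (uint32_t b = 0; b <= 0x7FFF; b += 0x0F1)
            assert(pmaddwd_lane(a, b) == pmulld_lane(a, b));
}
```

If bit 15 of either operand is set, the low half is read as negative and the two results diverge, which is why 16 leading zeros would not be enough.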

Repository:
  rL LLVM

https://reviews.llvm.org/D41484