[PATCH] D41484: [X86][SSE] Use PMADDWD for v4i32 multiplies with 17 or more leading zeros
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sat Dec 30 18:56:39 PST 2017
pcordes added a comment.
In https://reviews.llvm.org/D41484#964762, @RKSimon wrote:
> I'm wondering whether we should try to use MADD for SSE41+ targets as well
Yes, absolutely. Look for alternatives to PMULLD whenever possible except with `-march=sandybridge` / ivybridge, or KNL.
PMADDWD has twice the throughput (and half the latency) of PMULLD on Haswell and Skylake. (Skylake does have vector-integer multiply on two ports, though, so PMULLD there is 10c latency, 1c throughput.) PMULLD is also half throughput on Core2 (4 uops) and Nehalem (2 uops).
On Jaguar, PMULLD is half-throughput like on Haswell. On Silvermont, PMULLD is 7 uops with one-per-11c throughput (11x worse than PMADDWD).
On Ryzen, they're both single-uop, but PMADDWD has 3c instead of 4c latency, and 1c instead of 2c throughput. Same thing on Bulldozer-family: 4c vs. 5c latency, and 1c vs. 2c throughput.
PMULUDQ (widening multiply of the even elements) is usually as fast as PMADDWD, but **32-bit low-half PMULLD multiply is slow on everything except Intel Sandybridge / Ivybridge, and KNL**. The throughput penalty is at least a factor of 2 on CPUs other than those.
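For reference, here is a rough scalar sketch of why the substitution in the patch title is safe, assuming both operands of each 32-bit lane have at least 17 leading zeros. The helper names (`sext16`, `pmaddwd_lane`, `pmulld_lane`) are made up for illustration and model a single lane only; this is not code from the patch. PMADDWD multiplies the signed 16-bit halves pairwise and adds the two products, so when the high halves are zero and the low halves are non-negative as signed 16-bit values, the sum is just the ordinary product:

```c
#include <stdint.h>

/* Hypothetical helper: sign-extend the low 16 bits of v to 32 bits. */
static int32_t sext16(uint32_t v)
{
    int32_t x = (int32_t)(v & 0xFFFFu);
    return x >= 0x8000 ? x - 0x10000 : x;
}

/* Scalar model of one 32-bit lane of PMADDWD:
 * signed 16x16 products of the low and high halves, summed,
 * with the result wrapped to 32 bits. */
static uint32_t pmaddwd_lane(uint32_t a, uint32_t b)
{
    int64_t lo = (int64_t)sext16(a) * sext16(b);
    int64_t hi = (int64_t)sext16(a >> 16) * sext16(b >> 16);
    return (uint32_t)(lo + hi);
}

/* Scalar model of one 32-bit lane of PMULLD: low 32 bits of the product. */
static uint32_t pmulld_lane(uint32_t a, uint32_t b)
{
    return a * b;
}
```

With inputs below 2^15 (17 leading zeros), such as 0x7FFF and 0x7FFF, the two models agree. With only 16 leading zeros, such as 0x8000 and 2, the low half is read as a negative 16-bit value and the results differ, which is why the threshold is 17 rather than 16.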
Repository:
rL LLVM
https://reviews.llvm.org/D41484