[PATCH] D77895: [x86] use vector instructions to lower FP->int->FP casts
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Apr 10 15:37:24 PDT 2020
pcordes added a comment.
In D77895#1975191 <https://reviews.llvm.org/D77895#1975191>, @spatel wrote:
> The problem is -0.0 rather than overflow.
Ah yes, that's a showstopper for correctness, thanks.
Is anyone working on a similar patch for double precision? Zen2 apparently has single-uop CVTPD2DQ xmm, xmm and CVTDQ2PD.
https://www.uops.info/table.html?search=cvtpd2dq&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_CON=on&cb_SNB=on&cb_HSW=on&cb_SKX=on&cb_ICL=on&cb_ZEN%2B=on&cb_ZEN2=on&cb_measurements=on&cb_base=on&cb_avx=on&cb_avx2=on&cb_sse=on
It may be break-even on other CPUs for throughput, and a win for latency.
Conroe and Nehalem have single-uop CVTTSD2SI (r32, xmm), so a scalar round trip is best there for throughput.
SnB-family CPUs use 2 uops for conversions to/from scalar integer and for PD<->DQ, but the xmm->GP-integer uop runs on port 0, so a scalar round trip distributes back-end port pressure more evenly. The best choice therefore depends on the surrounding code when tuning for Intel without caring about Zen2. In a loop that *only* does (float)(int)x, scalar has 1/clock throughput on Skylake because it avoids a port-5 bottleneck: the two scalar instructions are p0 + p01 and p5 + p01, whereas cvttpd2dq and the conversion back would both be p5 + p01.
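Concretely, the pattern being costed is just the cast round trip below (written for double to match the CVTTSD2SI discussion; the float case is analogous). The comments sketch the two candidate lowerings named above; register choices are illustrative, not what any particular compiler emits today.

```c
/* The FP -> int -> FP round trip under discussion.
 * Candidate x86 lowerings (illustrative register allocation):
 *   scalar: cvttsd2si eax, xmm0   then  cvtsi2sd xmm0, eax
 *   vector: cvttpd2dq xmm0, xmm0  then  cvtdq2pd xmm0, xmm0
 */
double roundtrip(double x) {
    return (double)(int)x;
}
```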
We can avoid a false dependency by converting back into the XMM register we came from, so a scalar round trip can avoid needing an extra instruction to xor-zero the destination. (Apparently we miss that optimization for float in the test cases without this patch.)
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D77895/new/
https://reviews.llvm.org/D77895