[PATCH] D77895: [x86] use vector instructions to lower FP->int->FP casts
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Apr 10 15:37:24 PDT 2020
pcordes added a comment.
In D77895#1975191 <https://reviews.llvm.org/D77895#1975191>, @spatel wrote:
> The problem is -0.0 rather than overflow.
Ah yes, that's a showstopper for correctness, thanks.
Is anyone working on a similar patch for double precision? Zen2 apparently has single-uop CVTPD2DQ xmm, xmm and CVTDQ2PD.
https://www.uops.info/table.html?search=cvtpd2dq&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_CON=on&cb_SNB=on&cb_HSW=on&cb_SKX=on&cb_ICL=on&cb_ZEN%2B=on&cb_ZEN2=on&cb_measurements=on&cb_base=on&cb_avx=on&cb_avx2=on&cb_sse=on
It may be break-even on other CPUs for throughput, and a win for latency.
Conroe and Nehalem have single-uop CVTTSD2SI (r32, xmm), so a scalar round trip is best there for throughput.
SnB-family CPUs use 2 uops for conversions to/from scalar integer and for PD<->DQ, but the xmm->GP-integer uop runs on port 0, so a scalar round trip distributes back-end port pressure more evenly. The best choice therefore depends on the surrounding code when tuning for Intel without caring about Zen2. In a loop that *only* does (float)(int)x, scalar has 1/clock throughput on Skylake because it avoids a port-5 bottleneck: the two scalar instructions are p0 + p01 and p5 + p01, whereas cvttpd2dq and the conversion back would both be p5 + p01.
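Concretely, the pattern being costed is just the cast round trip below (written for double to match the CVTTSD2SI discussion; the float case is analogous). The comments sketch the two candidate lowerings named above; register choices are illustrative, not what any particular compiler emits today.

```c
/* The FP -> int -> FP round trip under discussion.
 * Candidate x86 lowerings (illustrative register allocation):
 *   scalar: cvttsd2si eax, xmm0   then  cvtsi2sd xmm0, eax
 *   vector: cvttpd2dq xmm0, xmm0  then  cvtdq2pd xmm0, xmm0
 */
double roundtrip(double x) {
    return (double)(int)x;
}
```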
We can avoid a false dependency by converting back into the XMM register we came from, so a scalar round trip can avoid needing an extra instruction to xor-zero the destination. (Apparently we miss that optimization for float in the test cases without this patch.)
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D77895/new/
https://reviews.llvm.org/D77895