[PATCH] D28508: [NVPTX] Implement NVPTXTargetLowering::getSqrtEstimate.

Tue Jan 31 15:43:17 PST 2017

jlebar added a comment.

In https://reviews.llvm.org/D28508#662282, @escha wrote:

> That really surprises me that it's faster! I would expect SFU functions like RCP/RSQRT to dwarf the cost of a multiply, especially for double.

Me too.  :)

> Also, do be careful that rcp(rsqrt(x)) and x * rsqrt(x) have different precisions under some implementations (because fmul is 0.5 ULP, while rcp/rsqrt may be as low as 2.5 ULP each).

Yeah, I'm banking on the "you asked for it" aspect of fast-math.  In particular, the only approximate f64 rcp instruction is flush-to-zero, so we use it even if ftz is otherwise disabled.
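
As a rough back-of-the-envelope check on the precision concern, the sketch below adds up worst-case errors to first order.  It assumes the ULP figures quoted above (0.5 for fmul, 2.5 each for rcp and rsqrt) and treats them as simply additive; the constant and function names are made up for illustration, and none of this is a statement about any particular GPU's documented accuracy.

```cpp
// Worst-case error figures quoted in the review thread, in ULP.
// These are assumptions taken from the discussion, not measured values.
constexpr double kFmulUlp = 0.5;
constexpr double kRcpUlp = 2.5;
constexpr double kRsqrtUlp = 2.5;

// First-order bound for x * rsqrt(x): the rsqrt error passes through the
// multiply, and the multiply then adds its own rounding error.
constexpr double mulRsqrtBoundUlp() { return kRsqrtUlp + kFmulUlp; }

// First-order bound for rcp(rsqrt(x)): the rsqrt error passes through the
// reciprocal, and the reciprocal then adds its own error.
constexpr double rcpRsqrtBoundUlp() { return kRsqrtUlp + kRcpUlp; }
```

Under those assumptions the multiply form is bounded by roughly 3 ULP and the rcp form by roughly 5 ULP, which is the gap being traded away for speed here.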

The performance difference is the same with and without ftz on the mul:

  precise sqrt                   - 73us
  x * rsqrt.approx(x)            - 64us
  rcp.approx(rsqrt.approx(x))    - 48us
  rsqrt.approx(x)                - 48us

Maybe it's an unfair microbenchmark, because I do nothing other than the sqrt and a store.  https://gist.github.com/0ac6f0b0f994339838f5452f96e77cff
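
For readers who don't follow the link, the shape of a "sqrt and a store" loop looks roughly like the host-side C++ sketch below.  This is not the code from the gist (which runs on the GPU); the helper names are hypothetical, and the rsqrt/rcp stand-ins are ordinary exact divisions rather than the approximate hardware instructions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Host-side stand-ins for the GPU instructions (assumption: on NVPTX these
// would be approximate hardware ops; here they are plain libm/division).
template <typename T> T rsqrtStandIn(T x) { return T(1) / std::sqrt(x); }
template <typename T> T rcpStandIn(T x) { return T(1) / x; }

// The two sqrt formulations discussed in this thread.
template <typename T> T sqrtViaMul(T x) { return x * rsqrtStandIn(x); }
template <typename T> T sqrtViaRcp(T x) { return rcpStandIn(rsqrtStandIn(x)); }

// One sqrt variant per element followed by one store, and nothing else.
template <typename T, typename F>
void sqrtAndStore(const std::vector<T> &in, std::vector<T> &out, F f) {
  out.resize(in.size());
  for (std::size_t i = 0; i < in.size(); ++i)
    out[i] = f(in[i]);
}
```

With so little work per element, loop and memory overhead can make up a large share of the measured time, so a benchmark of this shape may not reflect how the variants behave inside a larger kernel.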

Repository:
  rL LLVM

https://reviews.llvm.org/D28508