[PATCH] D28508: [NVPTX] Lower to sqrt.approx and rsqrt.approx under more circumstances.

Tue Jan 10 10:11:00 PST 2017

jlebar added a comment.

In https://reviews.llvm.org/D28508#641205, @hfinkel wrote:

> Can you comment on how this relates to other targets? On x86, AArch64, PPC, and for the AMD GPUs, we have implemented the callback functions getSqrtEstimate and getRecipEstimate to handle generating estimates. The callbacks also specify how many refinement iterations are used to provide answers of approximately the correct precision.

Is your thought that because we provide these, we should emit our own such functions rather than emitting the .approx versions?

This being SASS I am not totally sure what sqrt.approx.f32 is doing, but here's SASS that does

  void f(float a, float* b) { *b = sqrt.approx.f32(a); }

https://gist.github.com/anonymous/79a2a90fd22b0fa37fd3e880641bb9b4.  It looks like one iteration of Newton's method?

And here's the equivalent code for rsqrt.approx.f32: https://gist.github.com/0d8881a652e039ee3aff566176d9c98b -- it's the same as above just without a call to recip.

An unfortunate effect of the fact that we're using ptxas is that we may not be able to match the performance of {r}sqrt.approx with our own implementation in ptx.  My thought is we should turn this on in any case, particularly because different GPUs have different machine instructions, and maybe one will either be slow with our PTX Newton's method implementation, or maybe it will have an explicit {r}sqrt.approx machine instruction.  Then if someone wants to characterize the performance difference between this and LLVM's getSqrtEstimate on their particular hw, they can.

Does that sound like a reasonable plan?

At the moment this all moot because of an llvm bug where if you compile with the UnsafeFPMath TargetOption enabled, then call an intrinsic (or any other function) which doesn't have the unsafe-fp-math attr set, the inliner will explicitly set UnsafeFPMath=false on you.  This has the effect of pretty much disabling UnsafeFPMath across the board in CUDA programs (and I expect lots of other programs too).

> P.S. You might also look at implementing combineRepeatedFPDivisors in NVPTX.

Thanks, I will look into that.  There's already div.approx enabled with fast-math, and I'll need to figure out whether this is preferable.

================
Comment at: llvm/lib/Target/NVPTX/NVPTXInstrInfo.td:941
+// TODO: Should we turn this on when only one of the *APPROX flags is enabled?
+// Our value is already approximate...
 //
----------------
hfinkel wrote:
> I don't understand this comment. I thought that f32 sqrt on NVIDIA GPUs was only approximate for early generations (sm_1x) and was correct for later ones (sm_2x+): https://devtalk.nvidia.com/default/topic/475101/sqrt-precision/
> 
This is inverse-sqrt; the prx rsqrt instruction is always approximate.  Per the first para of the comment, we only enable this transformation when both approx sqrt and approx div is enabled.  The comment here is saying, maybe we should also enable it when one of the two is enabled.

If this was confusing to you it's probably not well-written, but I am sort of at a loss as to how to improve it.  Suggestions more than welcome.

https://reviews.llvm.org/D28508