[PATCH] D28508: [NVPTX] Lower to sqrt.approx and rsqrt.approx under more circumstances.

Hal Finkel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jan 11 14:25:06 PST 2017


hfinkel added a comment.

In https://reviews.llvm.org/D28508#641479, @jlebar wrote:

> In https://reviews.llvm.org/D28508#641205, @hfinkel wrote:
>
> > Can you comment on how this relates to other targets? On x86, AArch64, PPC, and for the AMD GPUs, we have implemented the callback functions getSqrtEstimate and getRecipEstimate to handle generating estimates. The callbacks also specify how many refinement iterations are used to provide answers of approximately the correct precision.
>
>
> Is your thought that because we provide these, we should emit our own such functions rather than emitting the .approx versions?
>
> I am not totally sure what sqrt.approx.f32 actually does at the SASS level, but here's the SASS for
>
>   void f(float a, float* b) { *b = sqrt.approx.f32(a); }
>   
>
> https://gist.github.com/anonymous/79a2a90fd22b0fa37fd3e880641bb9b4.  It looks like one iteration of Newton's method?
>
> And here's the equivalent code for rsqrt.approx.f32: https://gist.github.com/0d8881a652e039ee3aff566176d9c98b -- it's the same as above just without a call to recip.
>
> An unfortunate consequence of going through ptxas is that we may not be able to match the performance of {r}sqrt.approx with our own PTX implementation.  My thought is that we should turn this on in any case, particularly because different GPUs have different machine instructions: a given GPU may be slow with our PTX Newton's-method expansion, or it may have an explicit {r}sqrt.approx machine instruction.  Then, if someone wants to characterize the performance difference between this and LLVM's getSqrtEstimate on their particular hardware, they can.
>
> Does that sound like a reasonable plan?


When you say "turn this on", what exactly is "this"? Do you mean the PTX builtin approximation, or using our generic CodeGen with one Newton iteration, or using the PTX builtin together with our own CodeGen but subtracting that one Newton iteration from the refinement count, or something else?
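
(For concreteness, by "one Newton iteration" I mean the standard Newton-Raphson refinement step applied on top of a hardware rsqrt estimate. A rough C++ sketch of just that step; the function name is my own, not anything in the patch or in LLVM:)

  // One Newton-Raphson refinement step for an estimate y ~= 1/sqrt(x):
  //   y' = y * (1.5 - 0.5 * x * y * y)
  // This is the kind of fixup a getSqrtEstimate-based lowering emits around
  // a hardware estimate instruction; the SASS in the first gist looks like
  // rsqrt, this fixup, and then a reciprocal to recover sqrt(x).
  float refineRsqrt(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
  }

sqrt(x) itself can then be recovered either via a reciprocal of the refined rsqrt, as in the gist, or as x * rsqrt(x).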



================
Comment at: llvm/lib/Target/NVPTX/NVPTXInstrInfo.td:938
+// libdevice and the CUDA headers emit llvm.nvvm.sqrt.f, and we want this
+// transformation to apply there.
+//
----------------
The way we normally seem to handle this kind of situation, which I think is probably better in this case too, is to transform the target-specific intrinsic into generic IR in InstCombine. We might want to move all of that logic into the targets at some point via TTI, but that's another matter.
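
To illustrate what I mean, here is a rough sketch of such a fold, mapping @llvm.nvvm.sqrt.f onto the generic sqrt intrinsic. This is only meant to show the shape of the transform, not the exact InstCombine plumbing, and I'm writing the API calls from memory:

  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/IntrinsicInst.h"
  #include "llvm/IR/Intrinsics.h"
  #include "llvm/IR/Module.h"
  using namespace llvm;

  // Rewrite a call to the target-specific @llvm.nvvm.sqrt.f intrinsic into
  // the generic @llvm.sqrt.f32, so the generic sqrt lowering (and the
  // estimate machinery) can see it.  The InstCombine visitor calling this
  // would erase the old call and, presumably, carry over any fast-math
  // flags on it.
  static Value *rewriteNVVMSqrtToGenericSqrt(IntrinsicInst &II) {
    IRBuilder<> Builder(&II);
    Module *M = II.getModule();
    Function *GenericSqrt =
        Intrinsic::getDeclaration(M, Intrinsic::sqrt, II.getType());
    CallInst *NewSqrt = Builder.CreateCall(GenericSqrt, {II.getArgOperand(0)});
    II.replaceAllUsesWith(NewSqrt);
    return NewSqrt;
  }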


================
Comment at: llvm/lib/Target/NVPTX/NVPTXInstrInfo.td:941
+// TODO: Should we turn this on when only one of the *APPROX flags is enabled?
+// Our value is already approximate...
 //
----------------
jlebar wrote:
> hfinkel wrote:
> > I don't understand this comment. I thought that f32 sqrt on NVIDIA GPUs was only approximate for early generations (sm_1x) and was correct for later ones (sm_2x+): https://devtalk.nvidia.com/default/topic/475101/sqrt-precision/
> > 
> This is inverse-sqrt; the PTX rsqrt instruction is always approximate.  Per the first paragraph of the comment, we only enable this transformation when both approx sqrt and approx div are enabled.  The comment here is saying that maybe we should also enable it when only one of the two is enabled.
> 
> If this was confusing to you it's probably not well-written, but I am sort of at a loss as to how to improve it.  Suggestions more than welcome.
I missed that this was specifically talking about the inverse. The comment starts by talking about "@llvm.sqrt.f32 and @llvm.nvvm.sqrt.f", which look like regular sqrts. It would help to say explicitly that, unlike the sqrt, which is (or at least may be) exact, the rsqrt is always approximate.


https://reviews.llvm.org/D28508




