[PATCH] D28508: [NVPTX] Lower to sqrt.approx and rsqrt.approx under more circumstances.

Wed Jan 11 20:11:26 PST 2017

mehdi_amini added a comment.

In https://reviews.llvm.org/D28508#641479, @jlebar wrote:

> In https://reviews.llvm.org/D28508#641205, @hfinkel wrote:
>
> > Can you comment on how this relates to other targets? On x86, AArch64, PPC, and for the AMD GPUs, we have implemented the callback functions getSqrtEstimate and getRecipEstimate to handle generating estimates. The callbacks also specify how many refinement iterations are used to provide answers of approximately the correct precision.
>
>
> Is your thought that because we provide these, we should emit our own such functions rather than emitting the .approx versions?
>
> This being SASS I am not totally sure what sqrt.approx.f32 is doing, but here's SASS that does
>
>   void f(float a, float* b) { *b = sqrt.approx.f32(a); }
>   
>
> https://gist.github.com/anonymous/79a2a90fd22b0fa37fd3e880641bb9b4.  It looks like one iteration of Newton's method?

I don't see any Newton-Raphson there, but I'm not used to reading SASS. It seems to me that it does the following:

1. Check if the input is a denormal to anticipate underflow: FSETP.LT.AND P0, PT, |R0|, 1.175494350822287508e-38, PT;
2. Scale up: FMUL R0, R0, 16777216;
3. If it was a denormal, use the upscaled value, otherwise use the original: SEL R0, R0, c[0x0][0x140], P0;

(I have no idea what the load to R3 is doing here...)

4. Reciprocal square root (may go to infinity or NaN with a denormal depending on the implementation, hence the scaling above?): MUFU.RSQ R0, R0;
5. Reciprocal, to get the sqrt result: MUFU.RCP R0, R0;
6. Scale down (only if the input was a denormal): @P0 FMUL R0, R0, 0.000244140625;
7. Store the result.

So without any Newton iteration, I don't think this can be IEEE compliant, which is expected given that it is the `approx` version. This is fine with fast-math though (I'm fairly sure some graphics shader compilers wouldn't even care about handling the denormals...).
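
For comparison, this is roughly what one Newton-Raphson refinement step on a reciprocal square root estimate would look like in C. It is a generic sketch, not what any particular target emits, and the function name is made up. Each step roughly doubles the number of correct bits of the estimate, but by itself it still does not guarantee a correctly rounded result.

```c
/* One Newton-Raphson step for y ~= 1/sqrt(x), derived from
   f(y) = 1/y^2 - x:  y' = y * (1.5 - 0.5 * x * y * y).
   Hypothetical helper, for illustration only. */
float rsqrt_refine_once(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}
```

For example, with x = 4 and a crude estimate y = 0.4, one step gives about 0.472, closer to the exact 0.5. Nothing of this shape shows up in the instruction sequence above.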

What does the SASS look like for the non-approx version? (I believe llvm.sqrt should do the same as the non-approx version when no fast-math flags are set.)

> An unfortunate effect of the fact that we're using ptxas is that we may not be able to match the performance of {r}sqrt.approx with our own implementation in ptx.

Can you clarify what you mean?

https://reviews.llvm.org/D28508