[PATCH] D28508: [NVPTX] Lower to sqrt.approx and rsqrt.approx under more circumstances.

Thu Jan 12 10:09:29 PST 2017

jlebar added a comment.

In https://reviews.llvm.org/D28508#643675, @mehdi_amini wrote:

> In https://reviews.llvm.org/D28508#641479, @jlebar wrote:
>
> > In https://reviews.llvm.org/D28508#641205, @hfinkel wrote:
> >
> > > Can you comment on how this relates to other targets? On x86, AArch64, PPC, and for the AMD GPUs, we have implemented the callback functions getSqrtEstimate and getRecipEstimate to handle generating estimates. The callbacks also specify how many refinement iterations are used to provide answers of approximately the correct precision.
> >
> >
> > Is your thought that because we provide these, we should emit our own such functions rather than emitting the .approx versions?
> >
> > This being SASS I am not totally sure what sqrt.approx.f32 is doing, but here's SASS that does
>

Ah, thanks for figuring it out!  Of course `MUFU.RSQ` is a reciprocal sqrt instruction.  Maybe we're loading something into R3 as preparation for the MUFU.RSQ; otherwise that just seems like dumb codegen.

> So without any Newton iteration, I don't think this can be IEEE compliant, which is expected considering that it is the `approx` version. This is fine with fast-math though (I'm fairly sure some graphics shader compilers wouldn't even care about handling the denormals...).

Right.

> What does the SASS look like for the non-approx version? (I believe llvm.sqrt should do the same as the non-approx version when no fast-math flags are set.)

https://gist.github.com/b3fa71a72a02785cc47be606556d6d4a

> > An unfortunate effect of the fact that we're using ptxas is that we may not be able to match the performance of {r}sqrt.approx with our own implementation in ptx.
>
> Can you clarify what you mean?

I meant that I wasn't sure whether we could generate code that matches the performance and accuracy of PTX sqrt.approx without using that instruction (e.g. by having LLVM emit a Newton's method sequence).  In particular, now that you've parsed the asm, we can see that we're calling a special HW instruction for the rsqrt, and I have no way to cause that instruction to be emitted except by writing PTX sqrt.approx.

Unless the suggestion is to take the approx sqrt generated by PTX sqrt.approx and then refine it using Newton's method?  That's an interesting idea but out of scope for this patch, I think.  I'd rather wait to do that until someone wants it.
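
For concreteness, here is a rough sketch of what "refine the estimate with Newton's method" would mean numerically.  This is plain Python, not anything NVPTX emits; `coarse_rsqrt` is a made-up stand-in for the hardware estimate (I'm assuming a 0.1% relative error purely for illustration, which is not a claim about MUFU.RSQ's actual accuracy), and the other function names are hypothetical too.

```python
import math


def coarse_rsqrt(x: float) -> float:
    # Hypothetical stand-in for a hardware estimate: the exact 1/sqrt(x)
    # scaled by an assumed 0.1% relative error (illustrative only).
    return (1.0 / math.sqrt(x)) * (1.0 + 1e-3)


def refine_rsqrt(x: float, y: float) -> float:
    # One Newton-Raphson step on f(y) = 1/y^2 - x, which gives
    # y_next = y * (1.5 - 0.5 * x * y^2).
    return y * (1.5 - 0.5 * x * y * y)


def rel_error(x: float, y: float) -> float:
    # Relative error of y as an approximation of 1/sqrt(x).
    return abs(y * math.sqrt(x) - 1.0)


def approx_sqrt(x: float, iterations: int = 1) -> float:
    # sqrt(x) = x * rsqrt(x), using the refined reciprocal square root.
    y = coarse_rsqrt(x)
    for _ in range(iterations):
        y = refine_rsqrt(x, y)
    return x * y
```

Each step roughly squares the relative error (times 1.5), so the number of refinement steps is what sets the final precision.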

https://reviews.llvm.org/D28508