[PATCH] D28508: [NVPTX] Lower to sqrt.approx and rsqrt.approx under more circumstances.

Sat Jan 14 15:09:26 PST 2017

hfinkel added a comment.

In https://reviews.llvm.org/D28508#646153, @jlebar wrote:

> > We might not want the cost of the extra denormal handling in the PTX-provided approximation, but if we lower the nvvm intrinsic in the same way as the regular sqrt intrinsic, then are we forced to do this (it has fixed semantics)?
>
> The SASS for sqrt.approx.f32 and rsqrt.approx.f32 are identical except for the presence of an additional reciprocal call in sqrt.approx.f32.  They both have special cases for denormals.
>
> https://gist.github.com/anonymous/79a2a90fd22b0fa37fd3e880641bb9b4
>  https://gist.github.com/anonymous/0d8881a652e039ee3aff566176d9c98b
>
> If you want a version without the denormal handling, you want sqrt.approx.f32.ftz and rsqrt.approx.f32.ftz.  These look like:
>
> https://gist.github.com/031ea494458f44e2d1ef4e16eec51699
>  https://gist.github.com/b352a20d18792d05395f76ab3aad742f
>
> (Unlike the ones above, these are not identical except for the presence/absence of a reciprocal call.  But they're nonetheless very close.)
>
> With this patch we lower the "fast" version of llvm.sqrt.f32 to sqrt.approx.f32 or sqrt.approx.f32.ftz as appropriate based on the module's configuration, and same for 1.0/llvm.sqrt.f32 going to rsqrt.approx.f32{.ftz}.
>
> This patch lets us treat llvm.nvvm.sqrt.f the same as llvm.sqrt.f32, but I'm happy to get rid of this once we auto-upgrade the nvvm intrinsic to the generic intrinsic (which we can't do until this patch lands because that would regress performance).
>
> Right now ftz is specified in one of three ways (see NVVMReflect.cpp), but I have a patch out to reduce it to just one way.  That will then make it easier for us to switch it to one canonical way, instead of the janky "__CUDA_FTZ" metadata thing we have now.
>
> > [...] The natural way of doing that is just to let the code in DAGCombine form the sqrt approximation from x*rsqrt(x), which is probably faster than the PTX "builtin" approximation anyway (because it lacks the denormal fixups). I think it is definitely worth an experiment (i.e. is x*rsqrt faster than sqrt?).
>
> The denormal behaviors of the reciprocal and non-reciprocal versions of sqrt.approx.f32{.ftz} are all the same as far as I can tell.
>
> I'm happy to entertain the addition of more exotic (r)sqrt approximations controlled by flags or whatever, but this is likely not something my customers care deeply about.

Okay, thanks for checking. If there's no performance benefit to generating x*rsqrt.approx(x) over sqrt.approx(x), then there's no need to consider it.

However, many of us have customers who do tend to care about being able to adjust the accuracy of the approximations used for sqrt (and reciprocals for division). I certainly do. If my Googling has been sufficiently informative, these approximations generally have a maximum numerical error ~2-3 ULP. This will be good enough to many purposes, but prohibitively large for some others. Because this is a generic property of hardware-provided approximations, our target-independent codegen has an infrastructure for generating refinement code of user-configurable expense. Using this should be considered the best practice for backend construction. Specifically, unless the provided approximations have a maximum error <= 1 ULP, then refinement might be desirable, and the backend should never directly match the sqrt intrinsic (etc.) in TableGen. Instead, the backend should always implement the associated TLI callbacks so that the user can independently control which approximations are used, and how aggressively (if at all) these are refined. For NVPTX, I'd expect the default number of refinement iterations will be set to zero.

I realize that I'm asking for a little extra work here, but this is the right way to implement this support.

That having been said, you still need part of this patch, AFAIKT, because all of this infrastructure to which I'm referring is in DAGCombine, and we can't depend on DAGCombine to lower otherwise-legal instructions. So you do need to match sqrt, regardless of anything else. You should not, however, need to add the 1/sqrt -> rsqrt patterns, but once the TLI callbacks are implemented, DAGCombine should do that for you. Thanks again!

https://reviews.llvm.org/D28508