[PATCH] D28508: [NVPTX] Lower to sqrt.approx and rsqrt.approx under more circumstances.

Fri Jan 13 18:40:03 PST 2017

jlebar added a comment.

> We might not want the cost of the extra denormal handling in the PTX-provided approximation, but if we lower the nvvm intrinsic in the same way as the regular sqrt intrinsic, then are we forced to do this (it has fixed semantics)?

The SASS for sqrt.approx.f32 and rsqrt.approx.f32 is identical except for an additional reciprocal call in sqrt.approx.f32.  Both have special cases for denormals.

https://gist.github.com/anonymous/79a2a90fd22b0fa37fd3e880641bb9b4
https://gist.github.com/anonymous/0d8881a652e039ee3aff566176d9c98b

If you want a version without the denormal handling, you want sqrt.approx.ftz.f32 and rsqrt.approx.ftz.f32.  These look like:

https://gist.github.com/031ea494458f44e2d1ef4e16eec51699
https://gist.github.com/b352a20d18792d05395f76ab3aad742f

(Unlike the pair above, these two differ in more than the presence or absence of a reciprocal call, but they're nonetheless very close.)

With this patch we lower the "fast" version of llvm.sqrt.f32 to sqrt.approx.f32 or sqrt.approx.ftz.f32, as appropriate for the module's configuration, and likewise lower 1.0/llvm.sqrt.f32 to rsqrt.approx{.ftz}.f32.
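To make the selection rule concrete, here is a rough sketch of the decision described above. It is not the code in the patch: `SqrtKind` and `pickApproxSqrt` are hypothetical names invented for illustration, and the two booleans stand in for "the call carries the fast flag" and "the module is configured for ftz". Note that PTX places the `.ftz` modifier before the type suffix.

```cpp
#include <optional>
#include <string>

// Hypothetical illustration only; not the NVPTX backend's actual code.
enum class SqrtKind { Sqrt, RSqrt };  // RSqrt models the 1.0/llvm.sqrt.f32 pattern

// Returns the approximate PTX mnemonic to use, or nullopt when the
// operation is not "fast" and so must not be lowered to an approximation.
std::optional<std::string> pickApproxSqrt(SqrtKind kind, bool isFast,
                                          bool useFTZ) {
  if (!isFast)
    return std::nullopt;

  std::string op = (kind == SqrtKind::RSqrt) ? "rsqrt" : "sqrt";
  op += ".approx";
  if (useFTZ)
    op += ".ftz";  // PTX spelling: modifier precedes the type suffix
  op += ".f32";
  return op;
}
```

For example, `pickApproxSqrt(SqrtKind::RSqrt, true, true)` yields `rsqrt.approx.ftz.f32`, while any call with `isFast == false` yields no approximate instruction at all.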

This patch lets us treat llvm.nvvm.sqrt.f the same as llvm.sqrt.f32, but I'm happy to get rid of this once we auto-upgrade the nvvm intrinsic to the generic intrinsic (which we can't do until this patch lands because that would regress performance).

Right now ftz is specified in one of three ways (see NVVMReflect.cpp), but I have a patch out to reduce it to just one way.  That will then make it easier for us to switch it to one canonical way, instead of the janky "__CUDA_FTZ" metadata thing we have now.

> [...] The natural way of doing that is just to let the code in DAGCombine form the sqrt approximation from x*rsqrt(x), which is probably faster than the PTX "builtin" approximation anyway (because it lacks the denormal fixups). I think it is definitely worth an experiment (i.e. is x*rsqrt faster than sqrt?).

The denormal behaviors of the reciprocal and non-reciprocal versions of sqrt.approx{.ftz}.f32 are all the same as far as I can tell.
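For anyone running the x*rsqrt(x) experiment, a host-side sketch of which inputs need a fixup may be useful. This is only an illustration under assumptions: `rsqrtRef`, `sqrtViaRsqrt`, and `flushDenormal` are hypothetical helpers, `1.0f / std::sqrt(x)` is a stand-in for the hardware instruction (whose accuracy differs), and the code assumes IEEE-754 float semantics without fast-math on the host.

```cpp
#include <cmath>
#include <limits>

// Host-side stand-in for rsqrt; NOT the hardware rsqrt.approx instruction.
inline float rsqrtRef(float x) { return 1.0f / std::sqrt(x); }

// sqrt formed as x * rsqrt(x) with no special-case fixups.
// For x == 0 this computes 0 * inf, and for x == inf it computes inf * 0;
// both are NaN under IEEE-754, whereas sqrt(0) == 0 and sqrt(inf) == inf.
inline float sqrtViaRsqrt(float x) { return x * rsqrtRef(x); }

// Emulates .ftz treatment of an input: subnormals become a
// sign-preserving zero, everything else passes through unchanged.
inline float flushDenormal(float x) {
  return std::fpclassify(x) == FP_SUBNORMAL ? std::copysign(0.0f, x) : x;
}
```

Ordinary inputs behave as expected (`sqrtViaRsqrt(4.0f)` is `2.0f`), but zero and infinity produce NaN, and a flushed denormal becomes a zero input, so an unguarded x*rsqrt(x) would need at least those cases handled before it could replace sqrt.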

I'm happy to entertain the addition of more exotic (r)sqrt approximations controlled by flags or whatever, but this is likely not something my customers care deeply about.

https://reviews.llvm.org/D28508