[PATCH] D28508: [NVPTX] Lower to sqrt.approx and rsqrt.approx under more circumstances.

Fri Jan 13 18:19:49 PST 2017

hfinkel added a comment.

Alright, let me try to summarize what we want to do here...

1. Update the LangRef to say that llvm.sqrt returns undef when provided with an input < -0 (instead of having undefined behavior).
2. In InstCombine, transform llvm.nvvm.sqrt.f into

  s = llvm.sqrt(arg); select(arg >= -0.0, s, NaN);

  unless we have 'fast' set on the intrinsic, in which case we don't need the select (and we set 'fast' on the llvm.sqrt).
3. When we get to the backend:
   a. We pattern match the select + llvm.sqrt (no 'fast') into the regular sqrt call.
   b. We otherwise transform llvm.sqrt (no 'fast') into the regular sqrt call.
   c. We transform select + llvm.sqrt (with 'fast'), or just the llvm.sqrt (with 'fast'), into some approximation.

   For the approximation, we can use:
   a. The approximation that PTX provides (sqrt.approx.f32).
   b. The approximation that PTX provides, potentially with some extra Newton iterations.
   c. Our own approximation, formed using rsqrt.approx.f32 (the generic code in DAGCombine should do this automatically using x*rsqrt(x)).
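
As a rough model of the rewrite in step 2, here is a sketch in plain C++ rather than LLVM IR. The name nvvm_sqrt_model is hypothetical, and std::sqrt only stands in for llvm.sqrt; the point is the shape of the select that guards the negative-input case, not the actual InstCombine code.

```cpp
#include <cmath>
#include <limits>

// Hypothetical model of the non-'fast' lowering:
//   s = llvm.sqrt(arg); select(arg >= -0.0, s, NaN)
// std::sqrt stands in for llvm.sqrt. Unlike the proposed llvm.sqrt, std::sqrt
// already yields NaN for negative inputs; the select is what would make the
// result well defined if s were undef there.
float nvvm_sqrt_model(float arg) {
    float s = std::sqrt(arg);
    // The comparison is false for negative inputs and for NaN, so both
    // cases take the NaN arm of the select.
    return (arg >= -0.0f) ? s : std::numeric_limits<float>::quiet_NaN();
}
```

With 'fast' set, the select would be dropped and only the sqrt would remain.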

We might not want the cost of the extra denormal handling in the PTX-provided approximation. But if we lower the nvvm intrinsic in the same way as the regular sqrt intrinsic, are we then forced to pay that cost, given that the nvvm intrinsic has fixed semantics? If the answer is no (and since you want to transform 1/nvvm.sqrt -> rsqrt, I assume you're comfortable with that answer), then we should consider other options. We should provide the option of generating Newton iteration fixups, as other targets do, as a user-configurable option (using our target-independent infrastructure for this purpose). The natural way of doing that is just to let the code in DAGCombine form the sqrt approximation from x*rsqrt(x), which is probably faster than the PTX "builtin" approximation anyway (because it lacks the denormal fixups). I think it is definitely worth an experiment (i.e. is x*rsqrt(x) faster than sqrt?).
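
To make the x*rsqrt(x) idea concrete, here is a minimal C++ sketch of forming sqrt from a reciprocal-square-root estimate plus a configurable number of Newton iterations. Everything in it is an assumption for illustration: rsqrt_estimate is a hypothetical stand-in for rsqrt.approx.f32 (deliberately skewed by 1% so the refinement is visible), sqrt_via_rsqrt is a made-up name, and the zero guard is an assumed fixup, not a description of what DAGCombine emits.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for rsqrt.approx.f32: the exact value skewed by 1%
// so that the effect of the refinement steps is visible.
float rsqrt_estimate(float x) {
    return 1.01f / std::sqrt(x);
}

// Approximate sqrt(x) as x * rsqrt(x), refining the rsqrt estimate with
// `steps` Newton iterations of the form r <- r * (1.5 - 0.5 * x * r * r).
float sqrt_via_rsqrt(float x, int steps) {
    assert(steps >= 0);
    // rsqrt(0) is +inf and 0 * inf is NaN, so zero is handled separately
    // (an assumed fixup for this sketch).
    if (x == 0.0f)
        return x;
    float r = rsqrt_estimate(x);
    for (int i = 0; i < steps; ++i)
        r = r * (1.5f - 0.5f * x * r * r);
    return x * r;
}
```

Each iteration roughly squares the relative error of the estimate, which is why the number of steps is a natural user-facing knob. Note that the sketch does no denormal handling at all.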

Regardless, we should find a way to hook this up to our target-independent infrastructure, which allows the user to select where to generate approximations and how many Newton iterations to use. To do this, we should implement the associated TLI callbacks instead of directly matching the patterns in the TableGen files.

https://reviews.llvm.org/D28508