[PATCH] D28508: [NVPTX] Lower to sqrt.approx and rsqrt.approx under more circumstances.

Wed Jan 11 17:45:48 PST 2017

jlebar added subscribers: mehdi_amini, arsenm.
jlebar added a comment.

> When you say, "turn this on", what exactly is "this"? Do you mean the PTX builtin-approximation, or do you mean using our generic CodeGen with one newton iteration, or do you mean using the PTX builtin with our own CodeGen but subtracting the one newton iteration from the count, or something else?

I would like to enable the functionality that this patch is (trying) to enable, namely emitting ptx-builtin-approximate versions of these functions when compiling with fastmath.

> The way we normally seem to handle this kind of situation, which I think it probably better in this case too, is to transform the target-specific intrinsic into generic IR in InstCombine.

Thanks for the pointer, I wasn't aware of that, but I agree it's nicer.  I'm spinning a patch to use AutoUpgrade to get rid of some nvvm intrinsics entirely, and to use InstCombine to transform other nvvm intrinsics that we can't unconditionally remove into llvm intrinsics, where possible.

But specifically for sqrt (AFAIK) I need to rely on some behavior that's not quite kosher according to the langref.  Specifically, I need to transform llvm.nvvm.sqrt.f into

  s = llvm.sqrt(arg)
  select(arg >= -0.0, s, NaN)

The langref says that llvm.sqrt is undefined behavior if you call it with a negative value, but @arsenm, @mehdi_amini, and I think it should say that it *returns undef* if passed a negative value.  arsenm looked at the blame and the language has been unchanged since llvm.sqrt was added.  His and @mehdi_amini's GPU backends already assume the "returns undef" semantics, and I have code in flight for XLA to do the same.  One of them also did a quick audit and concluded none of our current optimizations use the "undefined behavior" behavior.

Mehdi pointed me at a larger patch to fix up this and other math intrinsics which I now can't find (sorry), but maybe we can make this smaller change in parallel?

WDYT?

https://reviews.llvm.org/D28508