[llvm] Add constant-folding for unary NVVM intrinsics (PR #141233)

Artem Belevich via llvm-commits llvm-commits at lists.llvm.org
Fri Jul 11 10:51:37 PDT 2025


================
@@ -2548,6 +2653,170 @@ static Constant *ConstantFoldScalarCall1(StringRef Name,
         return ConstantFoldFP(atan, APF, Ty);
       case Intrinsic::sqrt:
         return ConstantFoldFP(sqrt, APF, Ty);
+
+      // NVVM Intrinsics:
+      case Intrinsic::nvvm_ceil_ftz_f:
+      case Intrinsic::nvvm_ceil_f:
+      case Intrinsic::nvvm_ceil_d:
+        return ConstantFoldFP(
+            ceil, APF, Ty,
+            nvvm::GetNVVMDenromMode(
+                nvvm::UnaryMathIntrinsicShouldFTZ(IntrinsicID)));
+
+      case Intrinsic::nvvm_cos_approx_ftz_f:
+      case Intrinsic::nvvm_cos_approx_f:
+        return ConstantFoldFP(
+            cos, APF, Ty,
+            nvvm::GetNVVMDenromMode(
+                nvvm::UnaryMathIntrinsicShouldFTZ(IntrinsicID)));
+
+      case Intrinsic::nvvm_ex2_approx_ftz_f:
+      case Intrinsic::nvvm_ex2_approx_d:
+      case Intrinsic::nvvm_ex2_approx_f:
+        return ConstantFoldFP(
+            exp2, APF, Ty,
+            nvvm::GetNVVMDenromMode(
+                (nvvm::UnaryMathIntrinsicShouldFTZ(IntrinsicID))));
----------------
Artem-B wrote:

> In nvcc, these intrinsics only get generated when users opt in with the --use_fast_math flag.

It's not directly relevant here. LLVM's input is IR, and it does not matter which front-end generated it or why. We have to operate based on the given IR, not on implicit assumptions about why/when a front-end may have generated a particular intrinsic call. For all LLVM knows, the IR was written by something that does care about the intrinsic generating exactly the instruction it maps to, getting that instruction executed on the GPU, and getting back the GPU-computed result. Granted, ptxas itself may const-fold it on the way to SASS, but we can ignore that for now, as LLVM's responsibility ends at the PTX level.

We cannot assume that the front-end only uses that intrinsic when we're compiling with fast-math enabled. If we get "fast-math" hints in IR, that's fine, but the signal has to come from either the IR or LLVM's own options.
The front-end is free to do its own const-folding when it generates the IR, but that's up to the front-end and should be done there, before things are passed to LLVM.

> Users explicitly using them via intrinsic calls like __sinf(x) are [told here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#intrinsic-functions) that they are "less accurate, but faster".

The problematic aspect is that, depending on the optimization level, the same intrinsic will produce numeric results that differ by more than one would expect from the "usual" differences in FP ops, and that appears to be the main issue with folding approximate intrinsics.

> For the extremely rare case of a user wanting just the imprecision, regardless of the speed, using -disable-fp-call-folding or applying StrictFP to the function should already allow for this.

It's not the "users want the imprecision", but rather, "users want their tests to work with and without optimizations the same way".
Yes, we have the ways to disable const-folding, or relax test checks, but my point is that the change introduces numeric instability of potentially higher magnitude than would be normally tolerable.

It may be OK, but I do not know where the boundary of what's tolerable lies in this case. I'm biased towards keeping the default behavior unchanged and enabling imprecise folding only if we have an unambiguous signal that it's OK to do so, e.g. if we know that fast-math is in effect.
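For illustration only, the kind of gating I mean could look roughly like this (a sketch, not a finished check; it assumes the `Call` parameter of ConstantFoldScalarCall1 is in scope and uses hasApproxFunc(), i.e. the `afn` fast-math flag on the call, as the "unambiguous signal"; helper names match the patch above):

      // Sketch: only fold the approximate NVVM intrinsics when the call itself
      // carries an IR-level fast-math signal.
      case Intrinsic::nvvm_cos_approx_ftz_f:
      case Intrinsic::nvvm_cos_approx_f:
        // Without an explicit signal, leave the call for the GPU to evaluate.
        if (!Call || !Call->hasApproxFunc())
          return nullptr;
        return ConstantFoldFP(
            cos, APF, Ty,
            nvvm::GetNVVMDenromMode(
                nvvm::UnaryMathIntrinsicShouldFTZ(IntrinsicID)));

The exact predicate and where to put the check are of course up for discussion.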

> Constant folding is still enabled at -O0, so checking whether the results match will depend more on whether the constant value can reach this intrinsic call, and not what the precision of folding it is.

I do not think that's the case for clang: https://godbolt.org/z/rzY5hezE3
AFAICT, -O0 does not fold `__builtin_sinf`, even with fast-math, but -O1 does.
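Roughly the kind of input behind that godbolt link (a hypothetical repro, not the exact one):

      // Compiled with clang and -ffast-math: at -O0 the call to sinf remains,
      // while at -O1 it is constant-folded.
      float fold_me(void) { return __builtin_sinf(1.0f); }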

> Since these approx intrinsics require --use_fast_math, the tests should really use precision/tolerance knobs already, as fma fusion etc. will already be differing between the optimization levels if they are already using --use_fast_math.

This assertion appears to assume that `nvcc` is the only front-end that may use those intrinsics. That is not the case. We have no visibility into who generates the input IR, where, or how, and we cannot assume that `these approx intrinsics require --use_fast_math` unless the IR has an explicit indication of that fact.

> Both the PTX and Cuda specs only give max error ranges,

And that gives us an upper bound on the expected differences vs. unfolded intrinsics. If it's OK to introduce that magnitude of numeric difference, then we're fine.
If it's allowed only under some conditions (fast-math? an explicit LLVM flag? something else?), then we need to check for them.
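As an aside on the "precision/tolerance knobs" point above: if folding approximate intrinsics is allowed, test-side checks would need to compare against the documented bound rather than expect bit-exact results. Something like this ULP-based helper is one way to do that (illustrative only; the documented bounds are per-intrinsic and sometimes stated as absolute error rather than ULPs, and maxUlps is a placeholder):

      #include <cmath>
      #include <cstdint>
      #include <cstring>

      // Map float bits to a monotonically ordered integer so that the integer
      // distance between two values is their distance in ULPs.
      static int64_t orderedBits(float f) {
        int32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));
        return bits < 0 ? int64_t(INT32_MIN) - int64_t(bits) : int64_t(bits);
      }

      // Accept `actual` (e.g. the GPU-computed result) if it is within
      // `maxUlps` of `expected` (e.g. the constant-folded result).
      static bool withinUlps(float expected, float actual, int64_t maxUlps) {
        if (std::isnan(expected) || std::isnan(actual))
          return std::isnan(expected) && std::isnan(actual);
        int64_t d = orderedBits(expected) - orderedBits(actual);
        return (d < 0 ? -d : d) <= maxUlps;
      }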



https://github.com/llvm/llvm-project/pull/141233

