[llvm] Add constant-folding for unary NVVM intrinsics (PR #141233)

Fri Jul 11 03:55:36 PDT 2025

================
@@ -2548,6 +2653,170 @@ static Constant *ConstantFoldScalarCall1(StringRef Name,
         return ConstantFoldFP(atan, APF, Ty);
       case Intrinsic::sqrt:
         return ConstantFoldFP(sqrt, APF, Ty);
+
+      // NVVM Intrinsics:
+      case Intrinsic::nvvm_ceil_ftz_f:
+      case Intrinsic::nvvm_ceil_f:
+      case Intrinsic::nvvm_ceil_d:
+        return ConstantFoldFP(
+            ceil, APF, Ty,
+            nvvm::GetNVVMDenromMode(
+                nvvm::UnaryMathIntrinsicShouldFTZ(IntrinsicID)));
+
+      case Intrinsic::nvvm_cos_approx_ftz_f:
+      case Intrinsic::nvvm_cos_approx_f:
+        return ConstantFoldFP(
+            cos, APF, Ty,
+            nvvm::GetNVVMDenromMode(
+                nvvm::UnaryMathIntrinsicShouldFTZ(IntrinsicID)));
+
+      case Intrinsic::nvvm_ex2_approx_ftz_f:
+      case Intrinsic::nvvm_ex2_approx_d:
+      case Intrinsic::nvvm_ex2_approx_f:
+        return ConstantFoldFP(
+            exp2, APF, Ty,
+            nvvm::GetNVVMDenromMode(
+                (nvvm::UnaryMathIntrinsicShouldFTZ(IntrinsicID))));
----------------
LewisCrawford wrote:

- In nvcc, these intrinsics only get generated when users opt in with the `--use_fast_math` flag.
- Users explicitly using them via intrinsic calls like `__sinf(x)` are [told here](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#intrinsic-functions) that they are "less accurate, but faster".
- Both the PTX and CUDA specs only give maximum error bounds, not minimum ones, so evaluating these intrinsics at higher precision on the host does not breach the spec (even though the result can differ from what the device computes).
- If we don't fold these approx intrinsics, we will end up with cases where `--use_fast_math` is slower than the fully precise version (which will be fully foldable after my next patch). It seems against the spirit of `--use_fast_math` to make the result less fast in order to preserve the imprecision it introduces.
- For the extremely rare case of a user wanting just the imprecision, regardless of the speed, using `-disable-fp-call-folding` or applying StrictFP to the function should already allow for this.
- Constant folding is still enabled at -O0, so whether results match across optimization levels depends more on whether a constant value can reach the intrinsic call than on the precision used to fold it. Since these approx intrinsics require `--use_fast_math`, tests that exercise them should already compare with a precision tolerance, because FMA fusion etc. already differs between optimization levels under `--use_fast_math`.
- The PTX spec gives error bounds for the PTX instruction that each of these intrinsics maps to, which should give a good estimate of how much the constant-folded and device-executed results may differ:
    - [cos.approx](https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions-cos) and [sin.approx](https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions-sin):
        - The maximum absolute error over input range is as follows:
            - [ -2pi .. 2pi ] = 2^-20.5
            - [ -100pi .. +100pi ] = 2^-14.7
            - Outside [ -100pi .. +100pi ] = best-effort only, no guarantees.
    - [ex2.approx](https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions-ex2):
        - The maximum ulp error is 2 ulp from correctly rounded result across the full range of inputs.
    - [lg2.approx](https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions-lg2):
        - The maximum absolute error is 2^-22 when the input operand is in the range (0.5, 2).
        - For positive finite inputs outside of this interval, maximum relative error is 2^-22.
    - [rsqrt.approx](https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions-rsqrt):
        - The maximum relative error for rsqrt.f32 over the entire positive finite floating-point range is 2^-22.9
    - [sqrt.approx](https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions-sqrt):
        - The maximum relative error over the entire positive finite floating-point range is 2^-23
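
To make the "precision tolerance" point concrete, here is a minimal sketch of how a test could check a `cos.approx` result against the absolute-error bounds listed above. It is not part of this patch: the helper names `trigApproxMaxAbsError` and `withinCosApproxBound` are hypothetical, and it assumes the host `std::cos` evaluated in double precision is an adequate reference value.

```cpp
#include <cmath>

// Hypothetical helper: maximum absolute error quoted above for
// cos.approx / sin.approx at input X. Returns a negative value for inputs
// outside [-100pi, +100pi], where the spec is best-effort only.
static double trigApproxMaxAbsError(double X) {
  const double Pi = 3.14159265358979323846;
  const double AbsX = std::fabs(X);
  if (AbsX <= 2.0 * Pi)
    return std::exp2(-20.5);
  if (AbsX <= 100.0 * Pi)
    return std::exp2(-14.7);
  return -1.0; // No guarantee in this range.
}

// Hypothetical helper: true if Observed (a constant-folded or
// device-computed value) is within the quoted bound of a double-precision
// host reference (assumption: the host cos is accurate enough for this).
static bool withinCosApproxBound(float X, float Observed) {
  const double Bound = trigApproxMaxAbsError(X);
  if (Bound < 0.0)
    return true; // Nothing to check where no bound is given.
  const double Reference = std::cos(static_cast<double>(X));
  return std::fabs(static_cast<double>(Observed) - Reference) <= Bound;
}
```

A test would call `withinCosApproxBound(x, result)` on both the folded and the device-executed value instead of comparing the two for bit equality. The same pattern would apply to the relative-error and ulp bounds of the other instructions.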

https://github.com/llvm/llvm-project/pull/141233