[llvm] Add constant-folding for unary NVVM intrinsics (PR #141233)
Lewis Crawford via llvm-commits
llvm-commits at lists.llvm.org
Mon Jul 14 02:07:37 PDT 2025
================
@@ -2548,6 +2653,170 @@ static Constant *ConstantFoldScalarCall1(StringRef Name,
return ConstantFoldFP(atan, APF, Ty);
case Intrinsic::sqrt:
return ConstantFoldFP(sqrt, APF, Ty);
+
+ // NVVM Intrinsics:
+ case Intrinsic::nvvm_ceil_ftz_f:
+ case Intrinsic::nvvm_ceil_f:
+ case Intrinsic::nvvm_ceil_d:
+ return ConstantFoldFP(
+ ceil, APF, Ty,
+ nvvm::GetNVVMDenormMode(
+ nvvm::UnaryMathIntrinsicShouldFTZ(IntrinsicID)));
+
+ case Intrinsic::nvvm_cos_approx_ftz_f:
+ case Intrinsic::nvvm_cos_approx_f:
+ return ConstantFoldFP(
+ cos, APF, Ty,
+ nvvm::GetNVVMDenormMode(
+ nvvm::UnaryMathIntrinsicShouldFTZ(IntrinsicID)));
+
+ case Intrinsic::nvvm_ex2_approx_ftz_f:
+ case Intrinsic::nvvm_ex2_approx_d:
+ case Intrinsic::nvvm_ex2_approx_f:
+ return ConstantFoldFP(
+ exp2, APF, Ty,
+ nvvm::GetNVVMDenormMode(
+ nvvm::UnaryMathIntrinsicShouldFTZ(IntrinsicID)));
----------------
LewisCrawford wrote:
You're right about -O0 not folding. There are no optimization-level checks in ConstantFolding.cpp, and IRBuilder folds many instructions by default as they are created, but it looks like -O0 avoids most constant-folding because it does not run InstCombine and because the initial parser creates instructions directly rather than going through IRBuilder.
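For concreteness, here is a rough sketch of the kind of IR this patch affects (illustrative only; the wrapper function and the constant are made up, not taken from the patch or its tests):

```llvm
; With this patch, passes that invoke the constant folder (e.g. InstCombine
; at -O1 and above) can fold this call on the host, whereas at -O0 the call
; survives and executes on the device.
declare float @llvm.nvvm.ceil.f(float)

define float @example() {
  %r = call float @llvm.nvvm.ceil.f(float 1.25)
  ; folds to: ret float 2.000000e+00
  ret float %r
}
```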
I agree that, technically, the semantics of the front-end shouldn't matter here, as any front-end can theoretically generate these. However, I cannot find any documentation for `nvvm.sin.approx` at the IR level, so I am left to infer its semantics from the CUDA spec, the NVCC behaviour that generates it, and the PTX spec for `sin.approx`, which states that it "implements a fast approximation to sine" and only gives maximum error bounds. From all of these sources, I'd argue that the "approx" in "nvvm.sin.approx" is essentially the same as having the `afn` fast-math flag encoded into the intrinsic's name. This always implies that users of the intrinsic cannot rely on its precision and prefer speed over accuracy. Front-ends should only generate these fast "`approx`" intrinsics in cases where a fast, imprecise approximation is expected.
Would the difference between unfolded -O0 and folded -O3 results be OK if there were an explicit `afn` fast-math flag attached to these intrinsics? Is having the "approx" included in the intrinsic name drastically different from using the separate `afn` flag?
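To make that comparison concrete, here is roughly what I mean (illustrative IR, not from the patch; the wrapper function is made up):

```llvm
declare float @llvm.nvvm.sin.approx.f(float)
declare float @llvm.sin.f32(float)

define float @variants(float %x) {
  ; the approximation contract carried in the intrinsic's name:
  %a = call float @llvm.nvvm.sin.approx.f(float %x)
  ; roughly the same permission expressed as a fast-math flag on the
  ; generic intrinsic:
  %b = call afn float @llvm.sin.f32(float %x)
  %r = fadd float %a, %b
  ret float %r
}
```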
> It may be OK, but I do not know where is the boundary on what's tolerable in this case. I'm biased towards keeping the default behavior unchanged and enable imprecise folding only if we have an unambiguous signal that it's OK to do so.
I'm also not sure where the boundary of what's tolerable lies in this case. I'd argue that the "`approx`" in the name of the intrinsics is enough of a signal, but I'm probably biased in the direction of folding as much as possible. If anyone has more expertise on what an acceptable deviation would be between device-side execution at -O0 and host-side folding at -O3, I'd be fine with adding some limitations, e.g. only folding over a restricted input range within error tolerances if necessary.
https://github.com/llvm/llvm-project/pull/141233