[libclc] [libclc] Use unchecked division for f64 AMDGPU reciprocal (PR #203809)
via cfe-commits
cfe-commits at lists.llvm.org
Sun Jun 14 19:00:23 PDT 2026
llvmorg-github-actions[bot] wrote:
@llvm/pr-subscribers-backend-amdgpu
Author: Joseph Huber (jhuber6)
<details>
<summary>Changes</summary>
Summary:
This matches what the AMD device libraries do. We can avoid the extra
steps of a fully checked division by refining the hardware reciprocal
estimate with only two Newton-Raphson iterations for 1 / x. The
exceptional cases should not appear in these math functions, and the
change is local to AMDGPU. I verified that the results are bitwise
identical to the AMD math functions, and performance is now at parity.
---
Full diff: https://github.com/llvm/llvm-project/pull/203809.diff
1 Files Affected:
- (modified) libclc/clc/lib/amdgpu/math/clc_recip_fast.inc (+8-1)
``````````diff
diff --git a/libclc/clc/lib/amdgpu/math/clc_recip_fast.inc b/libclc/clc/lib/amdgpu/math/clc_recip_fast.inc
index 9d635cc700442..e19ec82d7566d 100644
--- a/libclc/clc/lib/amdgpu/math/clc_recip_fast.inc
+++ b/libclc/clc/lib/amdgpu/math/clc_recip_fast.inc
@@ -6,10 +6,17 @@
//
//===----------------------------------------------------------------------===//
-// On AMDGPU the "fast" reciprocal is the hardware v_rcp_f32 approximation,
+// On AMDGPU the "fast" reciprocal is the hardware v_rcp approximation.
_CLC_OVERLOAD _CLC_DEF __CLC_GENTYPE __clc_recip_fast(__CLC_GENTYPE x) {
#if defined(__CLC_SCALAR) && __CLC_FPSIZE == 32
return __builtin_amdgcn_rcpf(x);
+#elif defined(__CLC_SCALAR) && __CLC_FPSIZE == 64
+ // Hardware v_rcp_f64 seed refined with two Newton-Raphson iterations. This
+ // computes 1.0 / x without the full IEEE scaling and subnormal fixups.
+ __CLC_GENTYPE r = __builtin_amdgcn_rcp(x);
+ r = __builtin_fma(__builtin_fma(-x, r, 1.0), r, r);
+ r = __builtin_fma(__builtin_fma(-x, r, 1.0), r, r);
+ return r;
#else
return ((__CLC_GENTYPE)1.0) / x;
#endif
``````````
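
As a rough justification for stopping at two iterations, the standard Newton-Raphson error argument can be written out as follows. It assumes exact arithmetic and ignores rounding; the symbols $r_n$ (the estimate after $n$ steps) and $e_n$ (its residual) are notation introduced here, not taken from the patch.

$$
e_n = 1 - x\,r_n, \qquad r_{n+1} = r_n + e_n r_n
\;\Longrightarrow\;
e_{n+1} = 1 - x\,r_n\,(1 + e_n) = 1 - (1 - e_n)(1 + e_n) = e_n^2 .
$$

Each `fma(fma(-x, r, 1.0), r, r)` line therefore squares the residual, so a seed with residual $e_0$ ends at roughly $e_0^4$ after the two steps, before rounding effects are taken into account.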
</details>
https://github.com/llvm/llvm-project/pull/203809