[libc] [llvm] [libc][math][c23] Add rsqrtf16() function (PR #137545)
via llvm-commits
llvm-commits at lists.llvm.org
Sat Sep 13 09:40:13 PDT 2025
lntue wrote:
> > > Trying to figure out the best option for computing the result, I found that the current polynomial produces the fewest errors (larger ones yield only negligible improvement): `P = fpminimax(1/sqrt(x), [|0,1,2,3,4,5|], [|SG...|], [0.5, 1]);` It has a ULP error of 1.0.
> > > Also found this already existing implementation:
> > > https://github.com/llvm/llvm-project/blob/ae6b4b23ea4291e937192a3c08d0f3c9835864c2/libc/src/__support/fixed_point/sqrt.h#L39
> > >
> > > It has some other interesting points that I came across while doing my research: specifically, Newton's method.
> > > Update: I tried adding 2 iterations of Newton's method. Each one significantly reduced the number of errors, but some remain.
> >
> >
> > Can you compare the performance of this with
> > ```
> > fputil::cast<float16>(1.0f / fputil::sqrt(fputil::cast<float>(x)));
> > ```
>
> I wrote this test to check the performance of the implementation and ran the tests for rsqrtf16 a few times:
>
> ```
> TEST_F(LlvmLibcRsqrtf16Test, PositiveRange_OneOverSqrtFputil) {
>   for (uint16_t v = POS_START; v <= POS_STOP; ++v) {
>     float16 x = FPBits(v).get_val();
>
>     float16 y = LIBC_NAMESPACE::fputil::cast<float16, float>(
>         1.0f / LIBC_NAMESPACE::fputil::sqrt<float, float>(
>                    LIBC_NAMESPACE::fputil::cast<float, float16>(x)));
>
>     EXPECT_MPFR_MATCH_ALL_ROUNDING(mpfr::Operation::Rsqrt, x, y, 1.0);
>   }
> ```
>
> Turns out that my implementation is ~3x slower than just directly calling `1.0f / fputil::sqrt` :/ I'm not sure why that is - too many branches, or an over-complicated approximation. The one you see is the most minimal I have been able to derive so far - I started with a degree-7 polynomial and 2 iterations of Newton's method and reduced it to degree 5 and 1 iteration. What do you think?
That is actually expected: single/double-precision division and square root are quite efficient on modern hardware.
See, for example, the Zen3 numbers in https://www.agner.org/optimize/instruction_tables.pdf:
SQRTSS and DIVSS latencies are 14 and 10.5 clocks respectively, while ADDSS/MULPS and VFMA are 3 and 4 clocks.
So unless you can reduce the computation to maybe 3-4 multiply-adds, the extra logic around it - branching, exponent reduction, ... - will make it slower than the straightforward sqrt + div in single precision.
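To put numbers on that budget: SQRTSS + DIVSS is roughly 14 + 10.5 ≈ 24.5 clocks of dependent latency, versus a chain of 3-4-clock multiplies/FMAs for the polynomial route. A minimal sketch of what a single refinement step costs, assuming an initial estimate `y0` is already in hand (the helper name is hypothetical, not code from this PR):

```
float rsqrt_refine(float x, float y0) {
  // One Newton-Raphson step for 1/sqrt(x): y1 = y0 * (1.5 - 0.5 * x * y0^2).
  // Dependent chain: mul, mul, FMA, mul ~= 3 + 3 + 4 + 3 = 13 clocks on
  // Zen3 - before any branching or exponent handling is added on top.
  float hxy = 0.5f * x * y0;      // 0.5 * x * y0
  return y0 * (1.5f - hxy * y0);  // y0 * (1.5 - (0.5 * x * y0) * y0)
}
```

And that is only one step; once the initial polynomial estimate and the float16 special-case handling are stacked on top, the total easily exceeds the sqrt + div path, which matches the ~3x slowdown observed above.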
For rsqrtf16, the Newton-Raphson method will be better than sqrt + div on targets without single-precision hardware, such as some embedded systems. But in that case, you will need to implement the polynomial approximation + Newton-Raphson in integer / fixed-point arithmetic to gain the efficiency.
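As a rough illustration of that fixed-point direction (a minimal sketch under assumed conventions: UQ16.16 values, the input already range-reduced into [0.25, 1), and a hypothetical helper name - none of this is from the PR):

```
#include <cstdint>

// One Newton-Raphson step for y ~ 1/sqrt(x) in UQ16.16 fixed point:
//   y' = y * (3 - x * y^2) / 2
// Range reduction uses rsqrt(2^(2k) * m) = 2^(-k) * rsqrt(m), so for
// m in [0.25, 1) the result stays in (1, 2] and fits in UQ16.16.
static uint32_t rsqrt_newton_step_q16(uint32_t x, uint32_t y) {
  uint64_t y2 = (static_cast<uint64_t>(y) * y) >> 16; // y^2      (UQ16.16)
  uint64_t xy2 = (x * y2) >> 16;                      // x * y^2  (UQ16.16)
  uint64_t t = (3ULL << 16) - xy2;                    // 3 - x * y^2
  return static_cast<uint32_t>((y * t) >> 17);        // y * t / 2
}
```

Each step roughly doubles the number of correct bits, so a small table or low-degree polynomial providing ~5-6 good bits followed by two such steps should cover float16's 11-bit significand without touching the FPU.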
https://github.com/llvm/llvm-project/pull/137545