[libc] [llvm] [libc][math][c23] Add rsqrtf16() function (PR #137545)
via llvm-commits
llvm-commits at lists.llvm.org
Sat Sep 13 09:40:13 PDT 2025
lntue wrote:
> > > Trying to figure out the best option for computing the result, I found that the current polynomial produces the fewest errors (larger ones yield only negligible improvement): `P = fpminimax(1/sqrt(x), [|0,1,2,3,4,5|], [|SG...|], [0.5, 1]);` It has a ULP error of 1.0.
> > > Also found this already existing implementation:
> > > https://github.com/llvm/llvm-project/blob/ae6b4b23ea4291e937192a3c08d0f3c9835864c2/libc/src/__support/fixed_point/sqrt.h#L39
> > >
> > > It has some other interesting points that I came across while doing my research: specifically, Newton's method.
> > > Update: I tried adding 2 iterations of Newton's method. Each one significantly reduced the number of errors, but some remain.
> >
> >
> > Can you compare the performance of this with
> > ```
> > fputil::cast<float16>(1.0f / fputil::sqrt(fputil::cast<float>(x)));
> > ```
>
> I wrote this test to check the performance of the implementation and ran the tests for rsqrtf16 a few times:
>
> ```
> TEST_F(LlvmLibcRsqrtf16Test, PositiveRange_OneOverSqrtFputil) {
>   for (uint16_t v = POS_START; v <= POS_STOP; ++v) {
>     float16 x = FPBits(v).get_val();
>
>     float16 y = LIBC_NAMESPACE::fputil::cast<float16, float>(
>         1.0f / LIBC_NAMESPACE::fputil::sqrt<float, float>(
>                    LIBC_NAMESPACE::fputil::cast<float, float16>(x)));
>
>     EXPECT_MPFR_MATCH_ALL_ROUNDING(mpfr::Operation::Rsqrt, x, y, 1.0);
>   }
> ```
>
> Turns out that my implementation is ~3x slower than just directly calling `1.0f / fputil::sqrt` :/ I'm not sure why that is - too many branches, or an over-complicated approximation. The one you see is the most minimal I have been able to derive so far - I started with a degree-7 polynomial and 2 iterations of Newton's method and reduced it to degree 5 and 1 iteration. What do you think?
That is actually expected: single/double-precision division and square root are quite efficient on modern hardware.
See, for example, the Zen3 numbers in https://www.agner.org/optimize/instruction_tables.pdf:
SQRTSS and DIVSS latencies are 14 and 10.5 clocks respectively, while ADDSS/MULPS and VFMA are 3 and 4 clocks.
So unless you can reduce the computation to maybe 3-4 multiply-adds, the extra logic around it - branching, exponent reduction, ... - will make it slower than the straightforward sqrt + div in single precision.
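To put numbers on that budget: SQRTSS + DIVSS is roughly 14 + 10.5 ≈ 24.5 clocks of dependent latency, versus a chain of 3-4-clock multiplies/FMAs for the polynomial route. A minimal sketch of what a single refinement step costs, assuming an initial estimate `y0` is already in hand (the helper name is hypothetical, not code from this PR):

```
float rsqrt_refine(float x, float y0) {
  // One Newton-Raphson step for 1/sqrt(x): y1 = y0 * (1.5 - 0.5 * x * y0^2).
  // Dependent chain: mul, mul, FMA, mul ~= 3 + 3 + 4 + 3 = 13 clocks on
  // Zen3 - before any branching or exponent handling is added on top.
  float hxy = 0.5f * x * y0;      // 0.5 * x * y0
  return y0 * (1.5f - hxy * y0);  // y0 * (1.5 - (0.5 * x * y0) * y0)
}
```

And that is only one step; once the initial polynomial estimate and the float16 special-case handling are stacked on top, the total easily exceeds the sqrt + div path, which matches the ~3x slowdown observed above.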
For rsqrtf16, the Newton-Raphson method will be better than sqrt + div on targets without single-precision hardware, such as some embedded systems. But in that case, you will need to implement the polynomial approximation + Newton-Raphson in integer / fixed-point arithmetic to gain the efficiency.
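As a rough illustration of that fixed-point direction (a minimal sketch under assumed conventions: UQ16.16 values, the input already range-reduced into [0.25, 1), and a hypothetical helper name - none of this is from the PR):

```
#include <cstdint>

// One Newton-Raphson step for y ~ 1/sqrt(x) in UQ16.16 fixed point:
//   y' = y * (3 - x * y^2) / 2
// Range reduction uses rsqrt(2^(2k) * m) = 2^(-k) * rsqrt(m), so for
// m in [0.25, 1) the result stays in (1, 2] and fits in UQ16.16.
static uint32_t rsqrt_newton_step_q16(uint32_t x, uint32_t y) {
  uint64_t y2 = (static_cast<uint64_t>(y) * y) >> 16; // y^2      (UQ16.16)
  uint64_t xy2 = (x * y2) >> 16;                      // x * y^2  (UQ16.16)
  uint64_t t = (3ULL << 16) - xy2;                    // 3 - x * y^2
  return static_cast<uint32_t>((y * t) >> 17);        // y * t / 2
}
```

Each step roughly doubles the number of correct bits, so a small table or low-degree polynomial providing ~5-6 good bits followed by two such steps should cover float16's 11-bit significand without touching the FPU.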
https://github.com/llvm/llvm-project/pull/137545