[libc-commits] [libc] [llvm] [libc][math][c23] Add rsqrtf16() function (PR #137545)

Tue Sep 16 15:49:55 PDT 2025

amemov wrote:

For the record: the implementation had to change because it was far slower than calling the hardware-specific instructions. Below are the results of comparing four implementations using Google Benchmark and the [suite](https://blog.llvm.org/posts/2025-08-29-gsoc-profiling-and-testing-math-functions-on-gpus/) introduced as part of GSoC 2025:
```
Run on (14 X 4900 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x7)
L1 Instruction 64 KiB (x7)
L2 Unified 2048 KiB (x7)
L3 Unified 12288 KiB (x1)
Load Average: 1.12, 1.53, 1.24
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-- LOG(0): Value passed to --benchmark_min_time should have a suffix. Eg., `30s` for 30-seconds.
---------------------------------------------------------------------
Benchmark           Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------
BM_Current     632704 ns       629457 ns         2220 items_per_second=50.4324M/s
BM_Impl1      1058891 ns      1053017 ns         1339 items_per_second=30.1467M/s
BM_Impl2       686397 ns       684024 ns         2010 items_per_second=46.4092M/s
BM_ViaSqrt     402577 ns       401175 ns         3345 items_per_second=79.13M/s
```
Here `Current` is what I wrote locally (not pushed to the PR), `Impl2` is similar to `Current` but uses a slightly larger polynomial, `Impl1` is what is on GitHub in the previous commit, and `ViaSqrt` is `LIBC_NAMESPACE::fputil::cast<float16>(1.0f / LIBC_NAMESPACE::fputil::sqrt<float>(LIBC_NAMESPACE::fputil::cast<float>(x)));`
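
For readers without the libc tree at hand, the shape of `ViaSqrt` can be sketched in standard C++. This is only an illustration, not the code in the PR: it stays in `float` and uses `std::sqrt` instead of `fputil::sqrt`, it omits the `float16` casts, and the name `rsqrt_via_sqrt` is made up for this example.

```cpp
#include <cmath>

// Illustrative stand-in for the ViaSqrt expression: one square root
// followed by one division. The PR version takes a float16 input, widens it
// to float with fputil::cast, and narrows the result back to float16; this
// sketch stays in float so it builds without _Float16 support.
// rsqrt_via_sqrt is a hypothetical name, not an LLVM libc function.
inline float rsqrt_via_sqrt(float x) {
  return 1.0f / std::sqrt(x);
}
```

Because the real function rounds the `float` result down to `float16`, a small number of inputs can land on the wrong side of a rounding boundary, which this sketch does not model.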

However, `ViaSqrt` as described above did not meet the libc correctness requirements for two mantissas: 0x0313 and 0x011F. I therefore added a small correction at the end. With this change, the implementation is slightly slower than calling `ViaSqrt` directly (by ~5000 ns), but still about 200000 ns faster than my best implementation (`Current`).

That said, some work remains for the future: if the hardware provides a sqrt instruction, it should be used; otherwise, targets that don't have `LIBC_TARGET_CPU_HAS_FPU_FLOAT` should use an integer-based approximation.

https://github.com/llvm/llvm-project/pull/137545