[libc-commits] [PATCH] D120579: Use __builtin_clz to find leading 1 in generic sqrt (where possible)
Clint Caywood via Phabricator via libc-commits
libc-commits at lists.llvm.org
Mon Feb 28 12:16:17 PST 2022
cratonica added inline comments.
================
Comment at: libc/src/__support/FPUtil/generic/sqrt.h:66
inline void normalize<long double>(int &exponent, __uint128_t &mantissa) {
- // Use binary search to shift the leading 1 bit similar to float.
- // With MantissaWidth<long double> = 112, it will take
- // ceil(log2(112)) = 7 steps checking the mantissa bits.
+ // There is no __builtin_clz for 128-bit integers, so use binary search to
+ // shift the leading 1 bit. With MantissaWidth<long double> = 112, it will
----------------
cratonica wrote:
> cratonica wrote:
> > lntue wrote:
> > > Can you do a simple perf test to see if using clz for 64 bits is faster than binary search? Something like:
> > >
> > >
> > > ```
> > > uint64_t hi_bits = static_cast<uint64_t>(mantissa >> 64);
> > > int shift = hi_bits ? (clz(hi_bits) - 15) : (clz(static_cast<uint64_t>(mantissa)) + 49);
> > > exponent -= shift;
> > > mantissa <<= shift;
> > > ```
> > # Unfortunately there aren't any differential or performance tests for sqrt or sqrtl, so I'll need to add those first in a separate PR. It won't be too much work, just clone the ones for sqrtf.
> > # Also, I don't actually have a machine on which I can test 128-bit floats -- my machine uses the 80-bit x87 format, and aarch64 uses 64-bit for long double.
> >
> >
> Following up here:
>
> re: #1: It seems as though we only have perf tests for the float32 variants of math functions due to the logarithmic increase in the domain required for 64-bit inputs. Trying to run an exhaustive performance test using float64 never completed even after an hour of waiting, and I can't imagine an exhaustive test for 128-bit inputs would complete even after days. So I'm going to write a one-off performance test that terminates after 2^24 iterations to test this, but I won't be checking it in
>
> re: #2: The x87 8-bit variant uses a 64-bit mantissa, which means that __bulting_clzll is will still work after truncation, so this is trivial to implement (and I have confirmed a slight performance increase here using the method mentioned above). I was incorrect that aarch64 uses 64-bit floats for long double, and I have access to some hardware with an aarch64 Cortex-A53 that I can run these performance test on with the changes you mentioned. If performance is improved, then I will update the patch accordingly.
Results for 128-bit floats (long double) from the aarch64 Cortex-A53 core in denormal range:
clz: 28894915309ns
binary search: 29253929397ns
So, just over a 1% performance improvement, which is in-line with what I'm seeing on the 32-bit float sqrtf function.
Therefore, I'm patching in that change (as well as the x87 80-bit specialization).
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D120579/new/
https://reviews.llvm.org/D120579
More information about the libc-commits
mailing list