[libc-commits] [libc] [libc][math] Fix incorrect logic in fputil::generic::add_or_sub (PR #116129)

Wed Nov 13 16:34:17 PST 2024

================
@@ -160,20 +160,21 @@ add_or_sub(InType x, InType y) {
   } else {
     InStorageType max_mant = max_bits.get_explicit_mantissa() << GUARD_BITS_LEN;
     InStorageType min_mant = min_bits.get_explicit_mantissa() << GUARD_BITS_LEN;
-    int alignment =
-        max_bits.get_biased_exponent() - min_bits.get_biased_exponent();
+
+    int alignment = (max_bits.get_biased_exponent() - max_bits.is_normal()) -
+                    (min_bits.get_biased_exponent() - min_bits.is_normal());
----------------
overmighty wrote:

The formula given in section 9.2.3.2 of *Handbook of Floating-Point Arithmetic* is $\delta = (E_x - n_x) - (E_y - n_y)$. When I implemented `fputil::generic::add_or_sub`, I asked myself why it wasn't just $\delta = E_x - E_y$, and ended up using that instead of the formula given in the book. Today I remembered asking myself that question, so I thought about it again and now it's obvious.
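
One likely reading of why the book's formula is needed, offered as a sketch rather than as the PR's stated reasoning: a subnormal number stores a biased exponent of 0, but its significand is scaled by the same power of two as a number with biased exponent 1. Subtracting the normal bit $n$ from $E$ puts both on a common scale, so the naive $E_x - E_y$ shifts one bit too far whenever exactly one operand is subnormal. The snippet below works on raw IEEE 754 binary32 bit patterns; `biased_exponent`, `normal_bit`, `naive_alignment`, `corrected_alignment`, and `check_examples` are hypothetical helpers for illustration, not the `FPBits` API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers on IEEE 754 binary32 bit patterns (not the FPBits API).
// Infinities and NaNs are assumed to have been handled before this point.

// Extract the 8-bit biased exponent field.
constexpr int biased_exponent(std::uint32_t bits) {
  return static_cast<int>((bits >> 23) & 0xFFu);
}

// 1 for normal numbers (implicit leading bit set), 0 for zero and subnormals.
constexpr int normal_bit(std::uint32_t bits) {
  return biased_exponent(bits) != 0 ? 1 : 0;
}

// Naive shift amount: delta = E_x - E_y.
constexpr int naive_alignment(std::uint32_t max_bits, std::uint32_t min_bits) {
  return biased_exponent(max_bits) - biased_exponent(min_bits);
}

// Shift amount from the book's formula: delta = (E_x - n_x) - (E_y - n_y).
constexpr int corrected_alignment(std::uint32_t max_bits,
                                  std::uint32_t min_bits) {
  return (biased_exponent(max_bits) - normal_bit(max_bits)) -
         (biased_exponent(min_bits) - normal_bit(min_bits));
}

inline void check_examples() {
  constexpr std::uint32_t ONE = 0x3F800000u;           // 1.0, E = 127
  constexpr std::uint32_t MIN_NORMAL = 0x00800000u;    // 1.0 * 2^-126, E = 1
  constexpr std::uint32_t MAX_SUBNORMAL = 0x007FFFFFu; // 0.11...1 * 2^-126, E = 0

  // Both significands are scaled by 2^-126, so no shift is needed,
  // but the naive formula asks for a shift of 1.
  assert(naive_alignment(MIN_NORMAL, MAX_SUBNORMAL) == 1);
  assert(corrected_alignment(MIN_NORMAL, MAX_SUBNORMAL) == 0);

  // When both operands are normal, the two formulas agree.
  assert(naive_alignment(ONE, MIN_NORMAL) == 126);
  assert(corrected_alignment(ONE, MIN_NORMAL) == 126);
}
```

Calling `check_examples()` exercises the one case where the two formulas disagree (a normal operand paired with a subnormal one) and a case where they coincide.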

https://github.com/llvm/llvm-project/pull/116129