[libc-commits] [libc] e6226e6 - [libc][math] Improve exp2f performance.

Wed Sep 14 11:45:02 PDT 2022

Author: Tue Ly
Date: 2022-09-14T14:44:25-04:00
New Revision: e6226e6b7234e5fce5a928861541cec3a519f379

URL: https://github.com/llvm/llvm-project/commit/e6226e6b7234e5fce5a928861541cec3a519f379
DIFF: https://github.com/llvm/llvm-project/commit/e6226e6b7234e5fce5a928861541cec3a519f379.diff

LOG: [libc][math] Improve exp2f performance.

Reduce the number of subintervals that need lookup table and optimize
the evaluation steps.

Currently, `exp2f` is computed by reducing to `2^hi * 2^mid * 2^lo` where
`-16/32 <= mid <= 15/32` and `-1/64 <= lo <= 1/64`, and `2^lo` is then
approximated by a degree 6 polynomial.

Experiment with Sollya showed that by using a degree 6 polynomial, we
can approximate `2^lo` for a bigger range with reasonable errors:
```
> P = fpminimax((2^x - 1)/x, 5, [|D...|], [-1/64, 1/64]);
> dirtyinfnorm(2^x - 1 - x*P, [-1/64, 1/64]);
0x1.e18a1bc09114def49eb851655e2e5c4dd08075ac2p-63

> P = fpminimax((2^x - 1)/x, 5, [|D...|], [-1/32, 1/32]);
> dirtyinfnorm(2^x - 1 - x*P, [-1/32, 1/32]);
0x1.05627b6ed48ca417fe53e3495f7df4baf84a05e2ap-56
```
So we can optimize the implementation a bit with:
# Reduce the range to `mid = i/16` for `i = 0..15` and `-1/32 <= lo <= 1/32`
# Store the table `2^mid` in bits, and add `hi` directly to its exponent field to compute `2^hi * 2^mid`
# Rearrange the order of evaluating the polynomial approximating `2^lo`.

Performance benchmark using perf tool from the CORE-MATH project on Ryzen 1700:
```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp2f
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput   : 9.534
System LIBC reciprocal throughput : 6.229

BEFORE:
LIBC reciprocal throughput        : 21.405
LIBC reciprocal throughput        : 15.241    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 11.111    (with `-mfma` flag)

AFTER:
LIBC reciprocal throughput        : 18.617
LIBC reciprocal throughput        : 12.852    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 9.253     (with `-mfma` flag)

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp2f --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency   : 40.869
System LIBC latency : 30.580

BEFORE
LIBC latency        : 64.888
LIBC latency        : 61.027    (with `-msse4.2` flag)
LIBC latency        : 48.778    (with `-mfma` flag)

AFTER
LIBC latency        : 48.803
LIBC latency        : 45.047    (with `-msse4.2` flag)
LIBC latency        : 37.487    (with `-mfma` flag)
```

Reviewed By: sivachandra, orex

Differential Revision: https://reviews.llvm.org/D133870

Added: 
    

Modified: 
    libc/docs/math.rst
    libc/src/math/generic/CMakeLists.txt
    libc/src/math/generic/exp2f.cpp
    libc/src/math/generic/explogxf.h
    libc/test/src/math/exhaustive/exp2f_test.cpp
    libc/test/src/math/explogxf_test.cpp

Removed: 
    


################################################################################
diff  --git a/libc/docs/math.rst b/libc/docs/math.rst
index 55c8748b20b57..3171f01aabc33 100644

--- a/libc/docs/math.rst
+++ b/libc/docs/math.rst
@@ -219,9 +219,7 @@ Performance
 +--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
 | expf         |         9 |                 7 |        44 |                38 | :math:`[-10, 10]`                   | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA           |
 +--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
-| exp2f        |         8 |                 6 |        35 |                23 | :math:`[-10, 10]`                   | i5-1135G7  | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA           |
-+              +-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
-|              |        11 |                 6 |        49 |                31 | :math:`[-10, 10]`                   | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA           |
+| exp2f        |         9 |                 6 |        37 |                31 | :math:`[-10, 10]`                   | Ryzen 1700 | Ubuntu 22.04 LTS x86_64 | Clang 14.0.0 | FMA           |
 +--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
 | expm1f       |         9 |                44 |        42 |               121 | :math:`[-10, 10]`                   | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA           |
 +--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+

diff  --git a/libc/src/math/generic/CMakeLists.txt b/libc/src/math/generic/CMakeLists.txt
index ffc06c8239445..77aaa3dd3e2a4 100644
--- a/libc/src/math/generic/CMakeLists.txt
+++ b/libc/src/math/generic/CMakeLists.txt
@@ -549,8 +549,6 @@ add_entrypoint_object(
   HDRS
     ../exp2f.h
   DEPENDS
-    .common_constants
-    .explogxf
     libc.src.__support.FPUtil.fenv_impl
     libc.src.__support.FPUtil.fp_bits
     libc.src.__support.FPUtil.multiply_add

diff  --git a/libc/src/math/generic/exp2f.cpp b/libc/src/math/generic/exp2f.cpp
index 5d6b6bea1f22b..913e6b743f679 100644
--- a/libc/src/math/generic/exp2f.cpp
+++ b/libc/src/math/generic/exp2f.cpp
@@ -7,8 +7,6 @@
 //===----------------------------------------------------------------------===//
 
 #include "src/math/exp2f.h"
-#include "common_constants.h"
-#include "explogxf.h"
 #include "src/__support/FPUtil/FEnvImpl.h"
 #include "src/__support/FPUtil/FPBits.h"
 #include "src/__support/FPUtil/PolyEval.h"
@@ -24,6 +22,17 @@ constexpr uint32_t exval1 = 0x3b42'9d37U;
 constexpr uint32_t exval2 = 0xbcf3'a937U;
 constexpr uint32_t exval_mask = exval1 & exval2;
 
+// Look up table for bit fields of 2^(i/16) for i = 0..15, generated by Sollya
+// with:
+// > for i from 0 to 15 do printdouble(round(2^(i/16), D, RN));
+constexpr int64_t EXP_2_M[16] = {
+    0x3ff0000000000000, 0x3ff0b5586cf9890f, 0x3ff172b83c7d517b,
+    0x3ff2387a6e756238, 0x3ff306fe0a31b715, 0x3ff3dea64c123422,
+    0x3ff4bfdad5362a27, 0x3ff5ab07dd485429, 0x3ff6a09e667f3bcd,
+    0x3ff7a11473eb0187, 0x3ff8ace5422aa0db, 0x3ff9c49182a3f090,
+    0x3ffae89f995ad3ad, 0x3ffc199bdd85529c, 0x3ffd5818dcfba487,
+    0x3ffea4afa2a490da};
+
 LLVM_LIBC_FUNCTION(float, exp2f, (float x)) {
   using FPBits = typename fputil::FPBits<float>;
   FPBits xbits(x);
@@ -31,12 +40,14 @@ LLVM_LIBC_FUNCTION(float, exp2f, (float x)) {
   uint32_t x_u = xbits.uintval();
   uint32_t x_abs = x_u & 0x7fff'ffffU;
 
-  // // When |x| >= 128, |x| < 2^-25, or x is nan
-  if (unlikely(x_abs >= 0x4300'0000U || x_abs <= 0x3280'0000U)) {
-    // |x| < 2^-25
-    if (x_abs <= 0x3280'0000U) {
-      return 1.0f + x;
-    }
+  // |x| < 2^-25
+  if (unlikely(x_abs <= 0x3280'0000U)) {
+    return 1.0f + x;
+  }
+
+  // // When |x| >= 128, or x is nan
+  if (unlikely(x_abs >= 0x4300'0000U)) {
+
     // x >= 128
     if (!xbits.get_sign()) {
       // x is finite
@@ -50,7 +61,7 @@ LLVM_LIBC_FUNCTION(float, exp2f, (float x)) {
       // x is +inf or nan
       return x + FPBits::inf().get_val();
     }
-    // x < -150
+    // x <= -150
     if (x_u >= 0xc316'0000U) {
       // exp(-Inf) = 0
       if (xbits.is_inf())
@@ -66,6 +77,7 @@ LLVM_LIBC_FUNCTION(float, exp2f, (float x)) {
     }
   }
 
+  // Check exceptional values.
   if (unlikely(x_u & exval_mask) == exval_mask) {
     if (unlikely(x_u == exval1)) { // x = 0x1.853a6ep-9f
       if (fputil::get_round() == FE_TONEAREST)
@@ -76,7 +88,52 @@ LLVM_LIBC_FUNCTION(float, exp2f, (float x)) {
     }
   }
 
-  return exp2_eval(x);
+  // For -150 < x < 128, to compute 2^x, we perform the following range
+  // reduction: find hi, mid, lo such that:
+  //   x = hi + mid + lo, in which
+  //     hi is an integer,
+  //     0 <= mid * 2^4 < 16 is an integer
+  //     -2^(-5) <= lo <= 2^-5.
+  // In particular,
+  //   hi + mid = round(x * 2^4) * 2^(-4).
+  // Then,
+  //   2^x = 2^(hi + mid + lo) = 2^hi * 2^mid * 2^lo.
+  // 2^mid is stored in the lookup table EXP_2_M of 16 elements.
+  // 2^lo is computed using a degree-6 minimax polynomial
+  // generated by Sollya.
+  // We perform 2^hi * 2^lo by simply add hi to the exponent field
+  // of 2^mid.
+
+  // kf = (hi + mid) * 2^4 = round(x * 2^4)
+  float kf = fputil::nearest_integer(x * 16.0f);
+  // dx = lo = x - (hi + mid) = x - kf * 2^(-4)
+  double dx = fputil::multiply_add(-0x1.0p-4f, kf, x);
+
+  int k = static_cast<int>(kf);
+  // hi = floor(kf * 2^(-4))
+  // exp_hi = shift hi to the exponent field of double precision.
+  int64_t exp_hi = static_cast<int64_t>(k >> 4)
+                   << fputil::FloatProperties<double>::MANTISSA_WIDTH;
+  // mh = 2^hi * 2^mid
+  // mh_bits = bit field of mh
+  int64_t mh_bits = EXP_2_M[k & 15] + exp_hi;
+  double mh = fputil::FPBits<double>(uint64_t(mh_bits)).get_val();
+
+  // Degree-5 polynomial approximating (2^x - 1)/x generating by Sollya with:
+  // > P = fpminimax((2^x - 1)/x, 5, [|D...|], [-1/32. 1/32]);
+  constexpr double COEFFS[6] = {0x1.62e42fefa39f3p-1,  0x1.ebfbdff82c57bp-3,
+                                0x1.c6b08d6f2d7aap-5,  0x1.3b2ab6fc92f5dp-7,
+                                0x1.5d897cfe27125p-10, 0x1.43090e61e6af1p-13};
+  double dx_sq = dx * dx;
+  double c1 = fputil::multiply_add(dx, COEFFS[1], COEFFS[0]);
+  double c2 = fputil::multiply_add(dx, COEFFS[3], COEFFS[2]);
+  double c3 = fputil::multiply_add(dx, COEFFS[5], COEFFS[4]);
+  double p = fputil::polyeval(dx_sq, c1, c2, c3);
+  // 2^x = 2^(hi + mid + lo)
+  //     = 2^(hi + mid) * 2^lo
+  //     ~ mh * (1 + lo * P(lo))
+  //     = mh + (mh*lo) * P(lo)
+  return fputil::multiply_add(p, dx * mh, mh);
 }
 
 } // namespace __llvm_libc

diff  --git a/libc/src/math/generic/explogxf.h b/libc/src/math/generic/explogxf.h
index 0e3cf9eecbdaa..1688dde1ef749 100644
--- a/libc/src/math/generic/explogxf.h
+++ b/libc/src/math/generic/explogxf.h
@@ -98,27 +98,6 @@ inline static exe_eval_result_t exp_eval(double x) {
   return {mult_e1, r};
 }
 
-inline static double exp2_eval(double x) {
-  double kf = fputil::nearest_integer(x * mlp);
-  double dx = fputil::multiply_add(mmld, kf, x);
-  double mult_f, ml;
-  {
-    uint32_t ps = static_cast<int>(kf) + (1 << (EXP_bits_p - 1)) +
-                  (fputil::FPBits<double>::EXPONENT_BIAS << EXP_bits_p);
-    fputil::FPBits<double> bs;
-    bs.set_unbiased_exponent(ps >> EXP_bits_p);
-    ml = 1.0 + EXP_2_POW[ps & (EXP_num_p - 1)];
-    mult_f = bs.get_val();
-  }
-
-  // Taylor series coefficients for 2^x
-  double pe = fputil::polyeval(
-      dx, 1.0, 0x1.62e42fefa39efp-1, 0x1.ebfbdff82c58fp-3, 0x1.c6b08d704a0c0p-5,
-      0x1.3b2ab6fba4e77p-7, 0x1.5d87fe78a6731p-10, 0x1.430912f86c787p-13);
-
-  return mult_f * ml * pe;
-}
-
 // x should be positive, normal finite value
 inline static double log2_eval(double x) {
   using FPB = fputil::FPBits<double>;

diff  --git a/libc/test/src/math/exhaustive/exp2f_test.cpp b/libc/test/src/math/exhaustive/exp2f_test.cpp
index dfb9eef89b808..da4a2dfa085b3 100644
--- a/libc/test/src/math/exhaustive/exp2f_test.cpp
+++ b/libc/test/src/math/exhaustive/exp2f_test.cpp
@@ -32,9 +32,9 @@ struct LlvmLibcExp2fExhaustiveTest : public LlvmLibcExhaustiveTest<uint32_t> {
   }
 };
 
-// Range: [0, 128];
+// Range: [0, +Inf];
 static constexpr uint32_t POS_START = 0x0000'0000U;
-static constexpr uint32_t POS_STOP = 0x4300'0000U;
+static constexpr uint32_t POS_STOP = 0x7f80'0000U;
 
 TEST_F(LlvmLibcExp2fExhaustiveTest, PostiveRangeRoundNearestTieToEven) {
   test_full_range(POS_START, POS_STOP, mpfr::RoundingMode::Nearest);
@@ -52,9 +52,9 @@ TEST_F(LlvmLibcExp2fExhaustiveTest, PostiveRangeRoundTowardZero) {
   test_full_range(POS_START, POS_STOP, mpfr::RoundingMode::TowardZero);
 }
 
-// Range: [-150, 0];
+// Range: [-Inf, 0];
 static constexpr uint32_t NEG_START = 0x8000'0000U;
-static constexpr uint32_t NEG_STOP = 0xc316'0000U;
+static constexpr uint32_t NEG_STOP = 0xff80'0000U;
 
 TEST_F(LlvmLibcExp2fExhaustiveTest, NegativeRangeRoundNearestTieToEven) {
   test_full_range(NEG_START, NEG_STOP, mpfr::RoundingMode::Nearest);

diff  --git a/libc/test/src/math/explogxf_test.cpp b/libc/test/src/math/explogxf_test.cpp
index cceb3d5158b4a..6aa9cb3ad0b87 100644
--- a/libc/test/src/math/explogxf_test.cpp
+++ b/libc/test/src/math/explogxf_test.cpp
@@ -39,15 +39,6 @@ TEST(LlvmLibcExpxfTest, InFloatRange) {
              def_prec);
 }
 
-TEST(LlvmLibcExp2xfTest, InFloatRange) {
-  auto f_check = [](float x) -> bool {
-    return !(
-        (isnan(x) || isinf(x) || x < -130 || x > 130 || fabsf(x) < 0x1.0p-10));
-  };
-  CHECK_DATA(0.0f, neg_inf, mpfr::Operation::Exp2, __llvm_libc::exp2_eval,
-             f_check, def_count, def_prec);
-}
-
 TEST(LlvmLibcLog2xfTest, InFloatRange) {
   CHECK_DATA(0.0f, inf, mpfr::Operation::Log2, __llvm_libc::log2_eval, f_normal,
              def_count, def_prec);