[libc-commits] [libc] 1c89ae7 - [libc][math] Improve sinhf and coshf performance.

Thu Sep 15 06:20:59 PDT 2022

Author: Tue Ly
Date: 2022-09-15T09:20:39-04:00
New Revision: 1c89ae71ea69a9203b35a9d96328f1c7ca54994c

URL: https://github.com/llvm/llvm-project/commit/1c89ae71ea69a9203b35a9d96328f1c7ca54994c
DIFF: https://github.com/llvm/llvm-project/commit/1c89ae71ea69a9203b35a9d96328f1c7ca54994c.diff

LOG: [libc][math] Improve sinhf and coshf performance.

Optimize `sinhf` and `coshf` by computing exp(x) and exp(-x) simultaneously.

Currently `sinhf` and `coshf` are implemented using the following formulas:
```
  sinh(x) = 0.5 *(exp(x) - 1) - 0.5*(exp(-x) - 1)
  cosh(x) = 0.5*exp(x) + 0.5*exp(-x)
```
where `exp(x)` and `exp(-x)` are calculated separately using the formula:
```
  exp(x) ~ 2^hi * 2^mid * exp(dx)
         ~ 2^hi * 2^mid * P(dx)
```
By expanding the polynomial `P(dx)` into even and odd parts
```
  P(dx) = P_even(dx) + dx * P_odd(dx)
```
we can see that the computations of `exp(x)` and `exp(-x)` have many things in common,
namely:
```
  exp(x)  ~ 2^(hi + mid) * (P_even(dx) + dx * P_odd(dx))
  exp(-x) ~ 2^(-(hi + mid)) * (P_even(dx) - dx * P_odd(dx))
```
Expanding `sinh(x)` and `cosh(x)` with respect to the above formulas, we can compute
these two functions as follow in order to maximize the sharing parts:
```
  sinh(x) = (e^x - e^(-x)) / 2
          ~ 0.5 * (P_even * (2^(hi + mid) - 2^(-(hi + mid))) +
                  dx * P_odd * (2^(hi + mid) + 2^(-(hi + mid))))
  cosh(x) = (e^x + e^(-x)) / 2
          ~ 0.5 * (P_even * (2^(hi + mid) + 2^(-(hi + mid))) +
                  dx * P_odd * (2^(hi + mid) - 2^(-(hi + mid))))
```
So in this patch, we perform the following optimizations for `sinhf` and `coshf`:
  # Use the above formulas to maximize sharing intermediate results,
  # Apply similar optimizations from https://reviews.llvm.org/D133870

Performance benchmark using `perf` tool from the CORE-MATH project on Ryzen 1700:
For `sinhf`:
```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh sinhf
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput   : 16.718
System LIBC reciprocal throughput : 63.151

BEFORE:
LIBC reciprocal throughput        : 90.116
LIBC reciprocal throughput        : 28.554    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 22.577    (with `-mfma` flag)

AFTER:
LIBC reciprocal throughput        : 36.482
LIBC reciprocal throughput        : 16.955    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 13.943    (with `-mfma` flag)

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh sinhf --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency   : 48.821
System LIBC latency : 137.019

BEFORE
LIBC latency        : 97.122
LIBC latency        : 84.214    (with `-msse4.2` flag)
LIBC latency        : 71.611    (with `-mfma` flag)

AFTER
LIBC latency        : 54.555
LIBC latency        : 50.865    (with `-msse4.2` flag)
LIBC latency        : 48.700    (with `-mfma` flag)
```
For `coshf`:
```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh coshf
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput   : 16.939
System LIBC reciprocal throughput : 19.695

BEFORE:
LIBC reciprocal throughput        : 52.845
LIBC reciprocal throughput        : 29.174    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 22.553    (with `-mfma` flag)

AFTER:
LIBC reciprocal throughput        : 37.169
LIBC reciprocal throughput        : 17.805    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 14.691    (with `-mfma` flag)

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh coshf --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency   : 48.478
System LIBC latency : 48.044

BEFORE
LIBC latency        : 99.123
LIBC latency        : 85.595    (with `-msse4.2` flag)
LIBC latency        : 72.776    (with `-mfma` flag)

AFTER
LIBC latency        : 57.760
LIBC latency        : 53.967    (with `-msse4.2` flag)
LIBC latency        : 50.987    (with `-mfma` flag)
```

Reviewed By: orex, zimmermann6

Differential Revision: https://reviews.llvm.org/D133913

Added: 
    

Modified: 
    libc/docs/math.rst
    libc/src/math/generic/CMakeLists.txt
    libc/src/math/generic/coshf.cpp
    libc/src/math/generic/exp2f.cpp
    libc/src/math/generic/explogxf.h
    libc/src/math/generic/sinhf.cpp

Removed: 
    


################################################################################
diff  --git a/libc/docs/math.rst b/libc/docs/math.rst
index 3171f01aabc33..4592238890c50 100644

--- a/libc/docs/math.rst
+++ b/libc/docs/math.rst
@@ -215,7 +215,7 @@ Performance
 +--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
 | cosf         |        13 |                32 |        53 |                59 | :math:`[0, 2\pi]`                   | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA           |
 +--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
-| coshf        |        23 |                20 |        73 |                49 | :math:`[-10, 10]`                   | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA           |
+| coshf        |        15 |                20 |        51 |                48 | :math:`[-10, 10]`                   | Ryzen 1700 | Ubuntu 22.04 LTS x86_64 | Clang 14.0.0 | FMA           |
 +--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
 | expf         |         9 |                 7 |        44 |                38 | :math:`[-10, 10]`                   | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA           |
 +--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
@@ -245,7 +245,7 @@ Performance
 +--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
 | sincosf      |        19 |                30 |        57 |                68 | :math:`[-\pi, \pi]`                 | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA           |
 +--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
-| sinhf        |        23 |                64 |        73 |               141 | :math:`[-10, 10]`                   | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA           |
+| sinhf        |        14 |                63 |        49 |               137 | :math:`[-10, 10]`                   | Ryzen 1700 | Ubuntu 22.04 LTS x86_64 | Clang 14.0.0 | FMA           |
 +--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
 | tanf         |        19 |                50 |        82 |               107 | :math:`[-\pi, \pi]`                 | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA           |
 +--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+

diff  --git a/libc/src/math/generic/CMakeLists.txt b/libc/src/math/generic/CMakeLists.txt
index 77aaa3dd3e2a4..a10f8792fd370 100644
--- a/libc/src/math/generic/CMakeLists.txt
+++ b/libc/src/math/generic/CMakeLists.txt
@@ -549,6 +549,7 @@ add_entrypoint_object(
   HDRS
     ../exp2f.h
   DEPENDS
+    .explogxf
     libc.src.__support.FPUtil.fenv_impl
     libc.src.__support.FPUtil.fp_bits
     libc.src.__support.FPUtil.multiply_add

diff  --git a/libc/src/math/generic/coshf.cpp b/libc/src/math/generic/coshf.cpp
index 95f7f7775cf62..13a9def5dcec1 100644
--- a/libc/src/math/generic/coshf.cpp
+++ b/libc/src/math/generic/coshf.cpp
@@ -39,16 +39,13 @@ LLVM_LIBC_FUNCTION(float, coshf, (float x)) {
 
     return x + FPBits::inf().get_val();
   }
-  auto ep_p = exp_eval<-1>(x);
-  auto ep_m = exp_eval<-1>(-x);
-  // 0.5 * exp(x)  = ep_p.mult_exp * (ep_p.r + 1)
-  //               = ep_p.mult_exp * ep_p.r + ep_p.mult_exp
-  // 0.5 * exp(-x) = ep_m.mult_exp * (ep_m.r + 1)
-  //               = ep_m.mult_exp * ep_m.r + ep_m.mult_exp
-  // cos(x) = 0.5 * exp(x) + 0.5 * expm1(-x)
-  double ep = fputil::multiply_add(ep_p.mult_exp, ep_p.r, ep_p.mult_exp) +
-              fputil::multiply_add(ep_m.mult_exp, ep_m.r, ep_m.mult_exp);
-  return ep;
+
+  // TODO: We should be able to reduce the latency and reciprocal throughput
+  // further by using a low degree (maybe 3-7 ?) minimax polynomial for small
+  // but not too small inputs, such as |x| < 2^-2, or |x| < 2^-3.
+
+  // cosh(x) = (e^x + e^(-x)) / 2.
+  return exp_pm_eval</*is_sinh*/ false>(x);
 }
 
 } // namespace __llvm_libc

diff  --git a/libc/src/math/generic/exp2f.cpp b/libc/src/math/generic/exp2f.cpp
index 913e6b743f679..411ae3359b540 100644
--- a/libc/src/math/generic/exp2f.cpp
+++ b/libc/src/math/generic/exp2f.cpp
@@ -16,23 +16,14 @@
 
 #include <errno.h>
 
+#include "explogxf.h"
+
 namespace __llvm_libc {
 
 constexpr uint32_t exval1 = 0x3b42'9d37U;
 constexpr uint32_t exval2 = 0xbcf3'a937U;
 constexpr uint32_t exval_mask = exval1 & exval2;
 
-// Look up table for bit fields of 2^(i/16) for i = 0..15, generated by Sollya
-// with:
-// > for i from 0 to 15 do printdouble(round(2^(i/16), D, RN));
-constexpr int64_t EXP_2_M[16] = {
-    0x3ff0000000000000, 0x3ff0b5586cf9890f, 0x3ff172b83c7d517b,
-    0x3ff2387a6e756238, 0x3ff306fe0a31b715, 0x3ff3dea64c123422,
-    0x3ff4bfdad5362a27, 0x3ff5ab07dd485429, 0x3ff6a09e667f3bcd,
-    0x3ff7a11473eb0187, 0x3ff8ace5422aa0db, 0x3ff9c49182a3f090,
-    0x3ffae89f995ad3ad, 0x3ffc199bdd85529c, 0x3ffd5818dcfba487,
-    0x3ffea4afa2a490da};
-
 LLVM_LIBC_FUNCTION(float, exp2f, (float x)) {
   using FPBits = typename fputil::FPBits<float>;
   FPBits xbits(x);
@@ -101,7 +92,7 @@ LLVM_LIBC_FUNCTION(float, exp2f, (float x)) {
   // 2^mid is stored in the lookup table EXP_2_M of 16 elements.
   // 2^lo is computed using a degree-6 minimax polynomial
   // generated by Sollya.
-  // We perform 2^hi * 2^lo by simply add hi to the exponent field
+  // We perform 2^hi * 2^mid by simply add hi to the exponent field
   // of 2^mid.
 
   // kf = (hi + mid) * 2^4 = round(x * 2^4)

diff  --git a/libc/src/math/generic/explogxf.h b/libc/src/math/generic/explogxf.h
index 1688dde1ef749..b5639a8ac4194 100644
--- a/libc/src/math/generic/explogxf.h
+++ b/libc/src/math/generic/explogxf.h
@@ -30,6 +30,17 @@ constexpr double mmld = -1.0 / mlp;
 // printf("%.13a,\n", d[i]);
 extern const double EXP_2_POW[EXP_num_p];
 
+// Look up table for bit fields of 2^(i/16) for i = 0..15, generated by Sollya
+// with:
+// > for i from 0 to 15 do printdouble(round(2^(i/16), D, RN));
+inline constexpr int64_t EXP_2_M[16] = {
+    0x3ff0000000000000, 0x3ff0b5586cf9890f, 0x3ff172b83c7d517b,
+    0x3ff2387a6e756238, 0x3ff306fe0a31b715, 0x3ff3dea64c123422,
+    0x3ff4bfdad5362a27, 0x3ff5ab07dd485429, 0x3ff6a09e667f3bcd,
+    0x3ff7a11473eb0187, 0x3ff8ace5422aa0db, 0x3ff9c49182a3f090,
+    0x3ffae89f995ad3ad, 0x3ffc199bdd85529c, 0x3ffd5818dcfba487,
+    0x3ffea4afa2a490da};
+
 constexpr int LOG_P1_BITS = 6;
 constexpr int LOG_P1_SIZE = 1 << LOG_P1_BITS;
 
@@ -54,11 +65,18 @@ extern const double K_LOG2_EVEN[4];
 // EXP_num_p Precise value of the constant is not needed.
 static constexpr double LN2_INV = 0x1.71547652b82fep+0 * EXP_num_p;
 
+// log2(e) * 2^4
+static constexpr double LOG2_E_4 = 0x1.71547652b82fep+4;
+
 // LN2_HIGH + LN2_LOW = ln(2) with precision higher than double(ln(2))
 // Minus sign is to use FMA directly.
 static constexpr double LN2_HIGH = -0x1.62e42fefa0000p-1 / EXP_num_p;
 static constexpr double LN2_LOW = -0x1.cf79abc9e3b3ap-40 / EXP_num_p;
 
+// -log(2) * 2^(-4)
+static constexpr double M_LN2_4_HI = -0x1.62e42fefa0000p-5;
+static constexpr double M_LN2_4_LO = -0x1.cf79abc9e3b3ap-44;
+
 struct exe_eval_result_t {
   // exp(x) = 2^MULT_POWER2 * mult_exp * (r + 1.0)
   // where
@@ -98,6 +116,106 @@ inline static exe_eval_result_t exp_eval(double x) {
   return {mult_e1, r};
 }
 
+// The function correctly calculates sinh(x) and cosh(x) by calculating exp(x)
+// and exp(-x) simultaneously.
+// To compute e^x, we perform the following range
+// reduction: find hi, mid, lo such that:
+//   x = (hi + mid) * log(2) + lo, in which
+//     hi is an integer,
+//     0 <= mid * 2^4 < 16 is an integer
+//     -2^(-5) <= lo * log2(e) <= 2^-5.
+// In particular,
+//   hi + mid = round(x * log2(e) * 2^4) * 2^(-4).
+// Then,
+//   e^x = 2^(hi + mid) * e^lo = 2^hi * 2^mid * e^lo.
+// 2^mid is stored in the lookup table EXP_2_M of 16 elements.
+// e^lo is computed using a degree-6 minimax polynomial
+// generated by Sollya:
+//   e^lo ~ P(lo) = 1 + lo + c2 * lo^2 + ... + c6 * lo^6
+//        = (1 + c2*lo^2 + c4*lo^4 + c6*lo^6) + lo * (1 + c3*lo^2 + c5*lo^4)
+//        = P_even + lo * P_odd
+// We perform 2^hi * 2^mid by simply add hi to the exponent field
+// of 2^mid.
+// To compute e^(-x), notice that:
+//   e^(-x) = 2^(-(hi + mid)) * e^(-lo)
+//          ~ 2^(-(hi + mid)) * P(-lo)
+//          = 2^(-(hi + mid)) * (P_even - lo * P_odd)
+// So:
+//   sinh(x) = (e^x - e^(-x)) / 2
+//           ~ 0.5 * (2^(hi + mid) * (P_even + lo * P_odd) -
+//                    2^(-(hi + mid)) * (P_even - lo * P_odd))
+//           = 0.5 * (P_even * (2^(hi + mid) - 2^(-(hi + mid))) +
+//                    lo * P_odd * (2^(hi + mid) + 2^(-(hi + mid))))
+// And similarly:
+//   cosh(x) = (e^x + e^(-x)) / 2
+//           ~ 0.5 * (P_even * (2^(hi + mid) + 2^(-(hi + mid))) +
+//                    lo * P_odd * (2^(hi + mid) - 2^(-(hi + mid))))
+// The main point of these formulas is that the expensive part of calculating
+// the polynomials approximating lower parts of e^(x) and e^(-x) are shared
+// and only done once.
+template <bool is_sinh> static inline double exp_pm_eval(float x) {
+  double xd = static_cast<double>(x);
+
+  // round(x * log2(e) * 2^4)
+  double kd = fputil::nearest_integer(LOG2_E_4 * xd);
+
+  // k_p = round(x * log2(e) * 2^4)
+  int k_p = static_cast<int>(kd);
+  // k_m = round(-x * log2(e) * 2^4)
+  int k_m = -k_p;
+
+  // hi = floor(kf * 2^(-4))
+  // exp_hi = shift hi to the exponent field of double precision.
+  int64_t exp_hi_p = static_cast<int64_t>((k_p >> 4))
+                     << fputil::FloatProperties<double>::MANTISSA_WIDTH;
+  int64_t exp_hi_m = static_cast<int64_t>((k_m >> 4))
+                     << fputil::FloatProperties<double>::MANTISSA_WIDTH;
+  // mh = 2^hi * 2^mid
+  // mh_bits = bit field of mh
+  int64_t mh_bits_p = EXP_2_M[k_p & 15] + exp_hi_p;
+  int64_t mh_bits_m = EXP_2_M[k_m & 15] + exp_hi_m;
+  double mh_p = fputil::FPBits<double>(uint64_t(mh_bits_p)).get_val();
+  double mh_m = fputil::FPBits<double>(uint64_t(mh_bits_m)).get_val();
+  // mh_sum = 2^(hi + mid) + 2^(-(hi + mid))
+  double mh_sum = mh_p + mh_m;
+  // mh_
diff  = 2^(hi + mid) - 2^(-(hi + mid))
+  double mh_
diff  = mh_p - mh_m;
+
+  // dx = lo = x - (hi + mid) * log(2)
+  double dx = fputil::multiply_add(kd, M_LN2_4_LO,
+                                   fputil::multiply_add(kd, M_LN2_4_HI, xd));
+  double dx2 = dx * dx;
+
+  // Polynomials generated by Sollya with:
+  // Q = fpminimax(expm1(x)/x, 5, [|1, D...|], [-1/32*log(2), 1/32*log(2)]);
+  // Then:
+  //   e^lo ~ P(dx) = 1 + dx + COEFFS[0] * dx^2 + ... + COEFFS[4] * dx^6.
+  constexpr double COEFFS[5] = {0x1.fffffffffffep-2, 0x1.55555554ad3f3p-3,
+                                0x1.55555557179cap-5, 0x1.111228f3478c9p-7,
+                                0x1.6c161beccc69dp-10};
+  // c0 = 1 + COEFFS[0] * lo^2
+  double c0 = fputil::multiply_add(dx2, COEFFS[0], 1.0);
+  // c1 = 1 + COEFFS[0] * lo^2
+  double c1 = fputil::multiply_add(dx2, COEFFS[1], 1.0);
+  // c2 = COEFFS[2] + COEFFS[4] * lo^2
+  double c2 = fputil::multiply_add(dx2, COEFFS[4], COEFFS[2]);
+  double dx4 = dx2 * dx2;
+  // P_even = c0 + c2 * lo^4
+  //        = (1 + COEFFS[0] * lo^2) + lo^4 * (COEFFS[2] + COEFFS[4] * lo^2)
+  //        = 1 + COEFFS[0] * lo^2 + COEFFS[2] * lo^4 + COEFFS[4] * lo^6
+  double p_even = fputil::multiply_add(dx4, c2, c0);
+  // P_odd = c1 + COEFFS[3] * lo^4
+  //       = 1 + COEFFS[1] * lo^2 + COEFFS[3] * lo^4
+  double p_odd = fputil::multiply_add(dx4, COEFFS[3], c1);
+
+  double r;
+  if constexpr (is_sinh)
+    r = fputil::multiply_add(dx * mh_sum, p_odd, p_even * mh_
diff );
+  else
+    r = fputil::multiply_add(dx * mh_
diff , p_odd, p_even * mh_sum);
+  return 0.5 * r;
+}
+
 // x should be positive, normal finite value
 inline static double log2_eval(double x) {
   using FPB = fputil::FPBits<double>;

diff  --git a/libc/src/math/generic/sinhf.cpp b/libc/src/math/generic/sinhf.cpp
index 7b5ac60454a3b..8f16c381b5bc0 100644
--- a/libc/src/math/generic/sinhf.cpp
+++ b/libc/src/math/generic/sinhf.cpp
@@ -66,19 +66,8 @@ LLVM_LIBC_FUNCTION(float, sinhf, (float x)) {
     return fputil::multiply_add(xdbl, pe, xdbl);
   }
 
-  // MULT_POWER2 = -1
-  auto ep_p = exp_eval<-1>(x);  // 0.5 * exp(x)
-  auto ep_m = exp_eval<-1>(-x); // 0.5 * exp(-x)
-
-  // 0.5 * expm1(x)  = ep_p.mult_exp * (ep_p.r + 1) - 0.5
-  //               = ep_p.mult_exp * ep_p.r + ep_p.mult_exp - 0.5
-  // 0.5 * expm1(-x) = ep_m.mult_exp * (ep_m.r + 1) - 0.5
-  //               = ep_m.mult_exp * ep_m.r + ep_m.mult_exp - 0.5
-  // sinh(x) = 0.5 * expm1(x) - 0.5 * expm1(-x)
-  // Using expm1 instead of exp improved precision around zero.
-  double ep = fputil::multiply_add(ep_p.mult_exp, ep_p.r, ep_p.mult_exp - 0.5) -
-              fputil::multiply_add(ep_m.mult_exp, ep_m.r, ep_m.mult_exp - 0.5);
-  return ep;
+  // sinh(x) = (e^x - e^(-x)) / 2.
+  return exp_pm_eval</*is_sinh*/ true>(x);
 }
 
 } // namespace __llvm_libc