[libc-commits] [libc] 1c89ae7 - [libc][math] Improve sinhf and coshf performance.
Tue Ly via libc-commits
libc-commits at lists.llvm.org
Thu Sep 15 06:20:59 PDT 2022
Author: Tue Ly
Date: 2022-09-15T09:20:39-04:00
New Revision: 1c89ae71ea69a9203b35a9d96328f1c7ca54994c
URL: https://github.com/llvm/llvm-project/commit/1c89ae71ea69a9203b35a9d96328f1c7ca54994c
DIFF: https://github.com/llvm/llvm-project/commit/1c89ae71ea69a9203b35a9d96328f1c7ca54994c.diff
LOG: [libc][math] Improve sinhf and coshf performance.
Optimize `sinhf` and `coshf` by computing exp(x) and exp(-x) simultaneously.
Currently `sinhf` and `coshf` are implemented using the following formulas:
```
sinh(x) = 0.5 *(exp(x) - 1) - 0.5*(exp(-x) - 1)
cosh(x) = 0.5*exp(x) + 0.5*exp(-x)
```
where `exp(x)` and `exp(-x)` are calculated separately using the formula:
```
exp(x) ~ 2^hi * 2^mid * exp(dx)
~ 2^hi * 2^mid * P(dx)
```
By expanding the polynomial `P(dx)` into even and odd parts
```
P(dx) = P_even(dx) + dx * P_odd(dx)
```
we can see that the computations of `exp(x)` and `exp(-x)` have many things in common,
namely:
```
exp(x) ~ 2^(hi + mid) * (P_even(dx) + dx * P_odd(dx))
exp(-x) ~ 2^(-(hi + mid)) * (P_even(dx) - dx * P_odd(dx))
```
Expanding `sinh(x)` and `cosh(x)` with respect to the above formulas, we can compute
these two functions as follow in order to maximize the sharing parts:
```
sinh(x) = (e^x - e^(-x)) / 2
~ 0.5 * (P_even * (2^(hi + mid) - 2^(-(hi + mid))) +
dx * P_odd * (2^(hi + mid) + 2^(-(hi + mid))))
cosh(x) = (e^x + e^(-x)) / 2
~ 0.5 * (P_even * (2^(hi + mid) + 2^(-(hi + mid))) +
dx * P_odd * (2^(hi + mid) - 2^(-(hi + mid))))
```
So in this patch, we perform the following optimizations for `sinhf` and `coshf`:
# Use the above formulas to maximize sharing intermediate results,
# Apply similar optimizations from https://reviews.llvm.org/D133870
Performance benchmark using `perf` tool from the CORE-MATH project on Ryzen 1700:
For `sinhf`:
```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh sinhf
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput : 16.718
System LIBC reciprocal throughput : 63.151
BEFORE:
LIBC reciprocal throughput : 90.116
LIBC reciprocal throughput : 28.554 (with `-msse4.2` flag)
LIBC reciprocal throughput : 22.577 (with `-mfma` flag)
AFTER:
LIBC reciprocal throughput : 36.482
LIBC reciprocal throughput : 16.955 (with `-msse4.2` flag)
LIBC reciprocal throughput : 13.943 (with `-mfma` flag)
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh sinhf --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency : 48.821
System LIBC latency : 137.019
BEFORE
LIBC latency : 97.122
LIBC latency : 84.214 (with `-msse4.2` flag)
LIBC latency : 71.611 (with `-mfma` flag)
AFTER
LIBC latency : 54.555
LIBC latency : 50.865 (with `-msse4.2` flag)
LIBC latency : 48.700 (with `-mfma` flag)
```
For `coshf`:
```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh coshf
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput : 16.939
System LIBC reciprocal throughput : 19.695
BEFORE:
LIBC reciprocal throughput : 52.845
LIBC reciprocal throughput : 29.174 (with `-msse4.2` flag)
LIBC reciprocal throughput : 22.553 (with `-mfma` flag)
AFTER:
LIBC reciprocal throughput : 37.169
LIBC reciprocal throughput : 17.805 (with `-msse4.2` flag)
LIBC reciprocal throughput : 14.691 (with `-mfma` flag)
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh coshf --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency : 48.478
System LIBC latency : 48.044
BEFORE
LIBC latency : 99.123
LIBC latency : 85.595 (with `-msse4.2` flag)
LIBC latency : 72.776 (with `-mfma` flag)
AFTER
LIBC latency : 57.760
LIBC latency : 53.967 (with `-msse4.2` flag)
LIBC latency : 50.987 (with `-mfma` flag)
```
Reviewed By: orex, zimmermann6
Differential Revision: https://reviews.llvm.org/D133913
Added:
Modified:
libc/docs/math.rst
libc/src/math/generic/CMakeLists.txt
libc/src/math/generic/coshf.cpp
libc/src/math/generic/exp2f.cpp
libc/src/math/generic/explogxf.h
libc/src/math/generic/sinhf.cpp
Removed:
################################################################################
diff --git a/libc/docs/math.rst b/libc/docs/math.rst
index 3171f01aabc33..4592238890c50 100644
--- a/libc/docs/math.rst
+++ b/libc/docs/math.rst
@@ -215,7 +215,7 @@ Performance
+--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
| cosf | 13 | 32 | 53 | 59 | :math:`[0, 2\pi]` | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA |
+--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
-| coshf | 23 | 20 | 73 | 49 | :math:`[-10, 10]` | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA |
+| coshf | 15 | 20 | 51 | 48 | :math:`[-10, 10]` | Ryzen 1700 | Ubuntu 22.04 LTS x86_64 | Clang 14.0.0 | FMA |
+--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
| expf | 9 | 7 | 44 | 38 | :math:`[-10, 10]` | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA |
+--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
@@ -245,7 +245,7 @@ Performance
+--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
| sincosf | 19 | 30 | 57 | 68 | :math:`[-\pi, \pi]` | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA |
+--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
-| sinhf | 23 | 64 | 73 | 141 | :math:`[-10, 10]` | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA |
+| sinhf | 14 | 63 | 49 | 137 | :math:`[-10, 10]` | Ryzen 1700 | Ubuntu 22.04 LTS x86_64 | Clang 14.0.0 | FMA |
+--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
| tanf | 19 | 50 | 82 | 107 | :math:`[-\pi, \pi]` | Ryzen 1700 | Ubuntu 20.04 LTS x86_64 | Clang 12.0.0 | FMA |
+--------------+-----------+-------------------+-----------+-------------------+-------------------------------------+------------+-------------------------+--------------+---------------+
diff --git a/libc/src/math/generic/CMakeLists.txt b/libc/src/math/generic/CMakeLists.txt
index 77aaa3dd3e2a4..a10f8792fd370 100644
--- a/libc/src/math/generic/CMakeLists.txt
+++ b/libc/src/math/generic/CMakeLists.txt
@@ -549,6 +549,7 @@ add_entrypoint_object(
HDRS
../exp2f.h
DEPENDS
+ .explogxf
libc.src.__support.FPUtil.fenv_impl
libc.src.__support.FPUtil.fp_bits
libc.src.__support.FPUtil.multiply_add
diff --git a/libc/src/math/generic/coshf.cpp b/libc/src/math/generic/coshf.cpp
index 95f7f7775cf62..13a9def5dcec1 100644
--- a/libc/src/math/generic/coshf.cpp
+++ b/libc/src/math/generic/coshf.cpp
@@ -39,16 +39,13 @@ LLVM_LIBC_FUNCTION(float, coshf, (float x)) {
return x + FPBits::inf().get_val();
}
- auto ep_p = exp_eval<-1>(x);
- auto ep_m = exp_eval<-1>(-x);
- // 0.5 * exp(x) = ep_p.mult_exp * (ep_p.r + 1)
- // = ep_p.mult_exp * ep_p.r + ep_p.mult_exp
- // 0.5 * exp(-x) = ep_m.mult_exp * (ep_m.r + 1)
- // = ep_m.mult_exp * ep_m.r + ep_m.mult_exp
- // cos(x) = 0.5 * exp(x) + 0.5 * expm1(-x)
- double ep = fputil::multiply_add(ep_p.mult_exp, ep_p.r, ep_p.mult_exp) +
- fputil::multiply_add(ep_m.mult_exp, ep_m.r, ep_m.mult_exp);
- return ep;
+
+ // TODO: We should be able to reduce the latency and reciprocal throughput
+ // further by using a low degree (maybe 3-7 ?) minimax polynomial for small
+ // but not too small inputs, such as |x| < 2^-2, or |x| < 2^-3.
+
+ // cosh(x) = (e^x + e^(-x)) / 2.
+ return exp_pm_eval</*is_sinh*/ false>(x);
}
} // namespace __llvm_libc
diff --git a/libc/src/math/generic/exp2f.cpp b/libc/src/math/generic/exp2f.cpp
index 913e6b743f679..411ae3359b540 100644
--- a/libc/src/math/generic/exp2f.cpp
+++ b/libc/src/math/generic/exp2f.cpp
@@ -16,23 +16,14 @@
#include <errno.h>
+#include "explogxf.h"
+
namespace __llvm_libc {
constexpr uint32_t exval1 = 0x3b42'9d37U;
constexpr uint32_t exval2 = 0xbcf3'a937U;
constexpr uint32_t exval_mask = exval1 & exval2;
-// Look up table for bit fields of 2^(i/16) for i = 0..15, generated by Sollya
-// with:
-// > for i from 0 to 15 do printdouble(round(2^(i/16), D, RN));
-constexpr int64_t EXP_2_M[16] = {
- 0x3ff0000000000000, 0x3ff0b5586cf9890f, 0x3ff172b83c7d517b,
- 0x3ff2387a6e756238, 0x3ff306fe0a31b715, 0x3ff3dea64c123422,
- 0x3ff4bfdad5362a27, 0x3ff5ab07dd485429, 0x3ff6a09e667f3bcd,
- 0x3ff7a11473eb0187, 0x3ff8ace5422aa0db, 0x3ff9c49182a3f090,
- 0x3ffae89f995ad3ad, 0x3ffc199bdd85529c, 0x3ffd5818dcfba487,
- 0x3ffea4afa2a490da};
-
LLVM_LIBC_FUNCTION(float, exp2f, (float x)) {
using FPBits = typename fputil::FPBits<float>;
FPBits xbits(x);
@@ -101,7 +92,7 @@ LLVM_LIBC_FUNCTION(float, exp2f, (float x)) {
// 2^mid is stored in the lookup table EXP_2_M of 16 elements.
// 2^lo is computed using a degree-6 minimax polynomial
// generated by Sollya.
- // We perform 2^hi * 2^lo by simply add hi to the exponent field
+ // We perform 2^hi * 2^mid by simply add hi to the exponent field
// of 2^mid.
// kf = (hi + mid) * 2^4 = round(x * 2^4)
diff --git a/libc/src/math/generic/explogxf.h b/libc/src/math/generic/explogxf.h
index 1688dde1ef749..b5639a8ac4194 100644
--- a/libc/src/math/generic/explogxf.h
+++ b/libc/src/math/generic/explogxf.h
@@ -30,6 +30,17 @@ constexpr double mmld = -1.0 / mlp;
// printf("%.13a,\n", d[i]);
extern const double EXP_2_POW[EXP_num_p];
+// Look up table for bit fields of 2^(i/16) for i = 0..15, generated by Sollya
+// with:
+// > for i from 0 to 15 do printdouble(round(2^(i/16), D, RN));
+inline constexpr int64_t EXP_2_M[16] = {
+ 0x3ff0000000000000, 0x3ff0b5586cf9890f, 0x3ff172b83c7d517b,
+ 0x3ff2387a6e756238, 0x3ff306fe0a31b715, 0x3ff3dea64c123422,
+ 0x3ff4bfdad5362a27, 0x3ff5ab07dd485429, 0x3ff6a09e667f3bcd,
+ 0x3ff7a11473eb0187, 0x3ff8ace5422aa0db, 0x3ff9c49182a3f090,
+ 0x3ffae89f995ad3ad, 0x3ffc199bdd85529c, 0x3ffd5818dcfba487,
+ 0x3ffea4afa2a490da};
+
constexpr int LOG_P1_BITS = 6;
constexpr int LOG_P1_SIZE = 1 << LOG_P1_BITS;
@@ -54,11 +65,18 @@ extern const double K_LOG2_EVEN[4];
// EXP_num_p Precise value of the constant is not needed.
static constexpr double LN2_INV = 0x1.71547652b82fep+0 * EXP_num_p;
+// log2(e) * 2^4
+static constexpr double LOG2_E_4 = 0x1.71547652b82fep+4;
+
// LN2_HIGH + LN2_LOW = ln(2) with precision higher than double(ln(2))
// Minus sign is to use FMA directly.
static constexpr double LN2_HIGH = -0x1.62e42fefa0000p-1 / EXP_num_p;
static constexpr double LN2_LOW = -0x1.cf79abc9e3b3ap-40 / EXP_num_p;
+// -log(2) * 2^(-4)
+static constexpr double M_LN2_4_HI = -0x1.62e42fefa0000p-5;
+static constexpr double M_LN2_4_LO = -0x1.cf79abc9e3b3ap-44;
+
struct exe_eval_result_t {
// exp(x) = 2^MULT_POWER2 * mult_exp * (r + 1.0)
// where
@@ -98,6 +116,106 @@ inline static exe_eval_result_t exp_eval(double x) {
return {mult_e1, r};
}
+// The function correctly calculates sinh(x) and cosh(x) by calculating exp(x)
+// and exp(-x) simultaneously.
+// To compute e^x, we perform the following range
+// reduction: find hi, mid, lo such that:
+// x = (hi + mid) * log(2) + lo, in which
+// hi is an integer,
+// 0 <= mid * 2^4 < 16 is an integer
+// -2^(-5) <= lo * log2(e) <= 2^-5.
+// In particular,
+// hi + mid = round(x * log2(e) * 2^4) * 2^(-4).
+// Then,
+// e^x = 2^(hi + mid) * e^lo = 2^hi * 2^mid * e^lo.
+// 2^mid is stored in the lookup table EXP_2_M of 16 elements.
+// e^lo is computed using a degree-6 minimax polynomial
+// generated by Sollya:
+// e^lo ~ P(lo) = 1 + lo + c2 * lo^2 + ... + c6 * lo^6
+// = (1 + c2*lo^2 + c4*lo^4 + c6*lo^6) + lo * (1 + c3*lo^2 + c5*lo^4)
+// = P_even + lo * P_odd
+// We perform 2^hi * 2^mid by simply add hi to the exponent field
+// of 2^mid.
+// To compute e^(-x), notice that:
+// e^(-x) = 2^(-(hi + mid)) * e^(-lo)
+// ~ 2^(-(hi + mid)) * P(-lo)
+// = 2^(-(hi + mid)) * (P_even - lo * P_odd)
+// So:
+// sinh(x) = (e^x - e^(-x)) / 2
+// ~ 0.5 * (2^(hi + mid) * (P_even + lo * P_odd) -
+// 2^(-(hi + mid)) * (P_even - lo * P_odd))
+// = 0.5 * (P_even * (2^(hi + mid) - 2^(-(hi + mid))) +
+// lo * P_odd * (2^(hi + mid) + 2^(-(hi + mid))))
+// And similarly:
+// cosh(x) = (e^x + e^(-x)) / 2
+// ~ 0.5 * (P_even * (2^(hi + mid) + 2^(-(hi + mid))) +
+// lo * P_odd * (2^(hi + mid) - 2^(-(hi + mid))))
+// The main point of these formulas is that the expensive part of calculating
+// the polynomials approximating lower parts of e^(x) and e^(-x) are shared
+// and only done once.
+template <bool is_sinh> static inline double exp_pm_eval(float x) {
+ double xd = static_cast<double>(x);
+
+ // round(x * log2(e) * 2^4)
+ double kd = fputil::nearest_integer(LOG2_E_4 * xd);
+
+ // k_p = round(x * log2(e) * 2^4)
+ int k_p = static_cast<int>(kd);
+ // k_m = round(-x * log2(e) * 2^4)
+ int k_m = -k_p;
+
+ // hi = floor(kf * 2^(-4))
+ // exp_hi = shift hi to the exponent field of double precision.
+ int64_t exp_hi_p = static_cast<int64_t>((k_p >> 4))
+ << fputil::FloatProperties<double>::MANTISSA_WIDTH;
+ int64_t exp_hi_m = static_cast<int64_t>((k_m >> 4))
+ << fputil::FloatProperties<double>::MANTISSA_WIDTH;
+ // mh = 2^hi * 2^mid
+ // mh_bits = bit field of mh
+ int64_t mh_bits_p = EXP_2_M[k_p & 15] + exp_hi_p;
+ int64_t mh_bits_m = EXP_2_M[k_m & 15] + exp_hi_m;
+ double mh_p = fputil::FPBits<double>(uint64_t(mh_bits_p)).get_val();
+ double mh_m = fputil::FPBits<double>(uint64_t(mh_bits_m)).get_val();
+ // mh_sum = 2^(hi + mid) + 2^(-(hi + mid))
+ double mh_sum = mh_p + mh_m;
+ // mh_
diff = 2^(hi + mid) - 2^(-(hi + mid))
+ double mh_
diff = mh_p - mh_m;
+
+ // dx = lo = x - (hi + mid) * log(2)
+ double dx = fputil::multiply_add(kd, M_LN2_4_LO,
+ fputil::multiply_add(kd, M_LN2_4_HI, xd));
+ double dx2 = dx * dx;
+
+ // Polynomials generated by Sollya with:
+ // Q = fpminimax(expm1(x)/x, 5, [|1, D...|], [-1/32*log(2), 1/32*log(2)]);
+ // Then:
+ // e^lo ~ P(dx) = 1 + dx + COEFFS[0] * dx^2 + ... + COEFFS[4] * dx^6.
+ constexpr double COEFFS[5] = {0x1.fffffffffffep-2, 0x1.55555554ad3f3p-3,
+ 0x1.55555557179cap-5, 0x1.111228f3478c9p-7,
+ 0x1.6c161beccc69dp-10};
+ // c0 = 1 + COEFFS[0] * lo^2
+ double c0 = fputil::multiply_add(dx2, COEFFS[0], 1.0);
+ // c1 = 1 + COEFFS[0] * lo^2
+ double c1 = fputil::multiply_add(dx2, COEFFS[1], 1.0);
+ // c2 = COEFFS[2] + COEFFS[4] * lo^2
+ double c2 = fputil::multiply_add(dx2, COEFFS[4], COEFFS[2]);
+ double dx4 = dx2 * dx2;
+ // P_even = c0 + c2 * lo^4
+ // = (1 + COEFFS[0] * lo^2) + lo^4 * (COEFFS[2] + COEFFS[4] * lo^2)
+ // = 1 + COEFFS[0] * lo^2 + COEFFS[2] * lo^4 + COEFFS[4] * lo^6
+ double p_even = fputil::multiply_add(dx4, c2, c0);
+ // P_odd = c1 + COEFFS[3] * lo^4
+ // = 1 + COEFFS[1] * lo^2 + COEFFS[3] * lo^4
+ double p_odd = fputil::multiply_add(dx4, COEFFS[3], c1);
+
+ double r;
+ if constexpr (is_sinh)
+ r = fputil::multiply_add(dx * mh_sum, p_odd, p_even * mh_
diff );
+ else
+ r = fputil::multiply_add(dx * mh_
diff , p_odd, p_even * mh_sum);
+ return 0.5 * r;
+}
+
// x should be positive, normal finite value
inline static double log2_eval(double x) {
using FPB = fputil::FPBits<double>;
diff --git a/libc/src/math/generic/sinhf.cpp b/libc/src/math/generic/sinhf.cpp
index 7b5ac60454a3b..8f16c381b5bc0 100644
--- a/libc/src/math/generic/sinhf.cpp
+++ b/libc/src/math/generic/sinhf.cpp
@@ -66,19 +66,8 @@ LLVM_LIBC_FUNCTION(float, sinhf, (float x)) {
return fputil::multiply_add(xdbl, pe, xdbl);
}
- // MULT_POWER2 = -1
- auto ep_p = exp_eval<-1>(x); // 0.5 * exp(x)
- auto ep_m = exp_eval<-1>(-x); // 0.5 * exp(-x)
-
- // 0.5 * expm1(x) = ep_p.mult_exp * (ep_p.r + 1) - 0.5
- // = ep_p.mult_exp * ep_p.r + ep_p.mult_exp - 0.5
- // 0.5 * expm1(-x) = ep_m.mult_exp * (ep_m.r + 1) - 0.5
- // = ep_m.mult_exp * ep_m.r + ep_m.mult_exp - 0.5
- // sinh(x) = 0.5 * expm1(x) - 0.5 * expm1(-x)
- // Using expm1 instead of exp improved precision around zero.
- double ep = fputil::multiply_add(ep_p.mult_exp, ep_p.r, ep_p.mult_exp - 0.5) -
- fputil::multiply_add(ep_m.mult_exp, ep_m.r, ep_m.mult_exp - 0.5);
- return ep;
+ // sinh(x) = (e^x - e^(-x)) / 2.
+ return exp_pm_eval</*is_sinh*/ true>(x);
}
} // namespace __llvm_libc
More information about the libc-commits
mailing list