[compiler-rt] 68c011a - [builtins] Optimize udivmodti4 for many platforms.
Clement Courbet via llvm-commits
llvm-commits at lists.llvm.org
Fri Jul 10 01:03:50 PDT 2020
Author: Danila Kutenin
Date: 2020-07-10T09:59:16+02:00
New Revision: 68c011aa085ab8ec198198e45c83de605a7dc31f
URL: https://github.com/llvm/llvm-project/commit/68c011aa085ab8ec198198e45c83de605a7dc31f
DIFF: https://github.com/llvm/llvm-project/commit/68c011aa085ab8ec198198e45c83de605a7dc31f.diff
LOG: [builtins] Optimize udivmodti4 for many platforms.
Summary:
While benchmarking uint128 division, we found that it has very high latency for small divisors:
https://reviews.llvm.org/D83027
```
Benchmark                                                 Time(ns)  CPU(ns)  Iterations
---------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128>        13.0     13.0    55000000
BM_DivideIntrinsic128UniformDivisor<__int128>                 14.3     14.3    50000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>     13.5     13.5    52000000
BM_RemainderIntrinsic128UniformDivisor<__int128>              14.1     14.1    50000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>           153      153     5000000
BM_DivideIntrinsic128SmallDivisor<__int128>                    170      170     3000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>        153      153     5000000
BM_RemainderIntrinsic128SmallDivisor<__int128>                 155      155     5000000
```
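For context, callers rarely invoke __udivmodti4 directly: unsigned __int128 division and modulo are typically lowered by the compiler to __udivti3 and __umodti3, which compiler-rt implements on top of __udivmodti4. A minimal caller-side illustration (the values here are arbitrary):
```
#include <stdio.h>

int main(void) {
  // A 128-bit dividend with a small 64-bit divisor: the case this patch
  // speeds up. On 64-bit targets without native 128-bit division, the
  // compiler emits calls into the runtime library for these operators.
  unsigned __int128 a = ((unsigned __int128)0xDEADBEEFULL << 64) | 12345;
  unsigned long long b = 1000000007ULL;
  unsigned __int128 q = a / b;                        // -> __udivti3
  unsigned long long r = (unsigned long long)(a % b); // -> __umodti3
  printf("quotient (low 64 bits): %llu, remainder: %llu\n",
         (unsigned long long)q, r);
  return 0;
}
```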
This patch suggests a more optimized version of the division:
If the divisor fits in 64 bits, we can proceed with the divq instruction on x86, or with a software 128-by-64 division on other platforms. Once both the divisor and the dividend are at least 2**64, we use a branch-free shift-subtract algorithm, which takes at most 64 iterations (a standalone sketch of that loop appears after the benchmark tables below). With this change our benchmarks improved significantly:
```
Benchmark                                                 Time(ns)  CPU(ns)  Iterations
---------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128>        11.0     11.0    64000000
BM_DivideIntrinsic128UniformDivisor<__int128>                 13.8     13.8    51000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>     11.6     11.6    61000000
BM_RemainderIntrinsic128UniformDivisor<__int128>              13.7     13.7    52000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>          27.1     27.1    26000000
BM_DivideIntrinsic128SmallDivisor<__int128>                   29.4     29.4    24000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>       27.9     27.8    26000000
BM_RemainderIntrinsic128SmallDivisor<__int128>                29.1     29.1    25000000
```
If not using the divq instruction, it is still much better:
```
Benchmark                                                 Time(ns)  CPU(ns)  Iterations
---------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128>        12.2     12.2    58000000
BM_DivideIntrinsic128UniformDivisor<__int128>                 13.5     13.5    52000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>     12.7     12.7    56000000
BM_RemainderIntrinsic128UniformDivisor<__int128>              13.7     13.7    51000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>          30.2     30.2    24000000
BM_DivideIntrinsic128SmallDivisor<__int128>                   33.2     33.2    22000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>       31.4     31.4    23000000
BM_RemainderIntrinsic128SmallDivisor<__int128>                33.8     33.8    21000000
```
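Both x86 variants above use the same shape for 64-bit divisors: a 128-by-64 division whose quotient must fit in 64 bits (divq faults on overflow), so the high word of the dividend is reduced first. A minimal sketch of that decomposition, using compiler-provided __int128 arithmetic as a stand-in for the patch's udiv128by64to64 helper (illustrative only, not the patch's code):
```
#include <stdint.h>
#include <stdio.h>

// Stand-in for udiv128by64to64: divides (hi:lo) by v. The caller must
// guarantee hi < v so the 64-bit quotient cannot overflow.
static uint64_t div128by64(uint64_t hi, uint64_t lo, uint64_t v,
                           uint64_t *rem) {
  unsigned __int128 n = ((unsigned __int128)hi << 64) | lo;
  *rem = (uint64_t)(n % v);
  return (uint64_t)(n / v);
}

// Two-step division of a 128-bit dividend by a 64-bit divisor v != 0.
static unsigned __int128 udiv_by_64bit_divisor(uint64_t d_hi, uint64_t d_lo,
                                               uint64_t v, uint64_t *rem) {
  uint64_t q_hi = 0;
  if (d_hi >= v) {
    // Reduce the high word first; afterwards d_hi < v, so the 128/64
    // step below is guaranteed not to overflow.
    q_hi = d_hi / v;
    d_hi = d_hi % v;
  }
  uint64_t q_lo = div128by64(d_hi, d_lo, v, rem);
  return ((unsigned __int128)q_hi << 64) | q_lo;
}

int main(void) {
  uint64_t rem;
  unsigned __int128 q = udiv_by_64bit_divisor(3, 5, 7, &rem);
  printf("%llu %llu\n", (unsigned long long)q, (unsigned long long)rem);
  // Prints "7905747460161236407 4": (3 * 2**64 + 5) = 7 * q + 4.
  return 0;
}
```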
PowerPC benchmarks:
Before:
```
Benchmark                                                 Time(ns)  CPU(ns)  Iterations
---------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128>        22.3     22.3    32000000
BM_DivideIntrinsic128UniformDivisor<__int128>                 23.8     23.8    30000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>     22.5     22.5    32000000
BM_RemainderIntrinsic128UniformDivisor<__int128>              24.9     24.9    29000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>           394      394     2000000
BM_DivideIntrinsic128SmallDivisor<__int128>                    397      397     2000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>        399      399     2000000
BM_RemainderIntrinsic128SmallDivisor<__int128>                 397      397     2000000
```
With this patch
```
Benchmark                                                 Time(ns)  CPU(ns)  Iterations
---------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128>        21.7     21.7    33000000
BM_DivideIntrinsic128UniformDivisor<__int128>                 23.0     23.0    31000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>     21.9     21.9    33000000
BM_RemainderIntrinsic128UniformDivisor<__int128>              23.9     23.9    30000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>          32.7     32.6    23000000
BM_DivideIntrinsic128SmallDivisor<__int128>                   33.4     33.4    21000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>       31.1     31.1    22000000
BM_RemainderIntrinsic128SmallDivisor<__int128>                33.2     33.2    22000000
```
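The branch-free loop mentioned above works identically at any width; here is a self-contained 64-bit rendition of the same mask trick (a sketch for exposition, assuming both operands are nonzero; the patch applies it to 128-bit values):
```
#include <stdint.h>
#include <stdio.h>

static uint64_t udiv_shift_subtract(uint64_t dividend, uint64_t divisor,
                                    uint64_t *rem) {
  uint64_t quotient = 0;
  // Align the divisor's highest set bit with the dividend's.
  int shift = __builtin_clzll(divisor) - __builtin_clzll(dividend);
  if (shift < 0) { // dividend < divisor: quotient is zero
    *rem = dividend;
    return 0;
  }
  divisor <<= shift;
  for (; shift >= 0; --shift) {
    quotient <<= 1;
    // (divisor - dividend - 1) is negative exactly when
    // dividend >= divisor; the arithmetic shift smears its sign bit
    // into an all-zeros or all-ones mask, avoiding a branch.
    const int64_t s = (int64_t)(divisor - dividend - 1) >> 63;
    quotient |= s & 1;
    dividend -= divisor & s;
    divisor >>= 1;
  }
  *rem = dividend;
  return quotient;
}

int main(void) {
  uint64_t r;
  uint64_t q = udiv_shift_subtract(1000003, 97, &r);
  printf("%llu %llu\n", (unsigned long long)q, (unsigned long long)r);
  // Prints "10309 30", matching 1000003 / 97 and 1000003 % 97.
  return 0;
}
```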
My email: danilak at google.com; I don't have commit rights.
Reviewers: howard.hinnant, courbet, MaskRay
Reviewed By: courbet
Subscribers: steven.zhang, #sanitizers
Tags: #sanitizers
Differential Revision: https://reviews.llvm.org/D81809
Added:
Modified:
compiler-rt/lib/builtins/udivmodti4.c
Removed:
################################################################################
diff --git a/compiler-rt/lib/builtins/udivmodti4.c b/compiler-rt/lib/builtins/udivmodti4.c
index dd14a8b579ca..55def37c9e1f 100644
--- a/compiler-rt/lib/builtins/udivmodti4.c
+++ b/compiler-rt/lib/builtins/udivmodti4.c
@@ -14,182 +14,145 @@
#ifdef CRT_HAS_128BIT
+// Returns the quotient of the 128-bit value (u1:u0) divided by the 64-bit
+// value v; the quotient must fit in 64 bits. The remainder is stored in *r.
+// Taken and adjusted from libdivide's libdivide_128_div_64_to_64 division
+// fallback. For a correctness proof see the reference for this algorithm
+// in Knuth, Volume 2, Section 4.3.1, Algorithm D.
+UNUSED
+static inline du_int udiv128by64to64default(du_int u1, du_int u0, du_int v,
+ du_int *r) {
+ const unsigned n_udword_bits = sizeof(du_int) * CHAR_BIT;
+ const du_int b = (1ULL << (n_udword_bits / 2)); // Number base (32 bits)
+ du_int un1, un0; // Norm. dividend LSD's
+ du_int vn1, vn0; // Norm. divisor digits
+ du_int q1, q0; // Quotient digits
+ du_int un64, un21, un10; // Dividend digit pairs
+ du_int rhat; // A remainder
+ si_int s; // Shift amount for normalization
+
+ s = __builtin_clzll(v);
+ if (s > 0) {
+ // Normalize the divisor.
+ v = v << s;
+ un64 = (u1 << s) | (u0 >> (n_udword_bits - s));
+ un10 = u0 << s; // Shift dividend left
+ } else {
+ // Avoid undefined behavior of (u0 >> 64).
+ un64 = u1;
+ un10 = u0;
+ }
+
+ // Break divisor up into two 32-bit digits.
+ vn1 = v >> (n_udword_bits / 2);
+ vn0 = v & 0xFFFFFFFF;
+
+ // Break right half of dividend into two digits.
+ un1 = un10 >> (n_udword_bits / 2);
+ un0 = un10 & 0xFFFFFFFF;
+
+ // Compute the first quotient digit, q1.
+ q1 = un64 / vn1;
+ rhat = un64 - q1 * vn1;
+
+ // q1 has at most error 2. No more than 2 iterations.
+ while (q1 >= b || q1 * vn0 > b * rhat + un1) {
+ q1 = q1 - 1;
+ rhat = rhat + vn1;
+ if (rhat >= b)
+ break;
+ }
+
+ un21 = un64 * b + un1 - q1 * v;
+
+ // Compute the second quotient digit.
+ q0 = un21 / vn1;
+ rhat = un21 - q0 * vn1;
+
+ // q0 has at most error 2. No more than 2 iterations.
+ while (q0 >= b || q0 * vn0 > b * rhat + un0) {
+ q0 = q0 - 1;
+ rhat = rhat + vn1;
+ if (rhat >= b)
+ break;
+ }
+
+ *r = (un21 * b + un0 - q0 * v) >> s;
+ return q1 * b + q0;
+}
+
+static inline du_int udiv128by64to64(du_int u1, du_int u0, du_int v,
+ du_int *r) {
+#if defined(__x86_64__)
+ du_int result;
+ __asm__("divq %[v]"
+ : "=a"(result), "=d"(*r)
+ : [ v ] "r"(v), "a"(u0), "d"(u1));
+ return result;
+#else
+ return udiv128by64to64default(u1, u0, v, r);
+#endif
+}
+
// Effects: if rem != 0, *rem = a % b
// Returns: a / b
-// Translated from Figure 3-40 of The PowerPC Compiler Writer's Guide
-
COMPILER_RT_ABI tu_int __udivmodti4(tu_int a, tu_int b, tu_int *rem) {
- const unsigned n_udword_bits = sizeof(du_int) * CHAR_BIT;
const unsigned n_utword_bits = sizeof(tu_int) * CHAR_BIT;
- utwords n;
- n.all = a;
- utwords d;
- d.all = b;
- utwords q;
- utwords r;
- unsigned sr;
- // special cases, X is unknown, K != 0
- if (n.s.high == 0) {
- if (d.s.high == 0) {
- // 0 X
- // ---
- // 0 X
- if (rem)
- *rem = n.s.low % d.s.low;
- return n.s.low / d.s.low;
- }
- // 0 X
- // ---
- // K X
+ utwords dividend;
+ dividend.all = a;
+ utwords divisor;
+ divisor.all = b;
+ utwords quotient;
+ utwords remainder;
+ if (divisor.all > dividend.all) {
if (rem)
- *rem = n.s.low;
+ *rem = dividend.all;
return 0;
}
- // n.s.high != 0
- if (d.s.low == 0) {
- if (d.s.high == 0) {
- // K X
- // ---
- // 0 0
- if (rem)
- *rem = n.s.high % d.s.low;
- return n.s.high / d.s.low;
- }
- // d.s.high != 0
- if (n.s.low == 0) {
- // K 0
- // ---
- // K 0
- if (rem) {
- r.s.high = n.s.high % d.s.high;
- r.s.low = 0;
- *rem = r.all;
- }
- return n.s.high / d.s.high;
- }
- // K K
- // ---
- // K 0
- if ((d.s.high & (d.s.high - 1)) == 0) /* if d is a power of 2 */ {
- if (rem) {
- r.s.low = n.s.low;
- r.s.high = n.s.high & (d.s.high - 1);
- *rem = r.all;
- }
- return n.s.high >> __builtin_ctzll(d.s.high);
- }
- // K K
- // ---
- // K 0
- sr = __builtin_clzll(d.s.high) - __builtin_clzll(n.s.high);
- // 0 <= sr <= n_udword_bits - 2 or sr large
- if (sr > n_udword_bits - 2) {
- if (rem)
- *rem = n.all;
- return 0;
- }
- ++sr;
- // 1 <= sr <= n_udword_bits - 1
- // q.all = n.all << (n_utword_bits - sr);
- q.s.low = 0;
- q.s.high = n.s.low << (n_udword_bits - sr);
- // r.all = n.all >> sr;
- r.s.high = n.s.high >> sr;
- r.s.low = (n.s.high << (n_udword_bits - sr)) | (n.s.low >> sr);
- } else /* d.s.low != 0 */ {
- if (d.s.high == 0) {
- // K X
- // ---
- // 0 K
- if ((d.s.low & (d.s.low - 1)) == 0) /* if d is a power of 2 */ {
- if (rem)
- *rem = n.s.low & (d.s.low - 1);
- if (d.s.low == 1)
- return n.all;
- sr = __builtin_ctzll(d.s.low);
- q.s.high = n.s.high >> sr;
- q.s.low = (n.s.high << (n_udword_bits - sr)) | (n.s.low >> sr);
- return q.all;
- }
- // K X
- // ---
- // 0 K
- sr = 1 + n_udword_bits + __builtin_clzll(d.s.low) -
- __builtin_clzll(n.s.high);
- // 2 <= sr <= n_utword_bits - 1
- // q.all = n.all << (n_utword_bits - sr);
- // r.all = n.all >> sr;
- if (sr == n_udword_bits) {
- q.s.low = 0;
- q.s.high = n.s.low;
- r.s.high = 0;
- r.s.low = n.s.high;
- } else if (sr < n_udword_bits) /* 2 <= sr <= n_udword_bits - 1 */ {
- q.s.low = 0;
- q.s.high = n.s.low << (n_udword_bits - sr);
- r.s.high = n.s.high >> sr;
- r.s.low = (n.s.high << (n_udword_bits - sr)) | (n.s.low >> sr);
- } else /* n_udword_bits + 1 <= sr <= n_utword_bits - 1 */ {
- q.s.low = n.s.low << (n_utword_bits - sr);
- q.s.high = (n.s.high << (n_utword_bits - sr)) |
- (n.s.low >> (sr - n_udword_bits));
- r.s.high = 0;
- r.s.low = n.s.high >> (sr - n_udword_bits);
- }
+ // When the divisor fits in 64 bits, we can use an optimized path.
+ if (divisor.s.high == 0) {
+ remainder.s.high = 0;
+ if (dividend.s.high < divisor.s.low) {
+ // The result fits in 64 bits.
+ quotient.s.low = udiv128by64to64(dividend.s.high, dividend.s.low,
+ divisor.s.low, &remainder.s.low);
+ quotient.s.high = 0;
} else {
- // K X
- // ---
- // K K
- sr = __builtin_clzll(d.s.high) - __builtin_clzll(n.s.high);
- // 0 <= sr <= n_udword_bits - 1 or sr large
- if (sr > n_udword_bits - 1) {
- if (rem)
- *rem = n.all;
- return 0;
- }
- ++sr;
- // 1 <= sr <= n_udword_bits
- // q.all = n.all << (n_utword_bits - sr);
- // r.all = n.all >> sr;
- q.s.low = 0;
- if (sr == n_udword_bits) {
- q.s.high = n.s.low;
- r.s.high = 0;
- r.s.low = n.s.high;
- } else {
- r.s.high = n.s.high >> sr;
- r.s.low = (n.s.high << (n_udword_bits - sr)) | (n.s.low >> sr);
- q.s.high = n.s.low << (n_udword_bits - sr);
- }
+ // First, divide the high part of the dividend by the divisor, keeping the
+ // remainder in dividend.s.high. After that, dividend.s.high < divisor.s.low.
+ quotient.s.high = dividend.s.high / divisor.s.low;
+ dividend.s.high = dividend.s.high % divisor.s.low;
+ quotient.s.low = udiv128by64to64(dividend.s.high, dividend.s.low,
+ divisor.s.low, &remainder.s.low);
}
+ if (rem)
+ *rem = remainder.all;
+ return quotient.all;
}
- // Not a special case
- // q and r are initialized with:
- // q.all = n.all << (n_utword_bits - sr);
- // r.all = n.all >> sr;
- // 1 <= sr <= n_utword_bits - 1
- su_int carry = 0;
- for (; sr > 0; --sr) {
- // r:q = ((r:q) << 1) | carry
- r.s.high = (r.s.high << 1) | (r.s.low >> (n_udword_bits - 1));
- r.s.low = (r.s.low << 1) | (q.s.high >> (n_udword_bits - 1));
- q.s.high = (q.s.high << 1) | (q.s.low >> (n_udword_bits - 1));
- q.s.low = (q.s.low << 1) | carry;
- // carry = 0;
- // if (r.all >= d.all)
+ // 0 <= shift <= 63.
+ si_int shift =
+ __builtin_clzll(divisor.s.high) - __builtin_clzll(dividend.s.high);
+ divisor.all <<= shift;
+ quotient.s.high = 0;
+ quotient.s.low = 0;
+ for (; shift >= 0; --shift) {
+ quotient.s.low <<= 1;
+ // Branch-free version of:
+ // if (dividend.all >= divisor.all)
// {
- // r.all -= d.all;
- // carry = 1;
+ // dividend.all -= divisor.all;
+ // carry = 1;
// }
- const ti_int s = (ti_int)(d.all - r.all - 1) >> (n_utword_bits - 1);
- carry = s & 1;
- r.all -= d.all & s;
+ const ti_int s =
+ (ti_int)(divisor.all - dividend.all - 1) >> (n_utword_bits - 1);
+ quotient.s.low |= s & 1;
+ dividend.all -= divisor.all & s;
+ divisor.all >>= 1;
}
- q.all = (q.all << 1) | carry;
if (rem)
- *rem = r.all;
- return q.all;
+ *rem = dividend.all;
+ return quotient.all;
}
#endif // CRT_HAS_128BIT
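For completeness, the builtin can also be exercised directly, in the style of compiler-rt's own unit tests (compiler-rt/test/builtins/Unit/udivmodti4_test.c declares the prototype the same way). A sketch, assuming a toolchain that links compiler-rt or another runtime providing the symbol with the default calling convention:
```
#include <stdio.h>

typedef unsigned __int128 tu_int;

// Provided by the runtime library; declared manually, as the unit tests do.
extern tu_int __udivmodti4(tu_int a, tu_int b, tu_int *rem);

int main(void) {
  tu_int rem;
  tu_int q = __udivmodti4(((tu_int)1 << 100) | 7, 1000000007u, &rem);
  printf("q (low 64 bits): %llu, rem: %llu\n", (unsigned long long)q,
         (unsigned long long)rem);
  return 0;
}
```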