[all-commits] [llvm/llvm-project] 68c011: [builtins] Optimize udivmodti4 for many platforms.
Danila Kutenin via All-commits
all-commits at lists.llvm.org
Fri Jul 10 01:04:01 PDT 2020
Branch: refs/heads/master
Home: https://github.com/llvm/llvm-project
Commit: 68c011aa085ab8ec198198e45c83de605a7dc31f
https://github.com/llvm/llvm-project/commit/68c011aa085ab8ec198198e45c83de605a7dc31f
Author: Danila Kutenin <kutdanila at yandex.ru>
Date: 2020-07-10 (Fri, 10 Jul 2020)
Changed paths:
M compiler-rt/lib/builtins/udivmodti4.c
Log Message:
-----------
[builtins] Optimize udivmodti4 for many platforms.
Summary:
While benchmarking uint128 division, we found that it has very high latency for small divisors: roughly 150 ns per operation versus 13-14 ns for uniform divisors.
https://reviews.llvm.org/D83027
```
Benchmark Time(ns) CPU(ns) Iterations
--------------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128> 13.0 13.0 55000000
BM_DivideIntrinsic128UniformDivisor<__int128> 14.3 14.3 50000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128> 13.5 13.5 52000000
BM_RemainderIntrinsic128UniformDivisor<__int128> 14.1 14.1 50000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128> 153 153 5000000
BM_DivideIntrinsic128SmallDivisor<__int128> 170 170 3000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128> 153 153 5000000
BM_RemainderIntrinsic128SmallDivisor<__int128> 155 155 5000000
```
This patch suggests a more optimized version of the division:
If the divisor fits in 64 bits, we can use the divq instruction on x86 or constant multiplication mechanisms on other platforms. When both the divisor and the dividend are at least 2**64, we use a branch-free subtract algorithm, which takes at most 64 iterations. With this change our benchmarks improved significantly:
```
Benchmark Time(ns) CPU(ns) Iterations
--------------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128> 11.0 11.0 64000000
BM_DivideIntrinsic128UniformDivisor<__int128> 13.8 13.8 51000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128> 11.6 11.6 61000000
BM_RemainderIntrinsic128UniformDivisor<__int128> 13.7 13.7 52000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128> 27.1 27.1 26000000
BM_DivideIntrinsic128SmallDivisor<__int128> 29.4 29.4 24000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128> 27.9 27.8 26000000
BM_RemainderIntrinsic128SmallDivisor<__int128> 29.1 29.1 25000000
```
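The branch-free subtract step for the case where both operands are at least 2**64 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: the name udivmod128_large and the typedefs are hypothetical, and the sketch relies on GCC/Clang behaviour (the __int128 types, __builtin_clzll, and arithmetic right shift of negative signed values).

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128; /* hypothetical shorthand */
typedef __int128 i128;          /* hypothetical shorthand */

/* Hypothetical helper: quotient and remainder when divisor >= 2**64.
   Assumes GCC/Clang semantics for signed shifts and conversions. */
static u128 udivmod128_large(u128 dividend, u128 divisor, u128 *rem) {
  assert((uint64_t)(divisor >> 64) != 0);
  if (dividend < divisor) {
    *rem = dividend;
    return 0;
  }
  /* Both high halves are nonzero here, so 0 <= shift <= 63. */
  int shift = __builtin_clzll((uint64_t)(divisor >> 64)) -
              __builtin_clzll((uint64_t)(dividend >> 64));
  divisor <<= shift; /* align the leading bits of divisor and dividend */
  uint64_t quotient = 0;
  for (; shift >= 0; --shift) {
    quotient <<= 1;
    /* mask is all ones when dividend >= divisor, otherwise zero.
       It replaces: if (dividend >= divisor) { dividend -= divisor; bit = 1; } */
    const i128 mask = (i128)(divisor - dividend - 1) >> 127;
    quotient |= (uint64_t)mask & 1;
    dividend -= divisor & (u128)mask;
    divisor >>= 1;
  }
  *rem = dividend;
  return quotient;
}
```

The loop body contains no data-dependent branch, and because the shift is at most 63 the loop runs at most 64 times. The quotient always fits in 64 bits in this case, since the divisor is at least 2**64.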
Without the divq intrinsics, the results are still much better:
```
Benchmark Time(ns) CPU(ns) Iterations
--------------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128> 12.2 12.2 58000000
BM_DivideIntrinsic128UniformDivisor<__int128> 13.5 13.5 52000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128> 12.7 12.7 56000000
BM_RemainderIntrinsic128UniformDivisor<__int128> 13.7 13.7 51000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128> 30.2 30.2 24000000
BM_DivideIntrinsic128SmallDivisor<__int128> 33.2 33.2 22000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128> 31.4 31.4 23000000
BM_RemainderIntrinsic128SmallDivisor<__int128> 33.8 33.8 21000000
```
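For the 64-bit divisor case, one possible shape of the fast path is sketched below. The helper names div128by64 and udivmod128_small are hypothetical. On x86-64 the sketch issues divq through inline assembly; elsewhere it falls back to a plain bit-by-bit restoring division, which is a simple stand-in chosen for brevity and is not the fallback used by the patch.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128; /* hypothetical shorthand */

/* Hypothetical helper: divide (hi:lo) by d, assuming hi < d so that the
   quotient fits in 64 bits (divq faults otherwise). */
static uint64_t div128by64(uint64_t hi, uint64_t lo, uint64_t d,
                           uint64_t *rem) {
  assert(hi < d);
#if defined(__x86_64__)
  uint64_t q, r;
  /* divq divides rdx:rax by the operand; quotient in rax, remainder in rdx. */
  __asm__("divq %[d]" : "=a"(q), "=d"(r) : [d] "r"(d), "a"(lo), "d"(hi) : "cc");
  *rem = r;
  return q;
#else
  /* Stand-in fallback: restoring division, one quotient bit per step. */
  uint64_t q = 0;
  for (int i = 63; i >= 0; --i) {
    uint64_t carry = hi >> 63; /* bit lost by the shift below */
    hi = (hi << 1) | ((lo >> i) & 1);
    q <<= 1;
    if (carry || hi >= d) {
      hi -= d; /* wraps correctly when carry is set */
      q |= 1;
    }
  }
  *rem = hi;
  return q;
#endif
}

/* Hypothetical helper: 128-bit dividend, nonzero 64-bit divisor. */
static u128 udivmod128_small(u128 n, uint64_t d, uint64_t *rem) {
  assert(d != 0);
  uint64_t hi = (uint64_t)(n >> 64), lo = (uint64_t)n;
  uint64_t qhi = 0;
  if (hi >= d) { /* reduce the high half first so that hi < d */
    qhi = hi / d;
    hi %= d;
  }
  uint64_t qlo = div128by64(hi, lo, d, rem);
  return ((u128)qhi << 64) | qlo;
}
```

Reducing the high half first guarantees the precondition of the 128-by-64 step, so the whole division costs at most one 64-bit division plus one 128-by-64 division.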
PowerPC benchmarks:
Before this patch:
```
BM_DivideIntrinsic128UniformDivisor<unsigned __int128> 22.3 22.3 32000000
BM_DivideIntrinsic128UniformDivisor<__int128> 23.8 23.8 30000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128> 22.5 22.5 32000000
BM_RemainderIntrinsic128UniformDivisor<__int128> 24.9 24.9 29000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128> 394 394 2000000
BM_DivideIntrinsic128SmallDivisor<__int128> 397 397 2000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128> 399 399 2000000
BM_RemainderIntrinsic128SmallDivisor<__int128> 397 397 2000000
```
With this patch:
```
BM_DivideIntrinsic128UniformDivisor<unsigned __int128> 21.7 21.7 33000000
BM_DivideIntrinsic128UniformDivisor<__int128> 23.0 23.0 31000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128> 21.9 21.9 33000000
BM_RemainderIntrinsic128UniformDivisor<__int128> 23.9 23.9 30000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128> 32.7 32.6 23000000
BM_DivideIntrinsic128SmallDivisor<__int128> 33.4 33.4 21000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128> 31.1 31.1 22000000
BM_RemainderIntrinsic128SmallDivisor<__int128> 33.2 33.2 22000000
```
My email: danilak at google.com, I don't have commit rights
Reviewers: howard.hinnant, courbet, MaskRay
Reviewed By: courbet
Subscribers: steven.zhang, #sanitizers
Tags: #sanitizers
Differential Revision: https://reviews.llvm.org/D81809