[all-commits] [llvm/llvm-project] 68c011: [builtins] Optimize udivmodti4 for many platforms.

Danila Kutenin via All-commits all-commits at lists.llvm.org
Fri Jul 10 01:04:01 PDT 2020


  Branch: refs/heads/master
  Home:   https://github.com/llvm/llvm-project
  Commit: 68c011aa085ab8ec198198e45c83de605a7dc31f
      https://github.com/llvm/llvm-project/commit/68c011aa085ab8ec198198e45c83de605a7dc31f
  Author: Danila Kutenin <kutdanila at yandex.ru>
  Date:   2020-07-10 (Fri, 10 Jul 2020)

  Changed paths:
    M compiler-rt/lib/builtins/udivmodti4.c

  Log Message:
  -----------
  [builtins] Optimize udivmodti4 for many platforms.

Summary:
While benchmarking uint128 division we found out that it has huge latency for small divisors

https://reviews.llvm.org/D83027

```
Benchmark                                                   Time(ns)        CPU(ns)     Iterations
--------------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128>            13.0           13.0     55000000
BM_DivideIntrinsic128UniformDivisor<__int128>                     14.3           14.3     50000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>         13.5           13.5     52000000
BM_RemainderIntrinsic128UniformDivisor<__int128>                  14.1           14.1     50000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>             153            153        5000000
BM_DivideIntrinsic128SmallDivisor<__int128>                      170            170        3000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>          153            153        5000000
BM_RemainderIntrinsic128SmallDivisor<__int128>                   155            155        5000000
```

This patch suggests a more optimized version of the division:

If the divisor is 64 bit, we can proceed with the divq instruction on x86 or constant multiplication mechanisms for other platforms. Once both divisor and dividend are not less than 2**64, we use branch free subtract algorithm, it has at most 64 cycles. After that our benchmarks improved significantly

```
Benchmark                                                   Time(ns)        CPU(ns)     Iterations
--------------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128>            11.0           11.0     64000000
BM_DivideIntrinsic128UniformDivisor<__int128>                     13.8           13.8     51000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>         11.6           11.6     61000000
BM_RemainderIntrinsic128UniformDivisor<__int128>                  13.7           13.7     52000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>              27.1           27.1     26000000
BM_DivideIntrinsic128SmallDivisor<__int128>                       29.4           29.4     24000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>           27.9           27.8     26000000
BM_RemainderIntrinsic128SmallDivisor<__int128>                    29.1           29.1     25000000
```

If not using divq instrinsics, it is still much better

```
Benchmark                                                   Time(ns)        CPU(ns)     Iterations
--------------------------------------------------------------------------------------------------
BM_DivideIntrinsic128UniformDivisor<unsigned __int128>            12.2           12.2     58000000
BM_DivideIntrinsic128UniformDivisor<__int128>                     13.5           13.5     52000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>         12.7           12.7     56000000
BM_RemainderIntrinsic128UniformDivisor<__int128>                  13.7           13.7     51000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>              30.2           30.2     24000000
BM_DivideIntrinsic128SmallDivisor<__int128>                       33.2           33.2     22000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>           31.4           31.4     23000000
BM_RemainderIntrinsic128SmallDivisor<__int128>                    33.8           33.8     21000000
```

PowerPC benchmarks:

Was
```
BM_DivideIntrinsic128UniformDivisor<unsigned __int128>            22.3           22.3     32000000
BM_DivideIntrinsic128UniformDivisor<__int128>                     23.8           23.8     30000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>         22.5           22.5     32000000
BM_RemainderIntrinsic128UniformDivisor<__int128>                  24.9           24.9     29000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>             394            394        2000000
BM_DivideIntrinsic128SmallDivisor<__int128>                      397            397        2000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>          399            399        2000000
BM_RemainderIntrinsic128SmallDivisor<__int128>                   397            397        2000000
```

With this patch
```
BM_DivideIntrinsic128UniformDivisor<unsigned __int128>            21.7           21.7     33000000
BM_DivideIntrinsic128UniformDivisor<__int128>                     23.0           23.0     31000000
BM_RemainderIntrinsic128UniformDivisor<unsigned __int128>         21.9           21.9     33000000
BM_RemainderIntrinsic128UniformDivisor<__int128>                  23.9           23.9     30000000
BM_DivideIntrinsic128SmallDivisor<unsigned __int128>              32.7           32.6     23000000
BM_DivideIntrinsic128SmallDivisor<__int128>                       33.4           33.4     21000000
BM_RemainderIntrinsic128SmallDivisor<unsigned __int128>           31.1           31.1     22000000
BM_RemainderIntrinsic128SmallDivisor<__int128>                    33.2           33.2     22000000
```

My email: danilak at google.com, I don't have commit rights

Reviewers: howard.hinnant, courbet, MaskRay

Reviewed By: courbet

Subscribers: steven.zhang, #sanitizers

Tags: #sanitizers

Differential Revision: https://reviews.llvm.org/D81809




More information about the All-commits mailing list