[PATCH] D69295: Optimize SHA1 implementation

Mon Oct 21 20:50:24 PDT 2019

terrelln created this revision.
terrelln added reviewers: ruiu, MaskRay.
Herald added subscribers: hiraditya, mgorny.
Herald added a project: LLVM.
terrelln edited the summary of this revision.

- Add `inline` to the helper functions because gcc-9 won't inline all of them without the hint. I've avoided `__attribute__((always_inline))` because gcc and clang both inline the helpers without it, and leaving it out improves compatibility.
- Replace the byte-by-byte copy in update() with endian::read32be(), since perf reports that half of the time is spent copying into the buffer before this patch.
- Add a hash-benchmark to measure the performance improvement.
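
To illustrate the second bullet, here is a minimal standalone sketch of the two ways to fill a 64-byte SHA1 block. It is not the code from the patch: `readBE32`, `fillBlockBytewise`, `fillBlockWordwise`, and `fillsAgree` are hypothetical names that stand in for LLVM's endian helper and the buffer handling in SHA1.cpp, and the slow path only approximates what a per-byte copy does.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not LLVM's implementation): assemble a 32-bit
// big-endian value from four bytes. Works for unaligned input.
inline uint32_t readBE32(const uint8_t *P) {
  return (uint32_t(P[0]) << 24) | (uint32_t(P[1]) << 16) |
         (uint32_t(P[2]) << 8) | uint32_t(P[3]);
}

constexpr size_t BlockLength = 64;             // SHA1 block size in bytes
constexpr size_t BlockWords = BlockLength / 4; // 16 words per block

// Slow path: place every input byte individually into its word.
inline void fillBlockBytewise(const uint8_t *Data, uint32_t (&W)[BlockWords]) {
  for (size_t I = 0; I < BlockWords; ++I)
    W[I] = 0;
  for (size_t I = 0; I < BlockLength; ++I)
    W[I / 4] |= uint32_t(Data[I]) << (8 * (3 - I % 4));
}

// Fast path: one big-endian word read per four input bytes.
inline void fillBlockWordwise(const uint8_t *Data, uint32_t (&W)[BlockWords]) {
  for (size_t I = 0; I < BlockWords; ++I)
    W[I] = readBE32(Data + 4 * I);
}

// Both strategies must produce the same 16 words for the same 64 bytes.
inline bool fillsAgree(const uint8_t *Data) {
  uint32_t A[BlockWords], B[BlockWords];
  fillBlockBytewise(Data, A);
  fillBlockWordwise(Data, B);
  return std::memcmp(A, B, sizeof(A)) == 0;
}
```

The word-wise loop does one sixteenth as many iterations per block, and compilers typically lower the shift-and-or pattern to a single load plus a byte swap on little-endian targets.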

When lld uses --build-id=sha1 it spends 30-45% of its CPU time in SHA1, depending on the binary (this is CPU time, not wall-clock time, since the hashing is parallel). This patch speeds up SHA1 by a factor of 2 when compiled with clang-8 and a factor of 3 when compiled with gcc-6. This leads to a >10% improvement in overall linking time on most inputs.

Unit tests
==========

  ninja check-llvm

LLD speed
=========

lld-speed-test benchmarks run on an Intel i9-9900K with Turbo disabled on CPU 0, compiled with clang-9. Stats recorded with `perf stat -r 5`. All inputs are linked with `--build-id=sha1`.

| Input           | Before (seconds) | After (seconds) |
| --------------- | ---------------- | --------------- |
| chrome          | 2.14             | 1.82 (-15%)     |
| chrome-icf      | 2.56             | 2.29 (-10%)     |
| clang           | 0.65             | 0.53 (-18%)     |
| clang-fsds      | 0.69             | 0.58 (-16%)     |
| clang-gdb-index | 21.71            | 19.3 (-11%)     |
| gold            | 0.42             | 0.34 (-19%)     |
| gold-fsds       | 0.431            | 0.355 (-17%)    |
| linux-kernel    | 0.625            | 0.575 (-8%)     |
| llvm-as         | 0.045            | 0.039 (-14%)    |
| llvm-as-fsds    | 0.035            | 0.039 (-11%)    |
| mozilla         | 11.3             | 9.8 (-13%)      |
| mozilla-gc      | 11.84            | 10.36 (-12%)    |
| mozilla-O0      | 8.2              | 5.84 (-28%)     |
| scylla          | 5.59             | 4.52 (-19%)     |

Microbenchmarks
===============

Compiled with clang-8:

Before:

  2019-10-16 11:33:41
  Running ./benchmarks/hash-benchmark/hash-benchmark
  Run on (24 X 2394.48 MHz CPU s)
  CPU Caches:
    L1 Data 32K (x24)
    L1 Instruction 32K (x24)
    L2 Unified 4096K (x24)
    L3 Unified 16384K (x24)
  -----------------------------------------------------------
  Benchmark                    Time           CPU Iterations
  -----------------------------------------------------------
  BM_SHA1/1024              5146 ns       5145 ns     137203
  BM_SHA1/4096             20043 ns      20040 ns      32644
  BM_SHA1/32768           154810 ns     154803 ns       4401
  BM_SHA1/262144         1281332 ns    1281244 ns        555
  BM_SHA1/1048576        5154688 ns    5154100 ns        137

After:

  2019-10-16 11:34:20
  Running ./benchmarks/hash-benchmark/hash-benchmark
  Run on (24 X 2394.48 MHz CPU s)
  CPU Caches:
    L1 Data 32K (x24)
    L1 Instruction 32K (x24)
    L2 Unified 4096K (x24)
    L3 Unified 16384K (x24)
  -----------------------------------------------------------
  Benchmark                    Time           CPU Iterations
  -----------------------------------------------------------
  BM_SHA1/1024              3071 ns       3070 ns     241890
  BM_SHA1/4096             10491 ns      10491 ns      64873
  BM_SHA1/32768            82802 ns      82791 ns       8533
  BM_SHA1/262144          685598 ns     685595 ns       1069
  BM_SHA1/1048576        2593819 ns    2593495 ns        265

Compiled with gcc-6:

Before:

  2019-10-16 11:36:05
  Running ./benchmarks/hash-benchmark/hash-benchmark
  Run on (24 X 2394.48 MHz CPU s)
  CPU Caches:
    L1 Data 32K (x24)
    L1 Instruction 32K (x24)
    L2 Unified 4096K (x24)
    L3 Unified 16384K (x24)
  -----------------------------------------------------------
  Benchmark                    Time           CPU Iterations
  -----------------------------------------------------------
  BM_SHA1/1024              8770 ns       8769 ns      80651
  BM_SHA1/4096             34161 ns      34159 ns      20583
  BM_SHA1/32768           271183 ns     271154 ns       2565
  BM_SHA1/262144         2140979 ns    2140434 ns        332
  BM_SHA1/1048576        8376018 ns    8374622 ns         83

After:

  2019-10-16 11:34:58
  Running ./benchmarks/hash-benchmark/hash-benchmark
  Run on (24 X 2394.48 MHz CPU s)
  CPU Caches:
    L1 Data 32K (x24)
    L1 Instruction 32K (x24)
    L2 Unified 4096K (x24)
    L3 Unified 16384K (x24)
  -----------------------------------------------------------
  Benchmark                    Time           CPU Iterations
  -----------------------------------------------------------
  BM_SHA1/1024              2892 ns       2892 ns     254677
  BM_SHA1/4096             10300 ns      10299 ns      72058
  BM_SHA1/32768            82527 ns      82527 ns       8880
  BM_SHA1/262144          629433 ns     629358 ns       1080
  BM_SHA1/1048576        2669301 ns    2669137 ns        272
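
As a cross-check of the "factor of 2" and "factor of 3" figures, the sketch below recomputes the speedup from the CPU column of the BM_SHA1/1048576 rows above. The `Run` struct and the helper names are hypothetical and exist only for this calculation. The throughput helper assumes the benchmark argument is the input size in bytes, so 1048576 is treated as 1 MiB.

```cpp
#include <cstdio>

// CPU times in nanoseconds for BM_SHA1/1048576, copied from the runs above.
struct Run {
  const char *Compiler;
  double BeforeNs;
  double AfterNs;
};

constexpr Run Runs[] = {
    {"clang-8", 5154100.0, 2593495.0},
    {"gcc-6", 8374622.0, 2669137.0},
};

// Speedup is the ratio of the old time to the new time.
constexpr double speedup(const Run &R) { return R.BeforeNs / R.AfterNs; }

// Throughput in MiB/s when 1 MiB is hashed in Ns nanoseconds
// (assumes the benchmark argument is a byte count).
constexpr double mibPerSecond(double Ns) { return 1e9 / Ns; }

inline void printSummary() {
  for (const Run &R : Runs)
    std::printf("%s: %.2fx speedup, %.0f -> %.0f MiB/s\n", R.Compiler,
                speedup(R), mibPerSecond(R.BeforeNs),
                mibPerSecond(R.AfterNs));
}
```

The ratios come out at roughly 1.99 for clang-8 and 3.14 for gcc-6, which matches the rounded factors quoted in the summary.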

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D69295

Files:
  llvm/benchmarks/CMakeLists.txt
  llvm/benchmarks/hash-benchmark/CMakeLists.txt
  llvm/benchmarks/hash-benchmark/hash-benchmark.cpp
  llvm/lib/Support/SHA1.cpp
  llvm/unittests/Support/raw_sha1_ostream_test.cpp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D69295.225991.patch
Type: text/x-patch
Size: 7539 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20191022/09cd4a16/attachment.bin>