[libcxx-commits] [PATCH] D63063: Bug 42208: speeding up std::merge

Denis Yaroshevskiy via Phabricator via libcxx-commits libcxx-commits at lists.llvm.org
Sun Jun 9 13:01:34 PDT 2019


dyaroshev created this revision.
Herald added subscribers: libcxx-commits, ldionne, dmgreen, mgrang.

Benchmarking results:

Before:

  ----------------------------------------------------------------------------------------
  Benchmark                                             Time             CPU   Iterations
  ----------------------------------------------------------------------------------------
  MergeBench_MergeAlg_TestInt32/512                  1848 ns         1846 ns       378311
  MergeBench_MergeAlg_TestInt32/2048                27005 ns        26989 ns        25576
  MergeBench_MergeAlg_TestInt32/2097152          33327173 ns     33314619 ns           21
  MergeBench_MergeAlg_TestInt64/512                  1702 ns         1701 ns       407486
  MergeBench_MergeAlg_TestInt64/2048                26351 ns        26340 ns        26949
  MergeBench_MergeAlg_TestInt64/2097152          34492582 ns     34484900 ns           20
  MergeBench_MergeAlg_TestUint32/512                 1755 ns         1755 ns       379830
  MergeBench_MergeAlg_TestUint32/2048               25971 ns        25963 ns        26655
  MergeBench_MergeAlg_TestUint32/2097152         37003864 ns     35490619 ns           21
  MergeBench_MergeAlg_TestMediumString/512         234988 ns       234489 ns         2641
  MergeBench_MergeAlg_TestMediumString/2048       1120958 ns      1062598 ns          615
  MergeBench_MergeAlg_TestMediumString/2097152 2595634493 ns   2590478000 ns            1

After:

  ----------------------------------------------------------------------------------------
  Benchmark                                             Time             CPU   Iterations
  ----------------------------------------------------------------------------------------
  MergeBench_MergeAlg_TestInt32/512                  1396 ns         1395 ns       485652  (25% speedup)
  MergeBench_MergeAlg_TestInt32/2048                15691 ns        15682 ns        43672  (42% speedup)
  MergeBench_MergeAlg_TestInt32/2097152          30340879 ns     30329130 ns           23  (9% speedup)
  MergeBench_MergeAlg_TestInt64/512                  1567 ns         1566 ns       440998  (9% speedup)
  MergeBench_MergeAlg_TestInt64/2048                25090 ns        25076 ns        27286  (5% speedup)
  MergeBench_MergeAlg_TestInt64/2097152          32398209 ns     32394000 ns           22  (7% speedup)
  MergeBench_MergeAlg_TestUint32/512                 1366 ns         1366 ns       507957  (23% speedup)
  MergeBench_MergeAlg_TestUint32/2048               15713 ns        15706 ns        43127  (40% speedup)
  MergeBench_MergeAlg_TestUint32/2097152         30373730 ns     30366261 ns           23  (18% speedup)
  MergeBench_MergeAlg_TestMediumString/512         213092 ns       212974 ns         3253  (10% speedup)
  MergeBench_MergeAlg_TestMediumString/2048        879484 ns       879021 ns          752  (22% speedup)
  MergeBench_MergeAlg_TestMediumString/2097152 2156054708 ns   2155483000 ns            1  (17% speedup)

There are two issues with current implementation of std::merge:
https://github.com/llvm-mirror/libcxx/blob/1f60111b597e5cb80a4513ec86f79b7e137f7793/include/algorithm#L4353

1. The algorithm does two checks for boundary on every iteration, even though we only move one of the iterators
2. If one of the checks for left boundary is unrolled we get better loop structures for both 1 and 2 ranges being bigger.

The speed up for the 1 range dominating on some measurements I did gets up to 1.7 times, while the 2 - about 1.4

If you want to play with algorithms/parameters - you can do that on quick-bench.
Watch out for code alignment issues!! - unfortunately including all three benchmarks
in the binary will result in incorrect result.
Link: http://quick-bench.com/kWbYdPDFnrovXWnuF6xw5wK27B8

Binary size increase (godbolt: https://godbolt.org/z/b1ZFTA):
For std::string the size grows from 394 assembly instructions to 465 instructions (18%).
For int - from 62 to 64 (3%).
Considering other places in libcxx that specialize algorithms for specific sizes seems acceptable.
(

  for example sort: https://github.com/llvm-mirror/libcxx/blob/1f60111b597e5cb80a4513ec86f79b7e137f7793/include/algorithm#L3703
  rotate: https://github.com/llvm-mirror/libcxx/blob/1f60111b597e5cb80a4513ec86f79b7e137f7793/include/algorithm#L2388
  (if I understand the rotate algorithm correctly)

)

Potential followups:
std::stable_sort, std::inplace_merge are relying on merge - but reimplement it currently from scratch.
This could be a useful improvement.
std::set_union/std::set_difference have some similar problems to std::merge and could be improved in a similar maner.


Repository:
  rCXX libc++

https://reviews.llvm.org/D63063

Files:
  benchmarks/algorithms.merge.bench.cpp
  include/algorithm

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D63063.203750.patch
Type: text/x-patch
Size: 5215 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/libcxx-commits/attachments/20190609/f5a9097f/attachment.bin>


More information about the libcxx-commits mailing list