[libcxx-commits] [PATCH] D131395: [libc++] Implement `lexicographical_compare_three_way`
Adrian Vogelsgesang via Phabricator via libcxx-commits
libcxx-commits at lists.llvm.org
Sat Nov 19 17:30:24 PST 2022
avogelsgesang marked 3 inline comments as done.
avogelsgesang added a comment.
@var-const As requested, I updated the benchmark to benchmark `int*` on both the fast and the slow path.
Previously, you proposed
> Can you please redo the benchmark in a different way by doing some local (not to be committed) modifications? Rather than comparing the results for int* versus cpp17_input_iterator<int*>, remove the input iterator tests completely, run the int* benchmark against the current implementation, then remove the optimization and rerun the benchmark. That way, it would be a proper "apples-to-apples" comparison: int* vs. int* with or without the optimization. (I don't expect results to change much, but better be sure)
Instead, I now factored out the slow and fast path into two separate functions (as proposed in another comment) and call both of them directly from the benchmark. That way, we can also rerun the benchmark in the future, and e.g. remove the optimization from libc++ again, as soon as clang's optimizations are smart enough to figure this out by themselves. Do you agree with the updated benchmark?
The results on my local machine (compiled using clang commit `aae08b1d372a45b8bef95e86e5fe9110045eb78d`):
Running ./libcxx/benchmarks/lexicographical_compare_three_way.libcxx.out
Run on (40 X 800.102 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x20)
L1 Instruction 32 KiB (x20)
L2 Unified 1024 KiB (x20)
L3 Unified 14080 KiB (x2)
Load Average: 0.00, 0.48, 0.67
---------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------------------------------------------
BM_lexicographical_compare_three_way_slow_path/1 1.67 ns 1.67 ns 415947081
BM_lexicographical_compare_three_way_slow_path/4 4.68 ns 4.68 ns 149519461
BM_lexicographical_compare_three_way_slow_path/16 16.7 ns 16.7 ns 41824440
BM_lexicographical_compare_three_way_slow_path/64 64.9 ns 64.9 ns 10711142
BM_lexicographical_compare_three_way_slow_path/256 266 ns 266 ns 2630928
BM_lexicographical_compare_three_way_slow_path/1024 1036 ns 1036 ns 676167
BM_lexicographical_compare_three_way_slow_path/4096 4122 ns 4121 ns 169817
BM_lexicographical_compare_three_way_slow_path/16384 16453 ns 16453 ns 42559
BM_lexicographical_compare_three_way_slow_path/65536 65827 ns 65826 ns 10619
BM_lexicographical_compare_three_way_slow_path/262144 263147 ns 263140 ns 2658
BM_lexicographical_compare_three_way_slow_path/1048576 1060909 ns 1060813 ns 659
BM_lexicographical_compare_three_way_fast_path/1 2.34 ns 2.34 ns 298968436
BM_lexicographical_compare_three_way_fast_path/4 4.76 ns 4.76 ns 148425440
BM_lexicographical_compare_three_way_fast_path/16 10.1 ns 10.1 ns 68735401
BM_lexicographical_compare_three_way_fast_path/64 28.8 ns 28.8 ns 24262243
BM_lexicographical_compare_three_way_fast_path/256 123 ns 123 ns 5685144
BM_lexicographical_compare_three_way_fast_path/1024 444 ns 444 ns 1575722
BM_lexicographical_compare_three_way_fast_path/4096 1749 ns 1749 ns 400100
BM_lexicographical_compare_three_way_fast_path/16384 6898 ns 6897 ns 101189
BM_lexicographical_compare_three_way_fast_path/65536 28537 ns 28536 ns 24520
BM_lexicographical_compare_three_way_fast_path/262144 125793 ns 125791 ns 5503
BM_lexicographical_compare_three_way_fast_path/1048576 506020 ns 505976 ns 1379
BM_lexicographical_compare_three_way<int*>/1 1.67 ns 1.67 ns 418741799
BM_lexicographical_compare_three_way<int*>/4 3.68 ns 3.68 ns 190328759
BM_lexicographical_compare_three_way<int*>/16 8.70 ns 8.70 ns 80464881
BM_lexicographical_compare_three_way<int*>/64 28.9 ns 28.9 ns 24261824
BM_lexicographical_compare_three_way<int*>/256 125 ns 125 ns 5593129
BM_lexicographical_compare_three_way<int*>/1024 448 ns 448 ns 1562365
BM_lexicographical_compare_three_way<int*>/4096 1751 ns 1750 ns 399745
BM_lexicographical_compare_three_way<int*>/16384 6895 ns 6895 ns 101193
BM_lexicographical_compare_three_way<int*>/65536 28741 ns 28740 ns 24319
BM_lexicographical_compare_three_way<int*>/262144 125823 ns 125822 ns 5523
BM_lexicographical_compare_three_way<int*>/1048576 508282 ns 508215 ns 1385
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/1 2.68 ns 2.68 ns 261609292
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/4 5.37 ns 5.37 ns 132805153
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/16 12.4 ns 12.4 ns 56251434
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/64 44.5 ns 44.5 ns 15732328
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/256 182 ns 182 ns 3848123
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/1024 696 ns 696 ns 1003714
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/4096 2753 ns 2753 ns 254392
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/16384 10966 ns 10966 ns 63844
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/65536 43846 ns 43844 ns 15953
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/262144 175599 ns 175590 ns 3989
BM_lexicographical_compare_three_way<random_access_iterator<int*>>/1048576 705291 ns 705245 ns 991
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/1 1.67 ns 1.67 ns 418471062
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/4 4.68 ns 4.68 ns 149512136
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/16 16.7 ns 16.7 ns 41727011
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/64 65.0 ns 65.0 ns 10776363
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/256 266 ns 266 ns 2635392
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/1024 1035 ns 1035 ns 674817
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/4096 4121 ns 4121 ns 169881
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/16384 16461 ns 16461 ns 42545
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/65536 65776 ns 65774 ns 10630
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/262144 263308 ns 263295 ns 2659
BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/1048576 1060793 ns 1060722 ns 660
A couple of observations:
- The results for `{fast,slow}_path/1` vary from run to run. The slow_path is not faster than the fast_path consistently. I think this is primarily variance
- `fast_path/1048576` is twice as fast as `slow_path/1048576`
- The results for `three_way<int*>/1048576 ` and `three_way<int*>/1048576` are inline with the results for `{fast,slow}_path/1048576`. The dispatching based on whether we are passing in a random access iterator or not works as expected
- `three_way<random_access_iterator<int*>>/1048576` is less efficient than `three_way<int*>/1048576`. As shown below, the assembly code is exactly identical between the two, though. So this seems to be due to some microarchitectural shenanigans (maybe code alignment?)
- Compared to the previously posted results, `three_way<int*>` is now faster. I think compared to last time, I was just more lucky this time, and this improvement is thanks to the same microarchitectural shenanigans as the previous point
I recorded the benchmark with `perf record -e cycles:pp ./libcxx/benchmarks/lexicographical_compare_three_way.libcxx.out`.
Below the hot loops of the individual benchmarks as reported by `perf report`:
Hot loop for `BM_lexicographical_compare_three_way_fast_path` and `BM_lexicographical_compare_three_way<int*>`. Both have the exact same assembly and the reported `cycles:pp` perf numbers are pretty identical.
15.10 │210:┌─→cmp %rdi,%r10
│ │↓ je 25a
17.76 │ │ mov 0x4(%rsi,%rdi,4),%r8d
16.45 │ │ mov 0x4(%r15,%rdi,4),%r9d
14.85 │ │ inc %rdi
0.30 │ ├──cmp %r9d,%r8d
15.98 │ └──je 210
Hot loop for `BM_lexicographical_compare_three_way_slow_path` and `BM_lexicographical_compare_three_way<cpp17_input_iterator<int*, int*>>`. Both have the exact same assembly and the reported perf numbers are pretty identical.
0.03 │210:┌─→mov (%r15,%r10,1),%edi
20.82 │ │ cmp %edi,(%rsi,%r10,1)
2.75 │ │↓ jne 240
7.61 │ │ cmp %r10,%rax
│ │ sete %r8b
0.01 │ │ cmp %r10,%r9
20.71 │ │ sete %dil
7.54 │ │↑ je 1d0
│ │ lea 0x4(%r10),%r11
0.01 │ │ cmp %r10,%rax
21.13 │ │ mov %r11,%r10
7.63 │ └──jne 210
Hot loop for `BM_lexicographical_compare_three_way<random_access_iterator<int*> >`. Interestingly, the generated assembly is identitcal with
2.10 │210:┌─→cmp %rdi,%r10
│ │↓ je 25a
35.84 │ │ mov 0x4(%rsi,%rdi,4),%r8d
3.58 │ │ mov 0x4(%r15,%rdi,4),%r9d
4.02 │ │ inc %rdi
0.29 │ ├──cmp %r9d,%r8d
33.22 │ └──je 210
As you can see, the assembly is identical with `BM_lexicographical_compare_three_way_fast_path`. However, the codes executes 40%, and the perf profile has different hot instructions.
My guess is that this is due to some micro-architectural shenanigans.
I am not sure how to further pinpoint this difference between `int*` and `random_access_iterator<int*>`.
Either way, I think we can say for sure: The assembly for the fast path is more efficient than the assembly for the slow path.
There seems to be some performance variability, though, due to some aspects which I don't understand fully.
Are those benchmark results as well as the updated benchmark code satisfactory to you, @var-const?
================
Comment at: libcxx/include/__algorithm/lexicographical_compare_three_way.h:36
+
+ if constexpr (__is_cpp17_random_access_iterator<_InputIterator1>::value &&
+ __is_cpp17_random_access_iterator<_InputIterator2>::value) {
----------------
var-const wrote:
> Optional: consider refactoring both branches into helper functions.
Done. In particular, because this allows me to address your other benchmarking request more easily: I can now directly call the slow and fast paths from the benchmark
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D131395/new/
https://reviews.llvm.org/D131395
More information about the libcxx-commits
mailing list