[libcxx-commits] [PATCH] D131395: [libc++] Implement `lexicographical_compare_three_way`

Adrian Vogelsgesang via Phabricator via libcxx-commits libcxx-commits at lists.llvm.org
Sat Nov 19 17:30:24 PST 2022


avogelsgesang marked 3 inline comments as done.
avogelsgesang added a comment.

@var-const As requested, I updated the benchmark to benchmark `int*` on both the fast and the slow path.

Previously, you proposed

> Can you please redo the benchmark in a different way by doing some local (not to be committed) modifications? Rather than comparing the results for int* versus cpp17_input_iterator<int*>, remove the input iterator tests completely, run the int* benchmark against the current implementation, then remove the optimization and rerun the benchmark. That way, it would be a proper "apples-to-apples" comparison: int* vs. int* with or without the optimization. (I don't expect results to change much, but better be sure)

Instead, I now factored out the slow and fast path into two separate functions (as proposed in another comment) and call both of them directly from the benchmark. That way, we can also rerun the benchmark in the future, and e.g. remove the optimization from libc++ again, as soon as clang's optimizations are smart enough to figure this out by themselves. Do you agree with the updated benchmark?

The results on my local machine (compiled using clang commit `aae08b1d372a45b8bef95e86e5fe9110045eb78d`):

  Running ./libcxx/benchmarks/lexicographical_compare_three_way.libcxx.out
  Run on (40 X 800.102 MHz CPU s)
  CPU Caches:
    L1 Data 32 KiB (x20)
    L1 Instruction 32 KiB (x20)
    L2 Unified 1024 KiB (x20)
    L3 Unified 14080 KiB (x2)
  Load Average: 0.00, 0.48, 0.67
  ---------------------------------------------------------------------------------------------------------------------
  Benchmark                                                                           Time             CPU   Iterations
  ---------------------------------------------------------------------------------------------------------------------
  BM_lexicographical_compare_three_way_slow_path/1                                 1.67 ns         1.67 ns    415947081
  BM_lexicographical_compare_three_way_slow_path/4                                 4.68 ns         4.68 ns    149519461
  BM_lexicographical_compare_three_way_slow_path/16                                16.7 ns         16.7 ns     41824440
  BM_lexicographical_compare_three_way_slow_path/64                                64.9 ns         64.9 ns     10711142
  BM_lexicographical_compare_three_way_slow_path/256                                266 ns          266 ns      2630928
  BM_lexicographical_compare_three_way_slow_path/1024                              1036 ns         1036 ns       676167
  BM_lexicographical_compare_three_way_slow_path/4096                              4122 ns         4121 ns       169817
  BM_lexicographical_compare_three_way_slow_path/16384                            16453 ns        16453 ns        42559
  BM_lexicographical_compare_three_way_slow_path/65536                            65827 ns        65826 ns        10619
  BM_lexicographical_compare_three_way_slow_path/262144                          263147 ns       263140 ns         2658
  BM_lexicographical_compare_three_way_slow_path/1048576                        1060909 ns      1060813 ns          659
  BM_lexicographical_compare_three_way_fast_path/1                                 2.34 ns         2.34 ns    298968436
  BM_lexicographical_compare_three_way_fast_path/4                                 4.76 ns         4.76 ns    148425440
  BM_lexicographical_compare_three_way_fast_path/16                                10.1 ns         10.1 ns     68735401
  BM_lexicographical_compare_three_way_fast_path/64                                28.8 ns         28.8 ns     24262243
  BM_lexicographical_compare_three_way_fast_path/256                                123 ns          123 ns      5685144
  BM_lexicographical_compare_three_way_fast_path/1024                               444 ns          444 ns      1575722
  BM_lexicographical_compare_three_way_fast_path/4096                              1749 ns         1749 ns       400100
  BM_lexicographical_compare_three_way_fast_path/16384                             6898 ns         6897 ns       101189
  BM_lexicographical_compare_three_way_fast_path/65536                            28537 ns        28536 ns        24520
  BM_lexicographical_compare_three_way_fast_path/262144                          125793 ns       125791 ns         5503
  BM_lexicographical_compare_three_way_fast_path/1048576                         506020 ns       505976 ns         1379
  BM_lexicographical_compare_three_way<int*>/1                                     1.67 ns         1.67 ns    418741799
  BM_lexicographical_compare_three_way<int*>/4                                     3.68 ns         3.68 ns    190328759
  BM_lexicographical_compare_three_way<int*>/16                                    8.70 ns         8.70 ns     80464881
  BM_lexicographical_compare_three_way<int*>/64                                    28.9 ns         28.9 ns     24261824
  BM_lexicographical_compare_three_way<int*>/256                                    125 ns          125 ns      5593129
  BM_lexicographical_compare_three_way<int*>/1024                                   448 ns          448 ns      1562365
  BM_lexicographical_compare_three_way<int*>/4096                                  1751 ns         1750 ns       399745
  BM_lexicographical_compare_three_way<int*>/16384                                 6895 ns         6895 ns       101193
  BM_lexicographical_compare_three_way<int*>/65536                                28741 ns        28740 ns        24319
  BM_lexicographical_compare_three_way<int*>/262144                              125823 ns       125822 ns         5523
  BM_lexicographical_compare_three_way<int*>/1048576                             508282 ns       508215 ns         1385
  BM_lexicographical_compare_three_way<random_access_iterator<int*>>/1             2.68 ns         2.68 ns    261609292
  BM_lexicographical_compare_three_way<random_access_iterator<int*>>/4             5.37 ns         5.37 ns    132805153
  BM_lexicographical_compare_three_way<random_access_iterator<int*>>/16            12.4 ns         12.4 ns     56251434
  BM_lexicographical_compare_three_way<random_access_iterator<int*>>/64            44.5 ns         44.5 ns     15732328
  BM_lexicographical_compare_three_way<random_access_iterator<int*>>/256            182 ns          182 ns      3848123
  BM_lexicographical_compare_three_way<random_access_iterator<int*>>/1024           696 ns          696 ns      1003714
  BM_lexicographical_compare_three_way<random_access_iterator<int*>>/4096          2753 ns         2753 ns       254392
  BM_lexicographical_compare_three_way<random_access_iterator<int*>>/16384        10966 ns        10966 ns        63844
  BM_lexicographical_compare_three_way<random_access_iterator<int*>>/65536        43846 ns        43844 ns        15953
  BM_lexicographical_compare_three_way<random_access_iterator<int*>>/262144      175599 ns       175590 ns         3989
  BM_lexicographical_compare_three_way<random_access_iterator<int*>>/1048576     705291 ns       705245 ns          991
  BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/1               1.67 ns         1.67 ns    418471062
  BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/4               4.68 ns         4.68 ns    149512136
  BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/16              16.7 ns         16.7 ns     41727011
  BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/64              65.0 ns         65.0 ns     10776363
  BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/256              266 ns          266 ns      2635392
  BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/1024            1035 ns         1035 ns       674817
  BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/4096            4121 ns         4121 ns       169881
  BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/16384          16461 ns        16461 ns        42545
  BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/65536          65776 ns        65774 ns        10630
  BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/262144        263308 ns       263295 ns         2659
  BM_lexicographical_compare_three_way<cpp17_input_iterator<int*>>/1048576      1060793 ns      1060722 ns          660

A couple of observations:

- The results for `{fast,slow}_path/1` vary from run to run. The slow_path is not faster than the fast_path consistently. I think this is primarily variance
- `fast_path/1048576` is twice as fast as `slow_path/1048576`
- The results for `three_way<int*>/1048576 ` and `three_way<int*>/1048576` are inline with the results for `{fast,slow}_path/1048576`. The dispatching based on whether we are passing in a random access iterator or not works as expected
- `three_way<random_access_iterator<int*>>/1048576` is less efficient than `three_way<int*>/1048576`. As shown below, the assembly code is exactly identical between the two, though. So this seems to be due to some microarchitectural shenanigans (maybe code alignment?)
- Compared to the previously posted results, `three_way<int*>` is now faster. I think compared to last time, I was just more lucky this time, and this improvement is thanks to the same microarchitectural shenanigans as the previous point

I recorded the benchmark with `perf record -e cycles:pp ./libcxx/benchmarks/lexicographical_compare_three_way.libcxx.out`.
Below the hot loops of the individual benchmarks as reported by `perf report`:

Hot loop for `BM_lexicographical_compare_three_way_fast_path` and `BM_lexicographical_compare_three_way<int*>`. Both have the exact same assembly and the reported `cycles:pp` perf numbers are pretty identical.

  15.10 │210:┌─→cmp    %rdi,%r10 
        │    │↓ je     25a
  17.76 │    │  mov    0x4(%rsi,%rdi,4),%r8d
  16.45 │    │  mov    0x4(%r15,%rdi,4),%r9d
  14.85 │    │  inc    %rdi
   0.30 │    ├──cmp    %r9d,%r8d
  15.98 │    └──je     210

Hot loop for `BM_lexicographical_compare_three_way_slow_path` and `BM_lexicographical_compare_three_way<cpp17_input_iterator<int*, int*>>`. Both have the exact same assembly and the reported perf numbers are pretty identical.

   0.03 │210:┌─→mov    (%r15,%r10,1),%edi
  20.82 │    │  cmp    %edi,(%rsi,%r10,1) 
   2.75 │    │↓ jne    240 
   7.61 │    │  cmp    %r10,%rax
        │    │  sete   %r8b
   0.01 │    │  cmp    %r10,%r9
  20.71 │    │  sete   %dil
   7.54 │    │↑ je     1d0
        │    │  lea    0x4(%r10),%r11
   0.01 │    │  cmp    %r10,%rax
  21.13 │    │  mov    %r11,%r10
   7.63 │    └──jne    210

Hot loop for `BM_lexicographical_compare_three_way<random_access_iterator<int*> >`. Interestingly, the generated assembly is identitcal with

   2.10 │210:┌─→cmp    %rdi,%r10 
        │    │↓ je     25a
  35.84 │    │  mov    0x4(%rsi,%rdi,4),%r8d
   3.58 │    │  mov    0x4(%r15,%rdi,4),%r9d
   4.02 │    │  inc    %rdi
   0.29 │    ├──cmp    %r9d,%r8d
  33.22 │    └──je     210

As you can see, the assembly is identical with `BM_lexicographical_compare_three_way_fast_path`. However, the codes executes 40%, and the perf profile has different hot instructions.
My guess is that this is due to some micro-architectural shenanigans.

I am not sure how to further pinpoint this difference between `int*` and `random_access_iterator<int*>`.
Either way, I think we can say for sure: The assembly for the fast path is more efficient than the assembly for the slow path.
There seems to be some performance variability, though, due to some aspects which I don't understand fully.

Are those benchmark results as well as the updated benchmark code satisfactory to you, @var-const?



================
Comment at: libcxx/include/__algorithm/lexicographical_compare_three_way.h:36
+
+  if constexpr (__is_cpp17_random_access_iterator<_InputIterator1>::value &&
+                __is_cpp17_random_access_iterator<_InputIterator2>::value) {
----------------
var-const wrote:
> Optional: consider refactoring both branches into helper functions.
Done. In particular, because this allows me to address your other benchmarking request more easily: I can now directly call the slow and fast paths from the benchmark


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D131395/new/

https://reviews.llvm.org/D131395



More information about the libcxx-commits mailing list