[libcxx-commits] [libcxx] [libc++][ranges] optimize the performance of `ranges::starts_with` (PR #84570)

Fri Apr 26 14:33:01 PDT 2024

xiaoyang-sde wrote:

I wrote a benchmark that compares `std::ranges::equal` and `std::ranges::mismatch` as the underlying implementation of `ranges::starts_with`. I ran it on two machines and got different results.

```cpp
#include <algorithm>
#include <benchmark/benchmark.h>
#include <vector>

#include "test_iterators.h"

static void bm_starts_with_contiguous_iter_with_equal_impl(benchmark::State& state) {
  std::vector<int> a(state.range(), 1);
  std::vector<int> p(state.range(), 1);

  for (auto _ : state) {
    benchmark::DoNotOptimize(a);
    benchmark::DoNotOptimize(p);

    auto begin1 = contiguous_iterator(a.data());
    auto end1   = contiguous_iterator(a.data() + a.size());
    auto begin2 = contiguous_iterator(p.data());
    auto end2   = contiguous_iterator(p.data() + p.size());

    benchmark::DoNotOptimize(std::ranges::equal(begin1, end1, begin2, end2));
  }
}
BENCHMARK(bm_starts_with_contiguous_iter_with_equal_impl)->RangeMultiplier(16)->Range(16, 16 << 20);

static void bm_starts_with_contiguous_iter_with_mismatch_impl(benchmark::State& state) {
  std::vector<int> a(state.range(), 1);
  std::vector<int> p(state.range(), 1);

  for (auto _ : state) {
    benchmark::DoNotOptimize(a);
    benchmark::DoNotOptimize(p);

    auto begin1 = contiguous_iterator(a.data());
    auto end1   = contiguous_iterator(a.data() + a.size());
    auto begin2 = contiguous_iterator(p.data());
    auto end2   = contiguous_iterator(p.data() + p.size());

    // The second range is a prefix of the first iff mismatch consumed it completely.
    benchmark::DoNotOptimize(std::ranges::mismatch(begin1, end1, begin2, end2).in2 == end2);
  }
}
BENCHMARK(bm_starts_with_contiguous_iter_with_mismatch_impl)->RangeMultiplier(16)->Range(16, 16 << 20);

BENCHMARK_MAIN();
```

The performance is similar on a MacBook Air (M1, arm64):

```console
2024-04-26T17:14:19-04:00
Running ./build/libcxx/benchmarks/ranges_starts_with.libcxx.out
Run on (8 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 2.00, 3.59, 3.41
-----------------------------------------------------------------------------------------------------
Benchmark                                                           Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------
bm_starts_with_contiguous_iter_with_equal_impl/16                2.89 ns         2.86 ns    244051251
bm_starts_with_contiguous_iter_with_equal_impl/256               27.1 ns         26.1 ns     26963107
bm_starts_with_contiguous_iter_with_equal_impl/4096               451 ns          443 ns      1578169
bm_starts_with_contiguous_iter_with_equal_impl/65536             6693 ns         6667 ns       104706
bm_starts_with_contiguous_iter_with_equal_impl/1048576         110306 ns       109800 ns         6162
bm_starts_with_contiguous_iter_with_equal_impl/16777216       3129511 ns      2780193 ns          311
bm_starts_with_contiguous_iter_with_mismatch_impl/16             3.08 ns         3.05 ns    225504566
bm_starts_with_contiguous_iter_with_mismatch_impl/256            26.9 ns         26.7 ns     25911339
bm_starts_with_contiguous_iter_with_mismatch_impl/4096            422 ns          420 ns      1678504
bm_starts_with_contiguous_iter_with_mismatch_impl/65536          6834 ns         6722 ns       105055
bm_starts_with_contiguous_iter_with_mismatch_impl/1048576      124471 ns       123355 ns         5691
bm_starts_with_contiguous_iter_with_mismatch_impl/16777216    2337331 ns      2326288 ns          288
```

However, on Arch Linux with a 4th Gen Xeon processor (avx2, x86_64), the `mismatch` version is slower at every size, by roughly 1.2x to 2.7x:

```console
2024-04-26T20:45:12+00:00
Running ./build/libcxx/benchmarks/ranges_starts_with.libcxx.out
Run on (4 X 2294.61 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 4096 KiB (x4)
Load Average: 0.00, 0.32, 0.66
-----------------------------------------------------------------------------------------------------
Benchmark                                                           Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------
bm_starts_with_contiguous_iter_with_equal_impl/16                7.68 ns         7.68 ns     93071510
bm_starts_with_contiguous_iter_with_equal_impl/256               31.4 ns         31.4 ns     25166061
bm_starts_with_contiguous_iter_with_equal_impl/4096               396 ns          396 ns      1528737
bm_starts_with_contiguous_iter_with_equal_impl/65536            10798 ns        10798 ns        59100
bm_starts_with_contiguous_iter_with_equal_impl/1048576         496691 ns       496671 ns         1499
bm_starts_with_contiguous_iter_with_equal_impl/16777216      13436051 ns     13435049 ns           50
bm_starts_with_contiguous_iter_with_mismatch_impl/16             10.7 ns         10.7 ns     68709479
bm_starts_with_contiguous_iter_with_mismatch_impl/256            59.0 ns         59.0 ns     10459829
bm_starts_with_contiguous_iter_with_mismatch_impl/4096           1069 ns         1069 ns       729445
bm_starts_with_contiguous_iter_with_mismatch_impl/65536         16881 ns        16880 ns        34519
bm_starts_with_contiguous_iter_with_mismatch_impl/1048576      583530 ns       583395 ns         1250
bm_starts_with_contiguous_iter_with_mismatch_impl/16777216   15792555 ns     15791353 ns           43
```

https://github.com/llvm/llvm-project/pull/84570