[libcxx-commits] [libcxx] [libc++] Introduce one-sided binary search for lower_bound on non-random iterators, and use that to improve the average complexity of set_intersection. (PR #75230)

Wed Jan 24 08:48:05 PST 2024

ichaer wrote:

@philnik777, @mordante, perhaps you'll have stumbled on this in the past: I was getting surprising results while running benchmarks. For instance, I would consistently measure improvements across the board for `std::vector<uint32_t>`, but worse performance throughout for `std::vector<uint64_t>`. So I went to run `perf stat`, filtering for those specific test cases to check whether it really was cache locality (my best guess), and the results were completely different! My version came out ahead in the vast majority of cases. What I then did was collect the test names and re-run the process for each of them separately, and after that the results started making sense to me.

Have you seen this happening before? I found one bug report for Google Benchmark which might be related, https://github.com/google/benchmark/issues/461, but it's unresolved and very unclear. Another one is https://github.com/google/benchmark/issues/1469, for which `--benchmark_enable_random_interleaving` was created. To use that, though, I'd have to move away from internal batching, which is the normal way of doing it in the libc++ benchmarks I looked at, and which seems to be better overall: while debugging, I found the benchmark going through its init steps again when using `--benchmark_repetitions`.

In any case, time-wise, benchmark results are largely positive. The worse results show up predominantly (1) when the cost of iterating is greater than the cost of comparison (`std::set` and cheap-to-compare types), and (2) when both input sets are identical, in which case all the additional machinery is overhead.
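To make the trade-off concrete, here is a minimal sketch of the general technique, written for illustration only: it is not the code in this PR. The names `onesided_lower_bound`, `galloping_set_intersection` and `demo_comparison_count` are hypothetical. The search gallops forward in doubling steps and then bisects the last step, so comparisons grow with the logarithm of the distance skipped, while iterator increments still grow linearly with it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <numeric>
#include <vector>

// Hypothetical helper (not the PR's code): lower_bound on a sorted range using
// only forward-iterator operations. Comparisons are O(log d), where d is the
// distance from `first` to the result; iterator increments are O(d).
template <class ForwardIt, class T, class Compare = std::less<>>
ForwardIt onesided_lower_bound(ForwardIt first, ForwardIt last, const T& value, Compare comp = {}) {
    using Diff = typename std::iterator_traits<ForwardIt>::difference_type;
    for (Diff step = 1;; step *= 2) {
        // Move a probe up to `step` elements ahead, stopping at `last`.
        ForwardIt probe = first;
        Diff dist = 0;
        while (dist < step && probe != last) { ++probe; ++dist; }

        if (probe == last || !comp(*probe, value)) {
            // The answer lies in [first, probe]: bisect the `dist` elements of [first, probe).
            while (dist > 0) {
                Diff half = dist / 2;
                ForwardIt mid = std::next(first, half);
                if (comp(*mid, value)) { first = ++mid; dist -= half + 1; }
                else { dist = half; }
            }
            return first;
        }
        // *probe < value, so everything up to and including probe is too small.
        first = ++probe;
    }
}

// Hypothetical intersection built on the helper above: each side skips ahead to
// the other side's current element instead of stepping one element at a time.
template <class It1, class It2, class Out, class Compare = std::less<>>
Out galloping_set_intersection(It1 first1, It1 last1, It2 first2, It2 last2, Out out, Compare comp = {}) {
    while (first1 != last1 && first2 != last2) {
        first1 = onesided_lower_bound(first1, last1, *first2, comp);
        if (first1 == last1) break;
        first2 = onesided_lower_bound(first2, last2, *first1, comp);
        if (first2 == last2) break;
        if (!comp(*first1, *first2)) {  // neither element is less: a match
            *out++ = *first1;
            ++first1;
            ++first2;
        }
    }
    return out;
}

// Counts the comparisons used to intersect {1..1000} with {500, 1000, 2000}.
inline std::size_t demo_comparison_count() {
    std::vector<int> a(1000), b{500, 1000, 2000}, result;
    std::iota(a.begin(), a.end(), 1);
    std::size_t count = 0;
    auto counting_less = [&count](int x, int y) { ++count; return x < y; };
    galloping_set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                               std::back_inserter(result), counting_less);
    assert((result == std::vector<int>{500, 1000}));
    return count;
}
```

In this sketch, sparse matches need far fewer comparisons than there are elements to skip, but every element skipped is still visited by an iterator increment, which is why node-based containers with cheap comparisons benefit least. With identical inputs, each match pays for two searches plus an equality check and nothing is ever skipped.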

In terms of operation counts, however... I think the results speak for themselves =D. I'm obviously biased, but, to me, this is a thing of beauty. If the priority is scalability, this looks like an easy choice to me.

I'm attaching results I've tabulated in the way I described, filtering such that each test case is executed in an independent process. The results labeled "original" are from 142e567cf008, and "onesided" is from the current tip of this branch. I've also executed the benchmarks twice for each version, so that timings are collected without the instrumentation required for the additional operation counters.

[comparison.ods](https://github.com/llvm/llvm-project/files/14041313/comparison.ods)
[comparison.pdf](https://github.com/llvm/llvm-project/files/14041314/comparison.pdf)

https://github.com/llvm/llvm-project/pull/75230