[libcxx-commits] [libcxx] [libc++] Tiny optimizations for is_permutation (PR #129565)
Louis Dionne via libcxx-commits
libcxx-commits at lists.llvm.org
Tue Mar 25 14:03:59 PDT 2025
https://github.com/ldionne commented:
First, make sure to benchmark on the latest `main` since I recently fixed two issues where we wouldn't vectorize properly inside `mismatch`. I pulled your branch and rebased it onto `main` just now, and the benchmarks that do worse for me are the following (I dropped all the lines where your patch was an improvement):
```
Comparing build/default/libcxx/test/benchmarks/algorithms/nonmodifying/Output/is_permutation.bench.cpp.dir/benchmark-result.json to build/candidate/libcxx/test/benchmarks/algorithms/nonmodifying/Output/is_permutation.bench.cpp.dir/benchmark-result.json
Benchmark    Time    CPU    Time Old    Time New    CPU Old    CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
std::is_permutation(list<int>) (3leg) (common prefix)/8 +0.0801 +0.0824 5 5 5 5
std::is_permutation(list<int>) (3leg) (common prefix)/1024 +0.4706 +0.4731 1088 1599 1086 1599
std::is_permutation(list<int>) (3leg) (common prefix)/8192 +0.1887 +0.1899 11456 13618 11445 13618
std::is_permutation(list<int>) (3leg, pred) (common prefix)/1024 +0.0336 +0.0338 1137 1176 1137 1175
rng::is_permutation(list<int>) (4leg, pred) (common prefix)/8192 +0.1133 +0.1139 12519 13938 12512 13937
std::is_permutation(list<int>) (3leg) (shuffled)/8 +0.0699 +0.0701 61 65 61 65
std::is_permutation(list<int>) (3leg, pred) (shuffled)/1024 +0.0281 +0.0289 2234288 2297102 2232489 2296954
std::is_permutation(list<int>) (4leg) (shuffled)/8 +0.0703 +0.0723 61 65 61 65
rng::is_permutation(list<int>) (4leg) (shuffled)/8 +0.0819 +0.0818 60 65 60 65
std::is_permutation(list<int>) (4leg, pred) (shuffled)/8 +0.2659 +0.2659 76 96 76 96
rng::is_permutation(list<int>) (4leg, pred) (shuffled)/8 +0.2426 +0.2472 77 96 77 96
std::is_permutation(deque<int>) (4leg) (common prefix)/8 +0.2592 +0.2600 12 15 12 15
std::is_permutation(deque<int>) (4leg) (common prefix)/1024 +0.5906 +0.5919 818 1301 817 1301
std::is_permutation(deque<int>) (4leg) (common prefix)/8192 +0.5919 +0.5925 6456 10277 6453 10276
rng::is_permutation(deque<int>) (4leg) (common prefix)/8 +0.4652 +0.4659 10 15 10 15
rng::is_permutation(deque<int>) (4leg) (common prefix)/1024 +0.5856 +0.5861 812 1288 812 1288
rng::is_permutation(deque<int>) (4leg) (common prefix)/8192 +0.5919 +0.5921 6443 10256 6441 10255
std::is_permutation(deque<int>) (4leg, pred) (common prefix)/8 +0.3581 +0.3591 11 16 11 16
std::is_permutation(deque<int>) (4leg, pred) (common prefix)/1024 +0.4977 +0.4986 862 1291 861 1291
std::is_permutation(deque<int>) (4leg, pred) (common prefix)/8192 +0.5084 +0.5102 6826 10296 6817 10295
rng::is_permutation(deque<int>) (4leg, pred) (common prefix)/8 +0.3754 +0.3763 11 16 11 16
rng::is_permutation(deque<int>) (4leg, pred) (common prefix)/1024 +0.5137 +0.5142 852 1290 852 1290
rng::is_permutation(deque<int>) (4leg, pred) (common prefix)/8192 +0.5136 +0.5144 6771 10248 6767 10248
std::is_permutation(deque<int>) (3leg) (shuffled)/8 +0.0634 +0.0639 73 78 73 78
std::is_permutation(deque<int>) (3leg, pred) (shuffled)/8 +0.0260 +0.0275 81 83 81 83
rng::is_permutation(deque<int>) (4leg, pred) (shuffled)/8 +0.3511 +0.3530 81 109 81 109
std::is_permutation(vector<int>) (3leg, pred) (common prefix)/8 +0.0183 +0.0172 4 4 4 4
std::is_permutation(vector<int>) (3leg) (shuffled)/8 +0.1508 +0.1512 49 56 49 56
std::is_permutation(vector<int>) (3leg, pred) (shuffled)/8 +0.0585 +0.0604 62 65 62 65
std::is_permutation(vector<int>) (4leg) (shuffled)/8 +0.0977 +0.0989 49 54 49 54
rng::is_permutation(vector<int>) (4leg) (shuffled)/8 +0.1512 +0.1525 49 56 49 56
std::is_permutation(vector<int>) (4leg, pred) (shuffled)/8 +0.0705 +0.0721 61 66 61 66
```
- First, we can observe that `vector<int>` is only doing worse on very small sequences. That's actually a particularity of this benchmark: it operates on pretty small sequences since `is_permutation` is so expensive. I think we can mostly disregard the slowdown for `vector<int>` since it only affects 8-element sequences. I suspect that making our vectorized `mismatch` faster on small sequences would solve the problem here.
- Second, we can see that we're doing worse on several benchmarks that check the `common prefix` pattern. But with that data pattern, the algorithm should be dominated by `mismatch`. So I think we need to understand why our current `std::mismatch` behaves worse on `std::deque` than the hand-written loop that existed in `std::is_permutation` before your patch. I think you could also validate that switching from the hand-written loop to `std::mismatch` is the cause of the slowdown by locally reverting just that part of the change and seeing if the before/after benchmarks are better for `std::deque` on `common prefix`. BTW you can locally edit the benchmark to only run a subset of all the combinations in order to iterate more quickly.
- Last, we are also doing worse on `list` with the `common prefix` pattern. I suspect we might be hitting the same issue as with `deque`.
So TLDR, I'd focus on confirming that `std::mismatch` is slower on `deque` and `list` than a naive hand-written loop, and go from there.
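As a starting point for that comparison, here is a minimal sketch of what a naive hand-written loop could look like next to `std::mismatch`. The names `naive_mismatch`, `make_iota` and `average_ns` are hypothetical helpers made up for illustration; they are not taken from the patch or from the libc++ benchmark suite, and the timer is deliberately crude compared to the real benchmarks.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <iterator>
#include <list>
#include <numeric>
#include <utility>

// Hypothetical helper: plain element-by-element loop that stops at the first
// position where the two sequences differ (assumes the second sequence is at
// least as long as the first, like 3-legged std::mismatch).
template <class It1, class It2>
std::pair<It1, It2> naive_mismatch(It1 first1, It1 last1, It2 first2) {
  for (; first1 != last1; ++first1, (void)++first2) {
    if (!(*first1 == *first2))
      break;
  }
  return {first1, first2};
}

// Hypothetical helper: builds a container holding 0, 1, ..., n - 1.
template <class Container>
Container make_iota(std::size_t n) {
  Container c(n);
  std::iota(c.begin(), c.end(), 0);
  return c;
}

// Hypothetical helper: crude wall-clock timer returning the average number of
// nanoseconds per call of f. The volatile sink keeps the calls from being
// optimized away entirely; f is expected to return a std::size_t.
template <class F>
double average_ns(F f, int reps) {
  assert(reps > 0);
  volatile std::size_t sink = 0;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < reps; ++i)
    sink = sink + f();
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(stop - start).count() / reps;
}
```

To use it, build two `std::deque<int>` (or `std::list<int>`) inputs with `make_iota` that share a long common prefix, then pass a lambda that calls either `naive_mismatch` or `std::mismatch` and returns the mismatch position as a `std::size_t` to `average_ns`. If the naive loop comes out consistently faster, that points at `std::mismatch` as the cause. The numbers are only a rough signal, so the real before/after comparison should still go through the existing `is_permutation` benchmark.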
https://github.com/llvm/llvm-project/pull/129565