[cfe-dev] libc++ Performance (compared to libstdc++)

Fri Jul 1 17:55:46 PDT 2016

Hi everyone,

I was chatting with Marshall offline last week, and I mentioned that several of my users had noted general performance regressions when switching from libstdc++ to libc++. Marshall said that he's heard similar things, but has received few specific reports. He did recall looking at the problem that I believe is described here (http://aras-p.info/blog/2015/12/11/careful-with-that-stl-map-insert-eugene/), and which is still unresolved. I'll certainly admit that I've not investigated most of these in detail (with the exception of a std::complex -ffast-math issue, http://reviews.llvm.org/D18639). We do have a few performance-related libc++ bugs open:

https://llvm.org/bugs/show_bug.cgi?id=21192 - Reading from stdin is 1-2 orders of magnitude slower than using libstdc++ [I just tested this myself and updated the bug report].
https://llvm.org/bugs/show_bug.cgi?id=19708 - std::find is significantly slower than libstdc++.
https://llvm.org/bugs/show_bug.cgi?id=20837 - libc++'s std::sort is O(N^2) in the worst case (instead of O(N*ln(N))).
https://llvm.org/bugs/show_bug.cgi?id=26886 - libc++'s std::stable_sort also has a worst-case complexity issue.
https://llvm.org/bugs/show_bug.cgi?id=15456 - A faster implementation of std::function is possible
https://llvm.org/bugs/show_bug.cgi?id=16747 and https://llvm.org/bugs/show_bug.cgi?id=21275 - Our unordered_multimap insert is much slower than libstdc++'s. In PR16747, Howard interestingly explains libc++ has this problem because of an additional (i.e. not-required-by-the-standard) guarantee that libc++ provides regarding member ordering.

but very few are related to containers.
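
For anyone who wants a quick first look at the unordered_multimap insert reports (PR16747 and PR21275), here is a minimal timing sketch. It is not the test case from either bug report: the helper name time_multimap_inserts and the choice of key distribution (many elements sharing a few keys) are my own assumptions, picked only because duplicate keys are what distinguish a multimap.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <unordered_map>

// Hypothetical helper (not taken from the bug reports): inserts `count`
// elements spread over `distinct_keys` keys and returns the elapsed seconds.
// If `final_size` is non-null, it receives the container size afterwards.
double time_multimap_inserts(std::size_t count, std::size_t distinct_keys,
                             std::size_t* final_size) {
    assert(distinct_keys > 0);
    std::unordered_multimap<std::size_t, std::size_t> m;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        // i % distinct_keys forces many equal keys into the container.
        m.emplace(i % distinct_keys, i);
    }
    const auto stop = std::chrono::steady_clock::now();

    if (final_size) {
        *final_size = m.size();
    }
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_multimap_inserts(1000000, 100, nullptr) from a small main(), built once against each library, gives two numbers to compare. A single run like this is noisy, so treat it as a smoke test rather than a measurement.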

Baptiste Wicht has a benchmark covering use of several common standard algorithms with vectors, lists and deques (https://github.com/wichtounet/articles/blob/master/src/vector_list/bench.cpp), which he used for his post http://baptiste-wicht.com/posts/2012/12/cpp-benchmark-vector-list-deque.html. I've compiled it with LLVM/Clang/libc++ r271873 at -O3, against both libc++ and libstdc++ 4.8.5, and run it on an Intel Xeon E5-2699 v3 @ 2.30GHz running Linux 3.10.0. If you try this yourself, note that even on a fast machine the benchmark takes several hours to run.

$ clang++ -std=c++11 -O3 -I../../include -I../../../boost_1_61_0 bench.cpp ../demangle.cpp ../graphs.cpp -o /tmp/b-gnu
$ clang++ -std=c++11 -stdlib=libc++ -O3 -I../../include -I../../../boost_1_61_0 bench.cpp ../demangle.cpp ../graphs.cpp -o /tmp/b-llvm
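
When building the same source twice like this, it is easy to mix up the binaries. The following sketch reports which standard library a translation unit was actually compiled against, using the _LIBCPP_VERSION and __GLIBCXX__ macros that libc++ and libstdc++ define in their headers. The function names are hypothetical and are not part of the benchmark.

```cpp
#include <cassert>
#include <cstddef>  // including a standard header makes the library's macros visible
#include <string>

// Hypothetical helper: names the standard library this file was built against.
std::string standard_library_name() {
#if defined(_LIBCPP_VERSION)
    return "libc++";
#elif defined(__GLIBCXX__)
    return "libstdc++";
#else
    return "unknown";
#endif
}

// Hypothetical helper: aborts (unless NDEBUG is defined) when the binary was
// not built against the expected library, e.g. "libc++" for /tmp/b-llvm.
void expect_standard_library(const std::string& expected) {
    assert(expected == standard_library_name());
}
```

Printing standard_library_name() at the start of a benchmark run makes it harder to attribute a result to the wrong library.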

Of the 248 tests, libc++ was faster by at least 5% in 58 of the tests and libstdc++ was faster by at least 5% in 94 of the tests. libc++ was faster by at least 20% in 14 of the tests and libstdc++ was faster by at least 20% in 64 of the tests. The real problem, however, comes from the extremes. libc++ is never more than 65% faster than libstdc++:

destruction___Trivial_128_	list	-0.65
destruction___Trivial_4096_	vector	-0.40
destruction___Trivial_1024_	list	-0.38
destruction___Trivial_1024_	vector	-0.37
random_remove___NonTrivialArray_32_	vector	-0.3

but libc++ is sometimes over 10x slower than libstdc++:

fill_back___NonTrivialStringMovable	list_inserter	9.96
fill_back___NonTrivialStringMovable	vector_reserve	10.21
fill_back___NonTrivialStringMovableNoExcept	vector_reserve	10.82
fill_back___NonTrivialStringMovableNoExcept	vector_inserter	11.15
fill_back___NonTrivialStringMovable	vector_inserter	11.93

I've attached the full list.
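
To make the numbers easier to interpret, here is a sketch of one plausible way to compute such a relative difference. The exact formula is not stated in this message, so the definition below, (t_libcxx - t_libstdcxx) / t_libstdcxx, is an assumption; it is chosen because it yields negative values when libc++ is faster and values around 10 when libc++ takes roughly an order of magnitude longer.

```cpp
#include <cassert>

// ASSUMED definition (not confirmed by the original message):
// relative difference of the libc++ time with respect to the libstdc++ time.
// Negative means libc++ was faster; positive means libstdc++ was faster.
double relative_difference(double t_libcxx, double t_libstdcxx) {
    assert(t_libstdcxx > 0.0);
    return (t_libcxx - t_libstdcxx) / t_libstdcxx;
}

// Under the same assumption, converts a relative difference back into the
// ratio t_libcxx / t_libstdcxx.
double slowdown_factor(double relative_diff) {
    return 1.0 + relative_diff;
}
```

Under this reading, an entry of 9.96 would mean libc++ took roughly 11 times as long as libstdc++ on that test. A different baseline would change that interpretation.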

A second benchmark, http://beta.visl.sdu.dk/svn/visl/tools/benchmarks/src/set.cpp (https://tinodidriksen.com/2012/02/20/cpp-set-performance-2/), modified only to repeat each test 30 instead of 7 times, and compiled as before, gives these relative differences:

uint32_t std::set erase: -0.37
std::string std::set erase: -0.30
std::string std::set insertion: -0.23
std::string std::unordered_set erase: -0.16
std::string std::unordered_set iterate: -0.15
std::string std::set lookup: -0.15
uint32_t std::set insertion: -0.13
uint32_t std::unordered_set iterate: -0.072
std::string std::set iterate: -0.062
uint32_t std::set iterate: -0.054
uint32_t std::set lookup: -0.015
uint32_t std::unordered_set erase: 0.085
std::string std::unordered_set insertion: 0.22
std::string std::unordered_set lookup: 0.30
uint32_t std::unordered_set insertion: 0.51
uint32_t std::unordered_set lookup: 0.61

In this benchmark, libc++ beats libstdc++ by more than 5% in 10 tests, and libstdc++ beats libc++ by more than 5% in 5 tests. Again, however, libc++'s downside is larger: it is up to 61% slower than libstdc++ (in the 'uint32_t std::unordered_set lookup' test), whereas libstdc++ loses to libc++ by at most 37% (in the 'uint32_t std::set erase' test). Also, I can easily imagine that users are more likely to notice a performance difference in lookup than in erase.
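
The 5% tallies for this benchmark can be re-derived from the sixteen values listed above. The sketch below does that; the struct and function names are hypothetical, and the data is simply copied from the list in this message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: how many tests each library wins by more than
// the given threshold.
struct ThresholdCounts {
    std::size_t libcxx_faster;
    std::size_t libstdcxx_faster;
};

// Counts entries below -threshold (libc++ faster) and above +threshold
// (libstdc++ faster). Entries within the threshold are not counted.
ThresholdCounts count_beyond(const std::vector<double>& diffs, double threshold) {
    assert(threshold >= 0.0);
    ThresholdCounts counts = {0, 0};
    for (double d : diffs) {
        if (d < -threshold) {
            ++counts.libcxx_faster;
        } else if (d > threshold) {
            ++counts.libstdcxx_faster;
        }
    }
    return counts;
}

// The relative differences from the set/unordered_set benchmark above.
std::vector<double> set_benchmark_diffs() {
    return {-0.37, -0.30, -0.23, -0.16, -0.15, -0.15, -0.13, -0.072,
            -0.062, -0.054, -0.015, 0.085, 0.22, 0.30, 0.51, 0.61};
}
```

With a threshold of 0.05 this reproduces the 10 and 5 quoted above. The same function could be run over the attached full list to check the 5% and 20% tallies for the first benchmark.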

To pick another benchmark, I compiled and ran the one from http://www.reedbeta.com/blog/2015/01/12/data-oriented-hash-table/ - and this must be good because the post ends with, "And remember, if Chandler Carruth and Mike Acton give you advice about data structures, listen to them. ;)". I modified the benchmark only by adding constexpr to min() and max() of XorshiftRNG to make it compile with libc++. This benchmarks many configurations and takes nearly an hour to run. To summarize the results, I'll say that libc++ is almost always slower than libstdc++, and that it gets worse as the element size and/or the number of elements increases. Here are the relative timing differences; these are tests for unordered_map:

Fill (columns: number of elements, then 8-byte, 32-byte, 128-byte, 1K-byte and 4K-byte elements):
100000 -0.022 0.026 0.10 0.057 0.058
200000 -0.028 0.0041 0.12 0.083 0.057
300000 -0.022 -0.023 0.018 0.040 0.035
400000 -0.010 -0.0094 -0.077 0.048 0.049
500000 0.22 0.28 0.18 0.17 0.094
600000 -0.064 -0.081 -0.11 0.0033 0.020
700000 -0.059 -0.043 -0.089 -0.0047 0.027
800000 -0.037 -0.053 -0.072 0.0092 0.035
900000 0.36 0.30 0.20 0.15 0.099
1000000 0.31 0.23 0.17 0.13 0.098

The first column is the number of elements; the remaining columns are the relative differences for each element size, and negative numbers mean libc++ is faster. For 1000000 4K elements, libstdc++ is faster by 9.8%. For 1000000 8-byte elements, libstdc++ is faster by about 31%.

Pre-sized fill:
100000 0.024 0.020 0.12 0.091 0.042
200000 0.015 0.016 0.16 0.041 0.084
300000 0.00038 0.033 0.17 0.090 0.090
400000 0.070 0.041 0.069 0.084 0.094
500000 0.069 0.025 0.061 0.12 0.10
600000 -0.0096 0.023 0.016 0.074 0.072
700000 0.060 0.0013 0.036 0.094 0.085
800000 0.029 -0.0048 0.035 0.083 0.075
900000 -0.011 0.0025 -0.037 0.078 0.085
1000000 0.0022 0.019 0.011 0.084 0.081

Time for 100K lookups:
100000 0.12 0.13 0.089 0.11 0.071
200000 0.13 0.12 0.10 0.11 0.072
300000 0.098 0.12 0.053 0.085 0.080
400000 0.18 0.11 0.088 0.059 0.030
500000 0.12 0.080 0.072 0.075 0.033
600000 0.095 0.10 0.076 0.063 0.017
700000 0.17 0.097 0.12 0.083 0.043
800000 0.16 0.13 0.11 0.092 0.050
900000 0.094 0.086 0.056 0.041 -0.0047
1000000 0.14 0.11 0.092 0.073 0.016

Time for 100K failed lookups:
100000 0.15 0.15 0.11 0.087 0.077
200000 0.14 0.11 0.083 0.074 0.059
300000 0.10 0.091 0.090 0.12 0.097
400000 0.10 0.12 0.061 0.11 0.099
500000 0.19 0.20 0.12 0.12 0.12
600000 0.18 0.18 0.12 0.11 0.095
700000 0.082 0.096 0.068 0.079 0.070
800000 0.20 0.19 0.14 0.13 0.12
900000 0.20 0.17 0.10 0.11 0.097
1000000 0.25 0.22 0.17 0.14 0.13

Time to remove half the elements:
100000 0.15 0.15 0.078 0.066 0.066
200000 0.12 0.12 0.039 0.055 0.060
300000 0.12 0.070 0.024 0.038 0.095
400000 0.16 0.16 0.15 0.13 0.13
500000 0.10 0.12 0.12 0.13 0.13
600000 0.077 0.015 0.036 0.026 0.049
700000 0.086 0.16 0.16 0.17 0.14
800000 0.076 0.044 0.051 0.043 0.064
900000 0.11 0.11 0.11 0.12 0.13
1000000 0.10 0.090 0.061 0.081 0.10
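
On the one source change I made to this benchmark (adding constexpr to min() and max() of XorshiftRNG): the standard's requirements for uniform random number generators describe min() and max() as compile-time values, which is presumably why libc++ rejects a generator whose versions are not constexpr. The class below is a hypothetical stand-in written for illustration, not the benchmark's XorshiftRNG; the shift constants are a common xorshift32 choice and are an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <random>

// Hypothetical stand-in for the benchmark's generator (NOT its actual code).
class XorshiftSketch {
public:
    typedef std::uint32_t result_type;

    // A zero state would stay zero forever, so it is replaced with 1.
    explicit XorshiftSketch(result_type seed) : state_(seed != 0 ? seed : 1u) {}

    // constexpr bounds, usable in constant expressions by the library.
    // xorshift32 never produces 0 from a non-zero state, hence min() == 1.
    static constexpr result_type min() { return 1u; }
    static constexpr result_type max() {
        return std::numeric_limits<result_type>::max();
    }

    result_type operator()() {
        assert(state_ != 0);
        state_ ^= state_ << 13;
        state_ ^= state_ >> 17;
        state_ ^= state_ << 5;
        return state_;
    }

private:
    result_type state_;
};

// Draws one value in [lo, hi] using the standard distribution machinery.
int draw_in_range(XorshiftSketch& rng, int lo, int hi) {
    std::uniform_int_distribution<int> dist(lo, hi);
    return dist(rng);
}
```

A generator written this way should be accepted by std::uniform_int_distribution in both libraries, so the benchmark source can stay identical across the two builds.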

Many of our users have code that is sensitive to the performance of standard containers and algorithms, and this preliminary benchmarking lends support to the anecdotes that libc++ is slower than libstdc++. Worryingly, the extremes of these differences are pretty large. Obviously application impact can't be judged by some benchmarks I happened to find on the internet, but this is something we, as a community, should look at more closely.

Thanks again,
Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
Attachment: psums.txt
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20160701/7ca8fff9/attachment.txt>