[cfe-dev] libc++ Performance (compared to libstdc++)

Hal Finkel via cfe-dev cfe-dev at lists.llvm.org
Sun Jul 3 08:47:06 PDT 2016


----- Original Message -----

> From: "Eric Fiselier" <eric at efcs.ca>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "cfe-dev" <cfe-dev at lists.llvm.org>, "Marshall Clow"
> <mclow.lists at gmail.com>
> Sent: Saturday, July 2, 2016 1:03:53 AM
> Subject: Re: libc++ Performance (compared to libstdc++)

> Hi Hal,

> Thanks for putting this together. I'll start looking at the
> benchmarks and seeing what I can do.
> Here are my drive by thoughts (more to come tomorrow)

> On Fri, Jul 1, 2016 at 6:55 PM, Hal Finkel < hfinkel at anl.gov > wrote:

> > Hi everyone,
> 

> > I was chatting with Marshall offline last week, and I mentioned
> > that
> > several of my users had noted general performance regressions
> > switching from libstdc++ to libc++. Marshall said that he's heard
> > similar things, but has received few specific reports. He did
> > recall
> > looking at the problem which I believe is described here (
> > http://aras-p.info/blog/2015/12/11/careful-with-that-stl-map-insert-eugene/
> > ), which is still a problem. I'll certainly admit that I'd not
> > investigated most of these in detail (with the exception of a
> > std::complex -ffast-math issue, http://reviews.llvm.org/D18639 ).
> > We
> > do have a few performance-related libc++ bugs open:
> 

> > https://llvm.org/bugs/show_bug.cgi?id=21192 - Reading from stdin is
> > 1-2 orders of magnitude slower than using libstdc++ [I just tested
> > this myself and updated the bug report].
> 
> > https://llvm.org/bugs/show_bug.cgi?id=19708 - std::find is
> > significantly slower than libstdc++.
> 
> > https://llvm.org/bugs/show_bug.cgi?id=20837 - libc++'s std::sort is
> > O(N^2) in the worst case (instead of O(N*ln(N))).
> 
> > https://llvm.org/bugs/show_bug.cgi?id=26886 - libc++'s
> > std::stable_sort also has a worst-case complexity issue.
> 
> > https://llvm.org/bugs/show_bug.cgi?id=15456 - A faster
> > implementation
> > of std::function is possible
> 
> > https://llvm.org/bugs/show_bug.cgi?id=16747 and
> > https://llvm.org/bugs/show_bug.cgi?id=21275 - Our
> > unordered_multimap
> > insert is much slower than libstdc++'s. In PR16747, Howard
> > interestingly explains libc++ has this problem because of an
> > additional (i.e. not-required-by-the-standard) guarantee that
> > libc++
> > provides regarding member ordering.
> 

> > but very few are related to containers.
> 

> > Baptiste Wicht has a benchmark covering use of several common
> > standard algorithms with vectors, lists and dqueues (
> > https://github.com/wichtounet/articles/blob/master/src/vector_list/bench.cpp
> > ) which he used for his post
> > http://baptiste-wicht.com/posts/2012/12/cpp-benchmark-vector-list-deque.html
> > , and I've compiled this using LLVM/Clang/libc++ r271873 @ -O3,
> > using both libc++ and libstdc++ 4.8.5, and run on an Intel Xeon
> > E5-2699 v3 @ 2.30GHz running Linux 3.10.0. If you try this
> > yourself,
> > note that even on a fast machine the benchmark takes several hours
> > to run.
> 

> > $ clang++ -std=c++11 -O3 -I../../include -I../../../boost_1_61_0
> > bench.cpp ../demangle.cpp ../graphs.cpp -o /tmp/b-gnu
> 
> > $ clang++ -std=c++11 -stdlib=libc++ -O3 -I../../include
> > -I../../../boost_1_61_0 bench.cpp ../demangle.cpp ../graphs.cpp -o
> > /tmp/b-llvm
> 

> > Of the 248 tests, libc++ was faster by at least 5% in 58 of the
> > tests
> > and libstdc++ was faster by at least 5% in 94 of the tests. libc++
> > was faster by at least 20% in 14 of the tests and libstdc++ was
> > faster by at least 20% in 64 of the tests. The real problem,
> > however, comes from the extremums. libc++ is never more than 65%
> > faster than libstdc++:
> 

> > destruction___Trivial_128_ list -0.65
> 
> > destruction___Trivial_4096_ vector -0.40
> 
> > destruction___Trivial_1024_ list -0.38
> 
> > destruction___Trivial_1024_ vector -0.37
> 
> > random_remove___NonTrivialArray_32_ vector -0.3
> 

> Whoo hoo!

> > but libc++ is sometimes over 10x slower than libstdc++:
> 

> > fill_back___NonTrivialStringMovable list_inserter 9.96
> 
> > fill_back___NonTrivialStringMovable vector_reserve 10.21
> 
> > fill_back___NonTrivialStringMovableNoExcept vector_reserve 10.82
> 
> > fill_back___NonTrivialStringMovableNoExcept vector_inserter 11.15
> 
> > fill_back___NonTrivialStringMovable vector_inserter 11.93
> 

> This is almost certainly testing against GCC's COW string, which is
> obviously going to be a lot faster.
> I looked at the tests and it looks like the strings are being copied
> (not moved) into the vector, so the name is a bit of the type is a
> bit of a misnomer.

Indeed. I had forgotten that libstdc++ uses COW strings (I imagine that if I were to test using the GCC 5+ implementation we might see a different result). 

> > I've attached the full list.
> 

> > A second benchmark,
> > http://beta.visl.sdu.dk/svn/visl/tools/benchmarks/src/set.cpp (
> > https://tinodidriksen.com/2012/02/20/cpp-set-performance-2/ ),
> > modified only to repeat each test 30 instead of 7 times, and
> > compiled as before:
> 

> > uint32_t std::set erase: -0.37
> 
> > std::string std::set erase: -0.30
> 
> > std::string std::set insertion: -0.23
> 
> > std::string std::unordered_set erase: -0.16
> 
> > std::string std::unordered_set iterate: -0.15
> 
> > std::string std::set lookup: -0.15
> 
> > uint32_t std::set insertion: -0.13
> 
> > uint32_t std::unordered_set iterate: -0.072
> 
> > std::string std::set iterate: -0.062
> 
> > uint32_t std::set iterate: -0.054
> 
> > uint32_t std::set lookup: -0.015
> 
> > uint32_t std::unordered_set erase: 0.085
> 
> > std::string std::unordered_set insertion: 0.22
> 
> > std::string std::unordered_set lookup: 0.30
> 
> > uint32_t std::unordered_set insertion: 0.51
> 
> > uint32_t std::unordered_set lookup: 0.61
> 

> > In this benchmark, libc++ beats libstdc++ by more than 5% in 10
> > tests, and libstdc++ beats libc++ by more than 5% in 5 tests.
> > Again,
> > however, libc++'s downside is larger, being up to 61% slower (in
> > the
> > 'uint32_t std::unordered_set lookup' test) than libstdc++.
> > libstdc++
> > loses only by 37% to libc++, at most, in the 'uint32_t std::set
> > erase' test. Also, I can easily imagine that users are more-likely
> > to notice a performance difference in lookup than in erase.
> 

> I can't reproduce the unordered_set<uint32_t> insertion results, but
> I could reproduce the 60% difference for lookup and I managed to
> significantly reduce it.
> I checked in the optimization in r274423 which results in a ~45%
> speedup for unordered_set<...>::find.
Great! I re-ran this test and now see this: 

uint32_t std::set erase: -0.35 
std::string std::set erase: -0.32 
std::string std::set insertion: -0.23 
std::string std::unordered_set erase: -0.17 
std::string std::set lookup: -0.16 
std::string std::unordered_set iterate: -0.15 
uint32_t std::set insertion: -0.12 
std::string std::set iterate: -0.070 
uint32_t std::unordered_set iterate: -0.022 
uint32_t std::set lookup: 0.0065 
uint32_t std::set iterate: 0.0072 
uint32_t std::unordered_set lookup: 0.030 
uint32_t std::unordered_set erase: 0.072 
std::string std::unordered_set lookup: 0.12 
std::string std::unordered_set insertion: 0.20 
uint32_t std::unordered_set insertion: 0.54 

This is significantly better. I still see the insertion issue, however, and I confirmed that I still see that difference even if I strip down the benchmark only to test std::unordered_set<uint32_t> insertion. I'll file a separate bug report for this issue. 

Thanks again, 
Hal 

> > To pick another benchmark, I compiled and ran the one from
> > http://www.reedbeta.com/blog/2015/01/12/data-oriented-hash-table/ -
> > and this must be good because the post ends with, "And remember, if
> > Chandler Carruth and Mike Acton give you advice about data
> > structures, listen to them. ;)". I modified the benchmark only by
> > adding constexpr to min() and max() of XorshiftRNG to make it
> > compile with libc++. This benchmarks many configurations and takes
> > nearly an hour to run. I'll summarize the results I'll say that
> > libc++ is almost always slower than libstdc++, and that as the
> > element size and/or the number of elements increases it gets worse.
> > Here are the relative timing differences; these are tests for
> > unordered_map:
> 

> > Fill for 8-byte elements, 32-byte elements, 128-byte elements,
> > 1K-byte elements, 4K-byte elements:
> 
> > 100000 -0.022 0.026 0.10 0.057 0.058
> 
> > 200000 -0.028 0.0041 0.12 0.083 0.057
> 
> > 300000 -0.022 -0.023 0.018 0.040 0.035
> 
> > 400000 -0.010 -0.0094 -0.077 0.048 0.049
> 
> > 500000 0.22 0.28 0.18 0.17 0.094
> 
> > 600000 -0.064 -0.081 -0.11 0.0033 0.020
> 
> > 700000 -0.059 -0.043 -0.089 -0.0047 0.027
> 
> > 800000 -0.037 -0.053 -0.072 0.0092 0.035
> 
> > 900000 0.36 0.30 0.20 0.15 0.099
> 
> > 1000000 0.31 0.23 0.17 0.13 0.098
> 

> > The first column is the number of elements; negative numbers mean
> > libc++ is faster. For 1000000 4K elements, libstdc++ is faster by
> > 9.8%. For 1000000 8-byte elements, libstdc++ is faster by nearly
> > 32%.
> 

> > Pre-sized fill:
> 
> > 100000 0.024 0.020 0.12 0.091 0.042
> 
> > 200000 0.015 0.016 0.16 0.041 0.084
> 
> > 300000 0.00038 0.033 0.17 0.090 0.090
> 
> > 400000 0.070 0.041 0.069 0.084 0.094
> 
> > 500000 0.069 0.025 0.061 0.12 0.10
> 
> > 600000 -0.0096 0.023 0.016 0.074 0.072
> 
> > 700000 0.060 0.0013 0.036 0.094 0.085
> 
> > 800000 0.029 -0.0048 0.035 0.083 0.075
> 
> > 900000 -0.011 0.0025 -0.037 0.078 0.085
> 
> > 1000000 0.0022 0.019 0.011 0.084 0.081
> 

> > Time for 100K lookups:
> 
> > 100000 0.12 0.13 0.089 0.11 0.071
> 
> > 200000 0.13 0.12 0.10 0.11 0.072
> 
> > 300000 0.098 0.12 0.053 0.085 0.080
> 
> > 400000 0.18 0.11 0.088 0.059 0.030
> 
> > 500000 0.12 0.080 0.072 0.075 0.033
> 
> > 600000 0.095 0.10 0.076 0.063 0.017
> 
> > 700000 0.17 0.097 0.12 0.083 0.043
> 
> > 800000 0.16 0.13 0.11 0.092 0.050
> 
> > 900000 0.094 0.086 0.056 0.041 -0.0047
> 
> > 1000000 0.14 0.11 0.092 0.073 0.016
> 

> > Time for 100K failed lookups:
> 
> > 100000 0.15 0.15 0.11 0.087 0.077
> 
> > 200000 0.14 0.11 0.083 0.074 0.059
> 
> > 300000 0.10 0.091 0.090 0.12 0.097
> 
> > 400000 0.10 0.12 0.061 0.11 0.099
> 
> > 500000 0.19 0.20 0.12 0.12 0.12
> 
> > 600000 0.18 0.18 0.12 0.11 0.095
> 
> > 700000 0.082 0.096 0.068 0.079 0.070
> 
> > 800000 0.20 0.19 0.14 0.13 0.12
> 
> > 900000 0.20 0.17 0.10 0.11 0.097
> 
> > 1000000 0.25 0.22 0.17 0.14 0.13
> 

> > Time to remove half the elements:
> 
> > 100000 0.15 0.15 0.078 0.066 0.066
> 
> > 200000 0.12 0.12 0.039 0.055 0.060
> 
> > 300000 0.12 0.070 0.024 0.038 0.095
> 
> > 400000 0.16 0.16 0.15 0.13 0.13
> 
> > 500000 0.10 0.12 0.12 0.13 0.13
> 
> > 600000 0.077 0.015 0.036 0.026 0.049
> 
> > 700000 0.086 0.16 0.16 0.17 0.14
> 
> > 800000 0.076 0.044 0.051 0.043 0.064
> 
> > 900000 0.11 0.11 0.11 0.12 0.13
> 
> > 1000000 0.10 0.090 0.061 0.081 0.10
> 

> > Many of our users have code that is sensitive to the performance of
> > standard containers and algorithms, and this preliminary
> > benchmarking lends support to the anecdotes that libc++ is slower
> > than libstdc++. Worryingly, the extremes of these differences are
> > pretty large. Obviously application impact can't be judged by some
> > benchmarks I happened to find on the internet, but this is
> > something
> > we, as a community, should look at more closely.
> 

> > Thanks again,
> 
> > Hal
> 

> > --
> 
> > Hal Finkel
> 
> > Assistant Computational Scientist
> 
> > Leadership Computing Facility
> 
> > Argonne National Laboratory
> 

-- 

Hal Finkel 
Assistant Computational Scientist 
Leadership Computing Facility 
Argonne National Laboratory 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20160703/090e18ea/attachment.html>


More information about the cfe-dev mailing list