[cfe-dev] libc++ Performance (compared to libstdc++)
Hal Finkel via cfe-dev
cfe-dev at lists.llvm.org
Tue Jul 5 09:50:32 PDT 2016
----- Original Message -----
> From: "Ben via cfe-dev Craig" <cfe-dev at lists.llvm.org>
> To: cfe-dev at lists.llvm.org
> Sent: Tuesday, July 5, 2016 11:40:16 AM
> Subject: Re: [cfe-dev] libc++ Performance (compared to libstdc++)
>
> If you're looking for "fair" comparisons between the library portions
> of
> libc++ and libstdc++, it may make more sense to use g++ as the
> compiler
> instead. I'm pretty sure that libc++ works with g++, and I also
> believe
> that the libstdc++ that ships with gcc 5.0+ no longer uses the
> non-conforming COW strings. I would suggest using clang++ with a
> newer
> libstdc++, but I believe that isn't currently supported due to some
> ABI
> tag weirdness.
Good point. Also, using the newer libstdc++ with Clang should now be supported in Clang trunk as of a few days ago (the ABI tag attribute support landed).
-Hal
>
>
> On 7/2/2016 5:03 AM, Dennis Luehring via cfe-dev wrote:
> > these benchmarks could be a good starting point for an
> > permanten/commit-based libc++ performance regression
> > test like LLD got
> > (http://lists.llvm.org/pipermail/llvm-dev/2016-January/094132.html)
> >
> > Am 02.07.2016 um 02:55 schrieb Hal Finkel via cfe-dev:
> >> Hi everyone,
> >>
> >> I was chatting with Marshall offline last week, and I mentioned
> >> that
> >> several of my users had noted general performance regressions
> >> switching from libstdc++ to libc++. Marshall said that he's heard
> >> similar things, but has received few specific reports. He did
> >> recall
> >> looking at the problem which I believe is described here
> >> (http://aras-p.info/blog/2015/12/11/careful-with-that-stl-map-insert-eugene/),
> >> which is still a problem. I'll certainly admit that I'd not
> >> investigated most of these in detail (with the exception of a
> >> std::complex -ffast-math issue, http://reviews.llvm.org/D18639).
> >> We
> >> do have a few performance-related libc++ bugs open:
> >>
> >> https://llvm.org/bugs/show_bug.cgi?id=21192 - Reading from stdin
> >> is
> >> 1-2 orders of magnitude slower than using libstdc++ [I just tested
> >> this myself and updated the bug report].
> >> https://llvm.org/bugs/show_bug.cgi?id=19708 - std::find is
> >> significantly slower than libstdc++.
> >> https://llvm.org/bugs/show_bug.cgi?id=20837 - libc++'s std::sort
> >> is
> >> O(N^2) in the worst case (instead of O(N*ln(N))).
> >> https://llvm.org/bugs/show_bug.cgi?id=26886 - libc++'s
> >> std::stable_sort also has a worst-case complexity issue.
> >> https://llvm.org/bugs/show_bug.cgi?id=15456 - A faster
> >> implementation
> >> of std::function is possible
> >> https://llvm.org/bugs/show_bug.cgi?id=16747 and
> >> https://llvm.org/bugs/show_bug.cgi?id=21275 - Our
> >> unordered_multimap
> >> insert is much slower than libstdc++'s. In PR16747, Howard
> >> interestingly explains libc++ has this problem because of an
> >> additional (i.e. not-required-by-the-standard) guarantee that
> >> libc++
> >> provides regarding member ordering.
> >>
> >> but very few are related to containers.
> >>
> >> Baptiste Wicht has a benchmark covering use of several common
> >> standard algorithms with vectors, lists and dqueues
> >> (https://github.com/wichtounet/articles/blob/master/src/vector_list/bench.cpp)
> >> which he used for his post
> >> http://baptiste-wicht.com/posts/2012/12/cpp-benchmark-vector-list-deque.html,
> >> and I've compiled this using LLVM/Clang/libc++ r271873 @ -O3,
> >> using
> >> both libc++ and libstdc++ 4.8.5, and run on an Intel Xeon E5-2699
> >> v3
> >> @ 2.30GHz running Linux 3.10.0. If you try this yourself, note
> >> that
> >> even on a fast machine the benchmark takes several hours to run.
> >>
> >> $ clang++ -std=c++11 -O3 -I../../include -I../../../boost_1_61_0
> >> bench.cpp ../demangle.cpp ../graphs.cpp -o /tmp/b-gnu
> >> $ clang++ -std=c++11 -stdlib=libc++ -O3 -I../../include
> >> -I../../../boost_1_61_0 bench.cpp ../demangle.cpp ../graphs.cpp -o
> >> /tmp/b-llvm
> >>
> >> Of the 248 tests, libc++ was faster by at least 5% in 58 of the
> >> tests
> >> and libstdc++ was faster by at least 5% in 94 of the tests. libc++
> >> was faster by at least 20% in 14 of the tests and libstdc++ was
> >> faster by at least 20% in 64 of the tests. The real problem,
> >> however,
> >> comes from the extremums. libc++ is never more than 65% faster
> >> than
> >> libstdc++:
> >>
> >> destruction___Trivial_128_ list -0.65
> >> destruction___Trivial_4096_ vector -0.40
> >> destruction___Trivial_1024_ list -0.38
> >> destruction___Trivial_1024_ vector -0.37
> >> random_remove___NonTrivialArray_32_ vector -0.3
> >>
> >> but libc++ is sometimes over 10x slower than libstdc++:
> >>
> >> fill_back___NonTrivialStringMovable list_inserter 9.96
> >> fill_back___NonTrivialStringMovable vector_reserve 10.21
> >> fill_back___NonTrivialStringMovableNoExcept vector_reserve
> >> 10.82
> >> fill_back___NonTrivialStringMovableNoExcept vector_inserter
> >> 11.15
> >> fill_back___NonTrivialStringMovable vector_inserter 11.93
> >>
> >> I've attached the full list.
> >>
> >> A second benchmark,
> >> http://beta.visl.sdu.dk/svn/visl/tools/benchmarks/src/set.cpp
> >> (https://tinodidriksen.com/2012/02/20/cpp-set-performance-2/),
> >> modified only to repeat each test 30 instead of 7 times, and
> >> compiled
> >> as before:
> >>
> >> uint32_t std::set erase: -0.37
> >> std::string std::set erase: -0.30
> >> std::string std::set insertion: -0.23
> >> std::string std::unordered_set erase: -0.16
> >> std::string std::unordered_set iterate: -0.15
> >> std::string std::set lookup: -0.15
> >> uint32_t std::set insertion: -0.13
> >> uint32_t std::unordered_set iterate: -0.072
> >> std::string std::set iterate: -0.062
> >> uint32_t std::set iterate: -0.054
> >> uint32_t std::set lookup: -0.015
> >> uint32_t std::unordered_set erase: 0.085
> >> std::string std::unordered_set insertion: 0.22
> >> std::string std::unordered_set lookup: 0.30
> >> uint32_t std::unordered_set insertion: 0.51
> >> uint32_t std::unordered_set lookup: 0.61
> >>
> >> In this benchmark, libc++ beats libstdc++ by more than 5% in 10
> >> tests, and libstdc++ beats libc++ by more than 5% in 5 tests.
> >> Again,
> >> however, libc++'s downside is larger, being up to 61% slower (in
> >> the
> >> 'uint32_t std::unordered_set lookup' test) than libstdc++.
> >> libstdc++
> >> loses only by 37% to libc++, at most, in the 'uint32_t std::set
> >> erase' test. Also, I can easily imagine that users are more-likely
> >> to
> >> notice a performance difference in lookup than in erase.
> >>
> >> To pick another benchmark, I compiled and ran the one from
> >> http://www.reedbeta.com/blog/2015/01/12/data-oriented-hash-table/
> >> -
> >> and this must be good because the post ends with, "And remember,
> >> if
> >> Chandler Carruth and Mike Acton give you advice about data
> >> structures, listen to them. ;)". I modified the benchmark only by
> >> adding constexpr to min() and max() of XorshiftRNG to make it
> >> compile
> >> with libc++. This benchmarks many configurations and takes nearly
> >> an
> >> hour to run. I'll summarize the results I'll say that libc++ is
> >> almost always slower than libstdc++, and that as the element size
> >> and/or the number of elements increases it gets worse. Here are
> >> the
> >> relative timing differences; these are tests for unordered_map:
> >>
> >> Fill for 8-byte elements, 32-byte elements, 128-byte elements,
> >> 1K-byte elements, 4K-byte elements:
> >> 100000 -0.022 0.026 0.10 0.057 0.058
> >> 200000 -0.028 0.0041 0.12 0.083 0.057
> >> 300000 -0.022 -0.023 0.018 0.040 0.035
> >> 400000 -0.010 -0.0094 -0.077 0.048 0.049
> >> 500000 0.22 0.28 0.18 0.17 0.094
> >> 600000 -0.064 -0.081 -0.11 0.0033 0.020
> >> 700000 -0.059 -0.043 -0.089 -0.0047 0.027
> >> 800000 -0.037 -0.053 -0.072 0.0092 0.035
> >> 900000 0.36 0.30 0.20 0.15 0.099
> >> 1000000 0.31 0.23 0.17 0.13 0.098
> >>
> >> The first column is the number of elements; negative numbers mean
> >> libc++ is faster. For 1000000 4K elements, libstdc++ is faster by
> >> 9.8%. For 1000000 8-byte elements, libstdc++ is faster by nearly
> >> 32%.
> >>
> >> Pre-sized fill:
> >> 100000 0.024 0.020 0.12 0.091 0.042
> >> 200000 0.015 0.016 0.16 0.041 0.084
> >> 300000 0.00038 0.033 0.17 0.090 0.090
> >> 400000 0.070 0.041 0.069 0.084 0.094
> >> 500000 0.069 0.025 0.061 0.12 0.10
> >> 600000 -0.0096 0.023 0.016 0.074 0.072
> >> 700000 0.060 0.0013 0.036 0.094 0.085
> >> 800000 0.029 -0.0048 0.035 0.083 0.075
> >> 900000 -0.011 0.0025 -0.037 0.078 0.085
> >> 1000000 0.0022 0.019 0.011 0.084 0.081
> >>
> >> Time for 100K lookups:
> >> 100000 0.12 0.13 0.089 0.11 0.071
> >> 200000 0.13 0.12 0.10 0.11 0.072
> >> 300000 0.098 0.12 0.053 0.085 0.080
> >> 400000 0.18 0.11 0.088 0.059 0.030
> >> 500000 0.12 0.080 0.072 0.075 0.033
> >> 600000 0.095 0.10 0.076 0.063 0.017
> >> 700000 0.17 0.097 0.12 0.083 0.043
> >> 800000 0.16 0.13 0.11 0.092 0.050
> >> 900000 0.094 0.086 0.056 0.041 -0.0047
> >> 1000000 0.14 0.11 0.092 0.073 0.016
> >>
> >> Time for 100K failed lookups:
> >> 100000 0.15 0.15 0.11 0.087 0.077
> >> 200000 0.14 0.11 0.083 0.074 0.059
> >> 300000 0.10 0.091 0.090 0.12 0.097
> >> 400000 0.10 0.12 0.061 0.11 0.099
> >> 500000 0.19 0.20 0.12 0.12 0.12
> >> 600000 0.18 0.18 0.12 0.11 0.095
> >> 700000 0.082 0.096 0.068 0.079 0.070
> >> 800000 0.20 0.19 0.14 0.13 0.12
> >> 900000 0.20 0.17 0.10 0.11 0.097
> >> 1000000 0.25 0.22 0.17 0.14 0.13
> >>
> >> Time to remove half the elements:
> >> 100000 0.15 0.15 0.078 0.066 0.066
> >> 200000 0.12 0.12 0.039 0.055 0.060
> >> 300000 0.12 0.070 0.024 0.038 0.095
> >> 400000 0.16 0.16 0.15 0.13 0.13
> >> 500000 0.10 0.12 0.12 0.13 0.13
> >> 600000 0.077 0.015 0.036 0.026 0.049
> >> 700000 0.086 0.16 0.16 0.17 0.14
> >> 800000 0.076 0.044 0.051 0.043 0.064
> >> 900000 0.11 0.11 0.11 0.12 0.13
> >> 1000000 0.10 0.090 0.061 0.081 0.10
> >>
> >> Many of our users have code that is sensitive to the performance
> >> of
> >> standard containers and algorithms, and this preliminary
> >> benchmarking
> >> lends support to the anecdotes that libc++ is slower than
> >> libstdc++.
> >> Worryingly, the extremes of these differences are pretty large.
> >> Obviously application impact can't be judged by some benchmarks I
> >> happened to find on the internet, but this is something we, as a
> >> community, should look at more closely.
> >>
> >> Thanks again,
> >> Hal
> >>
> >>
> >>
> >> _______________________________________________
> >> cfe-dev mailing list
> >> cfe-dev at lists.llvm.org
> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
> >
> >
> > _______________________________________________
> > cfe-dev mailing list
> > cfe-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
> --
> Employee of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
> Linux Foundation Collaborative Project
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
More information about the cfe-dev
mailing list