<div dir="ltr">Hi Hal,<div><br></div><div>Thanks for putting this together. I'll start looking at the benchmarks and seeing what I can do.</div><div>Here are my drive by thoughts (more to come tomorrow)</div><div><br></div><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jul 1, 2016 at 6:55 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">Hi everyone,<br>

<br>

I was chatting with Marshall offline last week, and I mentioned that several of my users had noted general performance regressions switching from libstdc++ to libc++. Marshall said that he's heard similar things, but has received few specific reports. He did recall looking at the problem which I believe is described here (<a href="http://aras-p.info/blog/2015/12/11/careful-with-that-stl-map-insert-eugene/" rel="noreferrer" target="_blank">http://aras-p.info/blog/2015/12/11/careful-with-that-stl-map-insert-eugene/</a>), which is still a problem. I'll certainly admit that I'd not investigated most of these in detail (with the exception of a std::complex -ffast-math issue, <a href="http://reviews.llvm.org/D18639" rel="noreferrer" target="_blank">http://reviews.llvm.org/D18639</a>). We do have a few performance-related libc++ bugs open:<br>

<br>

<a href="https://llvm.org/bugs/show_bug.cgi?id=21192" rel="noreferrer" target="_blank">https://llvm.org/bugs/show_bug.cgi?id=21192</a> - Reading from stdin is 1-2 orders of magnitude slower than using libstdc++ [I just tested this myself and updated the bug report].<br>

<a href="https://llvm.org/bugs/show_bug.cgi?id=19708" rel="noreferrer" target="_blank">https://llvm.org/bugs/show_bug.cgi?id=19708</a> - std::find is significantly slower than libstdc++.<br>

<a href="https://llvm.org/bugs/show_bug.cgi?id=20837" rel="noreferrer" target="_blank">https://llvm.org/bugs/show_bug.cgi?id=20837</a> - libc++'s std::sort is O(N^2) in the worst case (instead of O(N*ln(N))).<br>

<a href="https://llvm.org/bugs/show_bug.cgi?id=26886" rel="noreferrer" target="_blank">https://llvm.org/bugs/show_bug.cgi?id=26886</a> - libc++'s std::stable_sort also has a worst-case complexity issue.<br>

<a href="https://llvm.org/bugs/show_bug.cgi?id=15456" rel="noreferrer" target="_blank">https://llvm.org/bugs/show_bug.cgi?id=15456</a> - A faster implementation of std::function is possible<br>

<a href="https://llvm.org/bugs/show_bug.cgi?id=16747" rel="noreferrer" target="_blank">https://llvm.org/bugs/show_bug.cgi?id=16747</a> and <a href="https://llvm.org/bugs/show_bug.cgi?id=21275" rel="noreferrer" target="_blank">https://llvm.org/bugs/show_bug.cgi?id=21275</a> - Our unordered_multimap insert is much slower than libstdc++'s. In PR16747, Howard interestingly explains libc++ has this problem because of an additional (i.e. not-required-by-the-standard) guarantee that libc++ provides regarding member ordering.<br>

<br>

but very few are related to containers.<br>

<br>

Baptiste Wicht has a benchmark covering use of several common standard algorithms with vectors, lists and dqueues (<a href="https://github.com/wichtounet/articles/blob/master/src/vector_list/bench.cpp" rel="noreferrer" target="_blank">https://github.com/wichtounet/articles/blob/master/src/vector_list/bench.cpp</a>) which he used for his post <a href="http://baptiste-wicht.com/posts/2012/12/cpp-benchmark-vector-list-deque.html" rel="noreferrer" target="_blank">http://baptiste-wicht.com/posts/2012/12/cpp-benchmark-vector-list-deque.html</a>, and I've compiled this using LLVM/Clang/libc++ r271873 @ -O3, using both libc++ and libstdc++ 4.8.5, and run on an Intel Xeon E5-2699 v3 @ 2.30GHz running Linux 3.10.0. If you try this yourself, note that even on a fast machine the benchmark takes several hours to run.<br>

<br>

$ clang++ -std=c++11 -O3 -I../../include -I../../../boost_1_61_0 bench.cpp ../demangle.cpp ../graphs.cpp -o /tmp/b-gnu<br>

$ clang++ -std=c++11 -stdlib=libc++ -O3 -I../../include -I../../../boost_1_61_0 bench.cpp ../demangle.cpp ../graphs.cpp -o /tmp/b-llvm<br>

<br>

Of the 248 tests, libc++ was faster by at least 5% in 58 of the tests and libstdc++ was faster by at least 5% in 94 of the tests. libc++ was faster by at least 20% in 14 of the tests and libstdc++ was faster by at least 20% in 64 of the tests. The real problem, however, comes from the extremums. libc++ is never more than 65% faster than libstdc++:<br>

<br>

destruction___Trivial_128_      list    -0.65<br>

destruction___Trivial_4096_     vector  -0.40<br>

destruction___Trivial_1024_     list    -0.38<br>

destruction___Trivial_1024_     vector  -0.37<br>

random_remove___NonTrivialArray_32_     vector  -0.3<br></blockquote><div><br></div><div>Whoo hoo!</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

<br>

but libc++ is sometimes over 10x slower than libstdc++:<br>

<br>

fill_back___NonTrivialStringMovable     list_inserter   9.96<br>

fill_back___NonTrivialStringMovable     vector_reserve  10.21<br>

fill_back___NonTrivialStringMovableNoExcept     vector_reserve  10.82<br>

fill_back___NonTrivialStringMovableNoExcept     vector_inserter 11.15<br>

fill_back___NonTrivialStringMovable     vector_inserter 11.93<br></blockquote><div><br></div><div><br></div><div>This is almost certainly testing against GCC's COW string, which is obviously going to be a lot faster.</div><div>I looked at the tests and it looks like the strings are being copied (not moved) into the vector, so the name is a bit of the type is a bit of a misnomer.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

<br>

I've attached the full list.<br>

<br>

A second benchmark, <a href="http://beta.visl.sdu.dk/svn/visl/tools/benchmarks/src/set.cpp" rel="noreferrer" target="_blank">http://beta.visl.sdu.dk/svn/visl/tools/benchmarks/src/set.cpp</a> (<a href="https://tinodidriksen.com/2012/02/20/cpp-set-performance-2/" rel="noreferrer" target="_blank">https://tinodidriksen.com/2012/02/20/cpp-set-performance-2/</a>), modified only to repeat each test 30 instead of 7 times, and compiled as before:<br>

<br>

uint32_t std::set erase: -0.37<br>

std::string std::set erase: -0.30<br>

std::string std::set insertion: -0.23<br>

std::string std::unordered_set erase: -0.16<br>

std::string std::unordered_set iterate: -0.15<br>

std::string std::set lookup: -0.15<br>

uint32_t std::set insertion: -0.13<br>

uint32_t std::unordered_set iterate: -0.072<br>

std::string std::set iterate: -0.062<br>

uint32_t std::set iterate: -0.054<br>

uint32_t std::set lookup: -0.015<br>

uint32_t std::unordered_set erase: 0.085<br>

std::string std::unordered_set insertion: 0.22<br>

std::string std::unordered_set lookup: 0.30<br>

uint32_t std::unordered_set insertion: 0.51<br>

uint32_t std::unordered_set lookup: 0.61<br>

<br>

In this benchmark, libc++ beats libstdc++ by more than 5% in 10 tests, and libstdc++ beats libc++ by more than 5% in 5 tests. Again, however, libc++'s downside is larger, being up to 61% slower (in the 'uint32_t std::unordered_set lookup' test) than libstdc++. libstdc++ loses only by 37% to libc++, at most, in the 'uint32_t std::set erase' test. Also, I can easily imagine that users are more-likely to notice a performance difference in lookup than in erase.<br></blockquote><div><br></div><div>I can't reproduce the unordered_set<uint32_t> insertion results, but I could reproduce the 60% difference for lookup and I managed to significantly reduce it.</div><div>I checked in the optimization in r274423 which results in a ~45% speedup for unordered_set<...>::find.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

<br>

To pick another benchmark, I compiled and ran the one from <a href="http://www.reedbeta.com/blog/2015/01/12/data-oriented-hash-table/" rel="noreferrer" target="_blank">http://www.reedbeta.com/blog/2015/01/12/data-oriented-hash-table/</a> - and this must be good because the post ends with, "And remember, if Chandler Carruth and Mike Acton give you advice about data structures, listen to them. ;)". I modified the benchmark only by adding constexpr to min() and max() of XorshiftRNG to make it compile with libc++. This benchmarks many configurations and takes nearly an hour to run. I'll summarize the results I'll say that libc++ is almost always slower than libstdc++, and that as the element size and/or the number of elements increases it gets worse. Here are the relative timing differences; these are tests for unordered_map:<br>

<br>

Fill for 8-byte elements, 32-byte elements, 128-byte elements, 1K-byte elements, 4K-byte elements:<br>

100000 -0.022 0.026 0.10 0.057 0.058<br>

200000 -0.028 0.0041 0.12 0.083 0.057<br>

300000 -0.022 -0.023 0.018 0.040 0.035<br>

400000 -0.010 -0.0094 -0.077 0.048 0.049<br>

500000 0.22 0.28 0.18 0.17 0.094<br>

600000 -0.064 -0.081 -0.11 0.0033 0.020<br>

700000 -0.059 -0.043 -0.089 -0.0047 0.027<br>

800000 -0.037 -0.053 -0.072 0.0092 0.035<br>

900000 0.36 0.30 0.20 0.15 0.099<br>

1000000 0.31 0.23 0.17 0.13 0.098<br>

<br>

The first column is the number of elements; negative numbers mean libc++ is faster. For 1000000 4K elements, libstdc++ is faster by 9.8%. For 1000000 8-byte elements, libstdc++ is faster by nearly 32%.<br>

<br>

Pre-sized fill:<br>

100000 0.024 0.020 0.12 0.091 0.042<br>

200000 0.015 0.016 0.16 0.041 0.084<br>

300000 0.00038 0.033 0.17 0.090 0.090<br>

400000 0.070 0.041 0.069 0.084 0.094<br>

500000 0.069 0.025 0.061 0.12 0.10<br>

600000 -0.0096 0.023 0.016 0.074 0.072<br>

700000 0.060 0.0013 0.036 0.094 0.085<br>

800000 0.029 -0.0048 0.035 0.083 0.075<br>

900000 -0.011 0.0025 -0.037 0.078 0.085<br>

1000000 0.0022 0.019 0.011 0.084 0.081<br>

<br>

Time for 100K lookups:<br>

100000 0.12 0.13 0.089 0.11 0.071<br>

200000 0.13 0.12 0.10 0.11 0.072<br>

300000 0.098 0.12 0.053 0.085 0.080<br>

400000 0.18 0.11 0.088 0.059 0.030<br>

500000 0.12 0.080 0.072 0.075 0.033<br>

600000 0.095 0.10 0.076 0.063 0.017<br>

700000 0.17 0.097 0.12 0.083 0.043<br>

800000 0.16 0.13 0.11 0.092 0.050<br>

900000 0.094 0.086 0.056 0.041 -0.0047<br>

1000000 0.14 0.11 0.092 0.073 0.016<br>

<br>

Time for 100K failed lookups:<br>

100000 0.15 0.15 0.11 0.087 0.077<br>

200000 0.14 0.11 0.083 0.074 0.059<br>

300000 0.10 0.091 0.090 0.12 0.097<br>

400000 0.10 0.12 0.061 0.11 0.099<br>

500000 0.19 0.20 0.12 0.12 0.12<br>

600000 0.18 0.18 0.12 0.11 0.095<br>

700000 0.082 0.096 0.068 0.079 0.070<br>

800000 0.20 0.19 0.14 0.13 0.12<br>

900000 0.20 0.17 0.10 0.11 0.097<br>

1000000 0.25 0.22 0.17 0.14 0.13<br>

<br>

Time to remove half the elements:<br>

100000 0.15 0.15 0.078 0.066 0.066<br>

200000 0.12 0.12 0.039 0.055 0.060<br>

300000 0.12 0.070 0.024 0.038 0.095<br>

400000 0.16 0.16 0.15 0.13 0.13<br>

500000 0.10 0.12 0.12 0.13 0.13<br>

600000 0.077 0.015 0.036 0.026 0.049<br>

700000 0.086 0.16 0.16 0.17 0.14<br>

800000 0.076 0.044 0.051 0.043 0.064<br>

900000 0.11 0.11 0.11 0.12 0.13<br>

1000000 0.10 0.090 0.061 0.081 0.10<br>

<br>

Many of our users have code that is sensitive to the performance of standard containers and algorithms, and this preliminary benchmarking lends support to the anecdotes that libc++ is slower than libstdc++. Worryingly, the extremes of these differences are pretty large. Obviously application impact can't be judged by some benchmarks I happened to find on the internet, but this is something we, as a community, should look at more closely.<br>

<br>

Thanks again,<br>

Hal<br>

<span class=""><font color="#888888"><br>

--<br>

Hal Finkel<br>

Assistant Computational Scientist<br>

Leadership Computing Facility<br>

Argonne National Laboratory<br>

</font></span></blockquote></div><br></div></div></div>