[cfe-dev] Potential optimization for std::inplace_merge

Mon Oct 31 10:56:52 PDT 2016

----- Original Message -----

> From: "Morwenn Ed via cfe-dev" <cfe-dev at lists.llvm.org>
> To: cfe-dev at lists.llvm.org
> Sent: Thursday, October 27, 2016 2:12:39 PM
> Subject: [cfe-dev] Potential optimization for std::inplace_merge

> I have been working on a sorting library for a while, and it happens
> to contain reimplementations of many standard algorithms borrowed
> from libc++ (mostly because I needed to add a few more features to
> standard algorithms, akin to those in range-v3). I was working on a
> top-down in-place mergesort implementation, and I believe that I
> have found a potential albeit small optimization for
> std::inplace_merge , specifically in the helper function
> half_inplace_merge . Here is the original function borrowed from
> libc++ (coding style has been changed, don't pay attention to that):

> template<typename Compare, typename InputIterator1,
> typename InputIterator2, typename OutputIterator>
> auto half_inplace_merge(InputIterator1 first1, InputIterator1 last1,
> InputIterator2 first2, InputIterator2 last2,
> OutputIterator result, Compare comp)
> -> void
> {
> for (; first1 != last1 ; ++result) {
> if (first2 == last2) {
> std::move(first1, last1, result);
> return;
> }

> if (comp(*first2, *first1)) {
> *result = std::move(*first2);
> ++first2;
> } else {
> *result = std::move(*first1);
> ++first1;
> }
> }
> // first2 through last2 are already in the right spot.
> }

> I made a simple observation: at every loop iteration, we check
> whether first1 is different from last1 and whether first2 is equal
> to last2 . However, we know that we will loop at least min(last1 -
> first1, last2 - first2) times before actually needing to check
> whether first1 == last1 or first2 == last2 , so why not take
> advantage of that by comparing against an integer counter and
> skipping a few iterator comparisons? Moreover, the value min(last1 -
> first1, last2 - first2) is already known when half_inplace_merge is
> called, so we can directly pass them to the function? The modified
> function looked like this:

> template<typename Compare, typename InputIterator1,
> typename InputIterator2, typename OutputIterator,
> typename Size>
> auto half_inplace_merge(InputIterator1 first1, InputIterator1 last1,
> InputIterator2 first2, InputIterator2 last2,
> OutputIterator result, Size min_len,
> Compare comp)
> -> void
> {
> for (; min_len != 0 ; --min_len) {
> if (comp(*first2, *first1)) {
> *result = std::move(*first2);
> ++first2;
> } else {
> *result = std::move(*first1);
> ++first1;
> }
> ++result;
> }

> for (; first1 != last1 ; ++result) {
> if (first2 == last2) {
> std::move(first1, last1, result);
> return;
> }

> if (comp(*first2, *first1)) {
> *result = std::move(*first2);
> ++first2;
> } else {
> *result = std::move(*first1);
> ++first1;
> }
> }
> // first2 through last2 are already in the right spot.
> }

> However, at this point, the new "optimization" made the code
> dramatically slower than it was before when sorting a vector of one
> million integers with a mergesort built on the top of
> std::inplace_merge . It kind of baffled me until I finally noticed
> that the second loop had some information that the first didn't
> have: the second loop knows that first1 != last1 and first2 != last2
> while it runs. However, it is apparently possible to give that
> information to the first loop too. I defined the following macro:

> #define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); }
> while(0)
FYI, for Clang, you'll need to use __builtin_assume(cond) (http://clang.llvm.org/docs/LanguageExtensions.html#builtin-functions). 

-Hal 

> Then I changed the first loop (the "optimization") as follows:

> fo r (; min_len != 0 ; --min_len) {

> ASSUME(first1 != last1);
> ASSUME(first2 != last2);

> if (comp(*first2, *first1)) {
> *result = std::move(*first2);
> ++first2;
> } else {
> *result = std::move(*first1);
> ++first1;
> }
> ++result;
> }

> This additional information solved the "dramatically slower" problem,
> and then running the benchmarks again showed that the mergesort
> based on the modified half_inplace_merge was generally slightly
> faster than the old one (sometimes the difference is lost in the
> noise, but there are also cases where the difference is neat albeit
> small). I guess that the ASSUME macro actually solved an aliasing
> problem which wasn't present in the second loop due to its loop
> conditions.

> Anyway, I benchmarked that code using MinGW GCC 6.2, so the results
> might be different with Clang. I wanted to tell that small
> optimization story, and felt that I needed to share it here since I
> borrowed the original code from libc++. Now, you can use it as you
> want if you feel that the small performance gain proves to be
> consistent and worth it (IIRC you have benchmarks for some standard
> algorithms).

> The code used for the benchmark is a modified version of the
> following:
> https://github.com/Morwenn/cpp-sort/blob/master/benchmarks/bench.cpp
> Here is a graph with the results I got; merge_sort-new uses the
> modified half_inplace_merge (the names on the left correspond to
> different data distributions used to test the sorting algorithms):
> https://i.imgur.com/hwzJ84U.

> With plenty of love <3

> Morwenn

> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20161031/3ea361fd/attachment.html>