<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Optimiser misses good optimisations for std::merge"

   href="https://bugs.llvm.org/show_bug.cgi?id=42334">42334</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Optimiser misses good optimisations for std::merge

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>clang

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>LLVM Codegen

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedclangbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>denis.yaroshevskij@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org, neeilans@live.com, richard-llvm@metafoo.co.uk

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Hi.

Current implementation of std::merge can be significantly optimised and Roman

Lebedev (@lebedev.ri) suggested that these changes should be done by the

optimiser.

1) Just check for one boundary on every loop iteration instead of two.

If you look at the code:

<a href="https://github.com/llvm-mirror/libcxx/blob/b82bfabdfd93f8616df8e6510021642e1fd764ed/include/algorithm#L4353">https://github.com/llvm-mirror/libcxx/blob/b82bfabdfd93f8616df8e6510021642e1fd764ed/include/algorithm#L4353</a>

You can see that there is only one iterator getting advanced on every

iteration.

However - the code checks for both boundaries.

If we stop doing that and only check for one boundary - there is up to 30%

speed up on some benchmarks.

Here is a quick-bench with a 1.2 win:

<a href="http://quick-bench.com/UlL4LXNkrDL_rp6QNAFpIqRtsG0">http://quick-bench.com/UlL4LXNkrDL_rp6QNAFpIqRtsG0</a>

Here is the same diff on godbolt: <a href="https://gcc.godbolt.org/z/KpIsXK">https://gcc.godbolt.org/z/KpIsXK</a>

2) Unrolling one iteration for the first range after the check for second is

very beneficial to performance.

Here is the comparison with version 2 on quick-bench:

<a href="http://quick-bench.com/JsNpf5VWYuBftR2vI4SxDqYJt8M">http://quick-bench.com/JsNpf5VWYuBftR2vI4SxDqYJt8M</a> - extra 1.2 times (1.5

compared to std::merge)

Here is a comparison on godbolt: <a href="https://gcc.godbolt.org/z/SP43AH">https://gcc.godbolt.org/z/SP43AH</a>

The reason for this speed up (as I understand it) is that we don't do a jump

when second than first case. + I suspect that the loops tap into a more natural

pattern, like do while for the second range.

There is practically no difference between 1 and 2 optimization in size -

because in second case two memmoves are collapsed into one, which is also nice.

I have very limited understanding of what optimiser is capable and not capable

of doing for me - please take a look if this is possible.

FYI: std::stable_sort should also benefit from this kind of optimisation.

std::stable_sort is a very important algorithm, for example, in Chromium it

ends up used in some expensive places.