<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - Optimiser misses good optimisations for std::merge"
href="https://bugs.llvm.org/show_bug.cgi?id=42334">42334</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>Optimiser misses good optimisations for std::merge
</td>
</tr>
<tr>
<th>Product</th>
<td>clang
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>All
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>LLVM Codegen
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedclangbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>denis.yaroshevskij@gmail.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org, neeilans@live.com, richard-llvm@metafoo.co.uk
</td>
</tr></table>
<p>
<div>
<pre>Hi.
Current implementation of std::merge can be significantly optimised and Roman
Lebedev (@lebedev.ri) suggested that these changes should be done by the
optimiser.
1) Just check for one boundary on every loop iteration instead of two.
If you look at the code:
<a href="https://github.com/llvm-mirror/libcxx/blob/b82bfabdfd93f8616df8e6510021642e1fd764ed/include/algorithm#L4353">https://github.com/llvm-mirror/libcxx/blob/b82bfabdfd93f8616df8e6510021642e1fd764ed/include/algorithm#L4353</a>
You can see that there is only one iterator getting advanced on every
iteration.
However - the code checks for both boundaries.
If we stop doing that and only check for one boundary - there is up to 30%
speed up on some benchmarks.
Here is a quick-bench with a 1.2 win:
<a href="http://quick-bench.com/UlL4LXNkrDL_rp6QNAFpIqRtsG0">http://quick-bench.com/UlL4LXNkrDL_rp6QNAFpIqRtsG0</a>
Here is the same diff on godbolt: <a href="https://gcc.godbolt.org/z/KpIsXK">https://gcc.godbolt.org/z/KpIsXK</a>
2) Unrolling one iteration for the first range after the check for second is
very beneficial to performance.
Here is the comparison with version 2 on quick-bench:
<a href="http://quick-bench.com/JsNpf5VWYuBftR2vI4SxDqYJt8M">http://quick-bench.com/JsNpf5VWYuBftR2vI4SxDqYJt8M</a> - extra 1.2 times (1.5
compared to std::merge)
Here is a comparison on godbolt: <a href="https://gcc.godbolt.org/z/SP43AH">https://gcc.godbolt.org/z/SP43AH</a>
The reason for this speed up (as I understand it) is that we don't do a jump
when second than first case. + I suspect that the loops tap into a more natural
pattern, like do while for the second range.
There is practically no difference between 1 and 2 optimization in size -
because in second case two memmoves are collapsed into one, which is also nice.
I have very limited understanding of what optimiser is capable and not capable
of doing for me - please take a look if this is possible.
FYI: std::stable_sort should also benefit from this kind of optimisation.
std::stable_sort is a very important algorithm, for example, in Chromium it
ends up used in some expensive places.
Related links:
Pr to change std::merge (unlikely to be accepted since goto is not constexpr
and I can't rewrite 2 without goto): <a href="https://reviews.llvm.org/D63063">https://reviews.llvm.org/D63063</a>
Switch based state machines do not get optimised:
<a class="bz_bug_link
bz_status_NEW "
title="NEW - Optimizing switch to jumps"
href="show_bug.cgi?id=42313">https://bugs.llvm.org/show_bug.cgi?id=42313</a></pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>