[llvm-bugs] [Bug 42334] New: Optimiser misses good optimisations for std::merge

Wed Jun 19 12:36:01 PDT 2019

https://bugs.llvm.org/show_bug.cgi?id=42334

            Bug ID: 42334
           Summary: Optimiser misses good optimisations for std::merge
           Product: clang
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: LLVM Codegen
          Assignee: unassignedclangbugs at nondot.org
          Reporter: denis.yaroshevskij at gmail.com
                CC: llvm-bugs at lists.llvm.org, neeilans at live.com,
                    richard-llvm at metafoo.co.uk

Hi.

Current implementation of std::merge can be significantly optimised and Roman
Lebedev (@lebedev.ri) suggested that these changes should be done by the
optimiser.

1) Just check for one boundary on every loop iteration instead of two.

If you look at the code:
https://github.com/llvm-mirror/libcxx/blob/b82bfabdfd93f8616df8e6510021642e1fd764ed/include/algorithm#L4353
You can see that there is only one iterator getting advanced on every
iteration.
However - the code checks for both boundaries.

If we stop doing that and only check for one boundary - there is up to 30%
speed up on some benchmarks.
Here is a quick-bench with a 1.2 win:
http://quick-bench.com/UlL4LXNkrDL_rp6QNAFpIqRtsG0

Here is the same diff on godbolt: https://gcc.godbolt.org/z/KpIsXK

2) Unrolling one iteration for the first range after the check for second is
very beneficial to performance.
Here is the comparison with version 2 on quick-bench:
http://quick-bench.com/JsNpf5VWYuBftR2vI4SxDqYJt8M - extra 1.2 times (1.5
compared to std::merge)

Here is a comparison on godbolt: https://gcc.godbolt.org/z/SP43AH

The reason for this speed up (as I understand it) is that we don't do a jump
when second than first case. + I suspect that the loops tap into a more natural
pattern, like do while for the second range.
There is practically no difference between 1 and 2 optimization in size -
because in second case two memmoves are collapsed into one, which is also nice.

I have very limited understanding of what optimiser is capable and not capable
of doing for me - please take a look if this is possible.

FYI: std::stable_sort should also benefit from this kind of optimisation.
std::stable_sort is a very important algorithm, for example, in Chromium it
ends up used in some expensive places.

Related links:
Pr to change std::merge (unlikely to be accepted since goto is not constexpr
and I can't rewrite 2 without goto): https://reviews.llvm.org/D63063
Switch based state machines do not get optimised:
https://bugs.llvm.org/show_bug.cgi?id=42313

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20190619/c76321a5/attachment.html>