[libcxx-commits] [PATCH] D118029: Introduce branchless sorting functions for sort3, sort4 and sort5.

Wed Mar 23 13:42:35 PDT 2022

ldionne added a comment.

Thanks for the ping! I still have some comments, I'm still trying to make sure I understand all the implications of this change since `std::sort` is such an important (and non-trivial as we can see) algorithm!

================
Comment at: libcxx/benchmarks/algorithms.bench.cpp:27
 template <class V>
-using Value = std::conditional_t<
-    V() == ValueType::Uint32, uint32_t,
-    std::conditional_t<
-        V() == ValueType::Uint64, uint64_t,
-        std::conditional_t<
-            V() == ValueType::Pair, std::pair<uint32_t, uint32_t>,
-            std::conditional_t<V() == ValueType::Tuple,
-                               std::tuple<uint32_t, uint64_t, uint32_t>,
-                               std::string> > > >;
+using Value = std::tuple_element_t<(int)V::value, Types>;

----------------

================
Comment at: libcxx/include/__algorithm/sort.h:118
                     swap(*__x1, *__x2);
                     ++__r;
                 }
----------------
Commenting here for lack of better place: With this patch, I think we need to add tests for `std::sort` for a random access but non-contiguous range, otherwise we may end up testing. And for sorting on non arithmetic types, too. It looks like our `std::sort` tests might be lacking quite a bit, now that I look at them.

================
Comment at: libcxx/include/__algorithm/sort.h:154-155
+
+template <class _Compare, class _RandomAccessIterator>
+inline _LIBCPP_HIDE_FROM_ABI __enable_if_t<__use_branchless_sort<_RandomAccessIterator>::value, void>
+__sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,
----------------
We usually use the following idiom for SFINAE (this is supported in C++03 as a Clang extension):

```
template <class _Compare, class _RandomAccessIterator, __enable_if_t<
    __use_branchless_sort<_RandomAccessIterator>::value
>* = 0>
inline _LIBCPP_HIDE_FROM_ABI
void __sort3_maybe_branchless(...) { ... }
```

It has the benefit that the return type is not hidden in the `__enable_if_t`.

================
Comment at: libcxx/include/__algorithm/sort.h:127
+template <class _Iter, class _Tp = typename iterator_traits<_Iter>::value_type>
+using __use_branchless_sort = bool_constant< __is_cpp17_contiguous_iterator<_Iter>::value &&
+                                             sizeof(_Tp) <= sizeof(void*) && is_arithmetic<_Tp>::value >;
----------------
philnik wrote:
> marcogelmi wrote:
> > bool_constant seems to need >= c++14, any suggestion on how to make this work for c++03?
> use `_LIBCPP_BOOL_CONSTANT`
I would just go for `integral_constant<bool, EXPR>`. I actually had to look up `_LIBCPP_BOOL_CONSTANT`, and IMO it does not pull its weight. I'll upload a patch to remove it and we can discuss its fate there.

================
Comment at: libcxx/include/__algorithm/sort.h:160
+inline _LIBCPP_HIDDEN
+    __enable_if_t<!_VSTD::is_integral< typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>
+    __sort3_internal(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3, _Compare __c) {
----------------
marcogelmi wrote:
> Quuxplusone wrote:
> > Quuxplusone wrote:
> > > marcogelmi wrote:
> > > > philnik wrote:
> > > > > marcogelmi wrote:
> > > > > > marcogelmi wrote:
> > > > > > > Quuxplusone wrote:
> > > > > > > > philnik wrote:
> > > > > > > > > Is there a reason you restricted this to integrals? Does it pessimize some code? I guessing at least all primitive types should be allowed. I haven't benchmarked, but looking at code generation it seems floating points should also benefit from this change.
> > > > > > > > `is_scalar` might be relevant.
> > > > > > > this relies on cmovs, on larger data types it could pessimize the code if it compiles to extra copies instead, or not compile at all if it's not trivially copyable.
> > > > > > switched to is_arithmetic and added benchmarks here: https://bit.ly/3rgv2r9
> > > > > Could we generalize it to something like `sizeof(typename iterator_traits<_RandomAccessIterator>::value_type) <= sizeof(size_t) && is_trivially_copyable_v<typename iterator_traits<_RandomAccessIterator>::value_type>`? I'm assuming here that `sizeof(size_t)` is less than or equal to the word size. This would allow this optimization to be performed on something like `std::optional<int>`. Could you check if that still yields a performance improvement? `std::optional` doesn't have a trivial comparison operator, so it might be interesting.
> > > > so I tried testing it on `optional<int>` and the results are underwhelming (pretty much equivalent on intel skylake and actually slower on ARM - see below). 
> > > > 
> > > > I think this is because the compiler doesn't generate conditional moves, see https://godbolt.org/z/o6648s6f1
> > > > 
> > > > I wonder if there is a way to ensure it's enabled only for types for which conditional moves will be generated...
> > > > 
> > > > Results on ARM:
> > > > 
> > > > ```
> > > > name                            old cpu/op  new cpu/op  delta
> > > > BM_Sort_optional_Random_1       4.02ns ± 0%  4.01ns ± 1%   -0.36%   (p=0.000 n=99+98)
> > > > BM_Sort_optional_Random_3       5.33ns ± 0%  5.46ns ± 0%   +2.42%   (p=0.000 n=94+93)
> > > > BM_Sort_optional_Random_4       5.39ns ± 0%  6.17ns ± 0%  +14.45%   (p=0.000 n=99+99)
> > > > BM_Sort_optional_Random_5       6.47ns ± 0%  7.35ns ± 0%  +13.54%   (p=0.000 n=99+98)
> > > > BM_Sort_optional_Random_16      10.2ns ± 1%  10.8ns ± 2%   +5.74%  (p=0.000 n=95+100)
> > > > ```
> > > > 
> > > FWIW, I think it would be worth pulling out the condition into a named trait, like
> > > ```
> > > template<class _Tp>
> > > using __use_branchless_sort = bool_constant<
> > >   sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
> > > >;
> > > ```
> > > All this focus on `value_type` has recently made me paranoid about `vector<bool>` and proxy iterators in general. I see your `__partially_sorted_swap` doing like 13 calls to `operator*`; are we sure that's still going to be efficient when the `_RandomAccessIterator` in question is an arbitrarily complicated Ranges thing? (We don't support `ranges::sort` yet, but it would be unfortunate if we poured all this effort into `std::sort` and then had to re-evaluate the entire thing for `ranges::sort`.)
> > > 
> > > Do you actually have a use-case for arbitrary random-access iterators? Could we limit this to //contiguous// iterators?
> > > ```
> > > template<class _Iter, class _Tp = typename iterator_traits<_Iter>::value_type>
> > > using __use_branchless_sort = bool_constant<
> > >   __is_cpp17_contiguous_iterator<_Iter> &&
> > >   sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
> > > >;
> > > ```
> > > Also note I keep trying to expand `arithmetic` to `scalar`. :)
> > > Regarding `is_arithmetic` vs `is_scalar`: expanding to `is_scalar` would include pointer types (which are likely to have a custom comparator) so we would have to test whether our changes provide similar improvements when you have a complex custom comparator.
> > 
> > Hmm. IOW, you're saying that you expect this optimization to perform well on simple predicates like
> > ```
> > int is[1000];
> > Widget *ps[1000];
> > std::sort(is, is+1000, std::less<>);  // A
> > std::sort(ps, ps+1000, std::less<>);  // B
> > ```
> > but poorly on complicated predicates like
> > ```
> > Widget ws[10];
> > std::sort(is, is+1000, [&](int i, int j) { return ws[i] < ws[j]; });  // C
> > std::sort(ps, ps+1000, [](auto *pi, auto *pj) { return *pi < *pj; });  // D
> > ```
> > That is, you expect that this optimization, applied across the board, would measurably //speed up// A and B, but measurably //slow down// C and D.  And you expect that "pointer-ness" correlates heavily with "complicated-predicate-ness," so A is relatively much more common than C but B is much //less// common than D.  So, your patch intentionally applies only to A and C, and excludes B and D.  Is that all correct?
> > I do still think it'd be interesting to find out whether this patch slows down C and D.
> You are correct. Our patch applies to A and B and we are pretty certain that our changes will slow down C and D, because our branchless version calls the comparator more times in the average case. This is why we have limited ourselves to A and B. Note that our patch not only applies to ints but also has similar performance improvements for floats. @ldionne it would be great to get your thoughts on the patch as to whether we can submit as is? Thanks in advance.
Let me just try to state this in the most naive way possible to make sure I'm on the same page as you folks:

Basically, you assume that sorts on arithmetic types over contiguous ranges will not be using complicated comparators, and that you'll save more by avoiding branches and performing more calls to `compare` than reducing the number of calls to `compare` but having more branches. However, **if** we have a contiguous range of arithmetic types **but** the `compare` is extremely expensive, this optimization will actually be a pessimization over the naive approach. Is that correct?

If that is correct, then basically in an ideal world, what you'd want would be to enable this optimization only when you know that the cost of the comparator is negligible compared to the cost of branching, since that's what you're reducing with your approach. Of course, that is difficult if not impossible to tell. One approach would be to enable this optimization only when the comparator is something trivial like `std::less` and friends, but IDK if that would prevent this optimization in too many cases.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118029/new/

https://reviews.llvm.org/D118029