[libcxx-commits] [PATCH] D118029: Introduce branchless sorting functions for sort3, sort4 and sort5.

Thu Mar 31 05:01:25 PDT 2022

marcogelmi marked an inline comment as done.
marcogelmi added a comment.

Thanks for your in-depth feedback!

================
Comment at: libcxx/include/__algorithm/sort.h:118
                     swap(*__x1, *__x2);
                     ++__r;
                 }
----------------
ldionne wrote:
> Commenting here for lack of better place: With this patch, I think we need to add tests for `std::sort` for a random access but non-contiguous range, otherwise we may end up testing. And for sorting on non arithmetic types, too. It looks like our `std::sort` tests might be lacking quite a bit, now that I look at them.
I've added tests for non-contiguous iterators (using a `std::deque<int>`) and non-arithmetic types (using a `std::pair<int, int>`). Let me know if there are other tests I'm missing. I've also added more checks using `std::is_permutation` because the current tests only check for `std::is_sorted`

================
Comment at: libcxx/include/__algorithm/sort.h:154-155
+
+template <class _Compare, class _RandomAccessIterator>
+inline _LIBCPP_HIDE_FROM_ABI __enable_if_t<__use_branchless_sort<_RandomAccessIterator>::value, void>
+__sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,
----------------
ldionne wrote:
> We usually use the following idiom for SFINAE (this is supported in C++03 as a Clang extension):
> 
> ```
> template <class _Compare, class _RandomAccessIterator, __enable_if_t<
>     __use_branchless_sort<_RandomAccessIterator>::value
> >* = 0>
> inline _LIBCPP_HIDE_FROM_ABI
> void __sort3_maybe_branchless(...) { ... }
> ```
> 
> It has the benefit that the return type is not hidden in the `__enable_if_t`.
> 
this was mentioned by another reviewer but if I try using this idiom I get failures for c++03: 

```
candidate template ignored: substitution failure [with _Compare = std::__debug_less<NonConstArgCmp>, _RandomAccessIterator = int *]: non-type template argument does not refer to any declaration
inline _LIBCPP_HIDE_FROM_ABI void __sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2,
```

================
Comment at: libcxx/include/__algorithm/sort.h:160
+inline _LIBCPP_HIDDEN
+    __enable_if_t<!_VSTD::is_integral< typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>
+    __sort3_internal(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3, _Compare __c) {
----------------
ldionne wrote:
> marcogelmi wrote:
> > Quuxplusone wrote:
> > > Quuxplusone wrote:
> > > > marcogelmi wrote:
> > > > > philnik wrote:
> > > > > > marcogelmi wrote:
> > > > > > > marcogelmi wrote:
> > > > > > > > Quuxplusone wrote:
> > > > > > > > > philnik wrote:
> > > > > > > > > > Is there a reason you restricted this to integrals? Does it pessimize some code? I guessing at least all primitive types should be allowed. I haven't benchmarked, but looking at code generation it seems floating points should also benefit from this change.
> > > > > > > > > `is_scalar` might be relevant.
> > > > > > > > this relies on cmovs, on larger data types it could pessimize the code if it compiles to extra copies instead, or not compile at all if it's not trivially copyable.
> > > > > > > switched to is_arithmetic and added benchmarks here: https://bit.ly/3rgv2r9
> > > > > > Could we generalize it to something like `sizeof(typename iterator_traits<_RandomAccessIterator>::value_type) <= sizeof(size_t) && is_trivially_copyable_v<typename iterator_traits<_RandomAccessIterator>::value_type>`? I'm assuming here that `sizeof(size_t)` is less than or equal to the word size. This would allow this optimization to be performed on something like `std::optional<int>`. Could you check if that still yields a performance improvement? `std::optional` doesn't have a trivial comparison operator, so it might be interesting.
> > > > > so I tried testing it on `optional<int>` and the results are underwhelming (pretty much equivalent on intel skylake and actually slower on ARM - see below). 
> > > > > 
> > > > > I think this is because the compiler doesn't generate conditional moves, see https://godbolt.org/z/o6648s6f1
> > > > > 
> > > > > I wonder if there is a way to ensure it's enabled only for types for which conditional moves will be generated...
> > > > > 
> > > > > Results on ARM:
> > > > > 
> > > > > ```
> > > > > name                            old cpu/op  new cpu/op  delta
> > > > > BM_Sort_optional_Random_1       4.02ns ± 0%  4.01ns ± 1%   -0.36%   (p=0.000 n=99+98)
> > > > > BM_Sort_optional_Random_3       5.33ns ± 0%  5.46ns ± 0%   +2.42%   (p=0.000 n=94+93)
> > > > > BM_Sort_optional_Random_4       5.39ns ± 0%  6.17ns ± 0%  +14.45%   (p=0.000 n=99+99)
> > > > > BM_Sort_optional_Random_5       6.47ns ± 0%  7.35ns ± 0%  +13.54%   (p=0.000 n=99+98)
> > > > > BM_Sort_optional_Random_16      10.2ns ± 1%  10.8ns ± 2%   +5.74%  (p=0.000 n=95+100)
> > > > > ```
> > > > > 
> > > > FWIW, I think it would be worth pulling out the condition into a named trait, like
> > > > ```
> > > > template<class _Tp>
> > > > using __use_branchless_sort = bool_constant<
> > > >   sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
> > > > >;
> > > > ```
> > > > All this focus on `value_type` has recently made me paranoid about `vector<bool>` and proxy iterators in general. I see your `__partially_sorted_swap` doing like 13 calls to `operator*`; are we sure that's still going to be efficient when the `_RandomAccessIterator` in question is an arbitrarily complicated Ranges thing? (We don't support `ranges::sort` yet, but it would be unfortunate if we poured all this effort into `std::sort` and then had to re-evaluate the entire thing for `ranges::sort`.)
> > > > 
> > > > Do you actually have a use-case for arbitrary random-access iterators? Could we limit this to //contiguous// iterators?
> > > > ```
> > > > template<class _Iter, class _Tp = typename iterator_traits<_Iter>::value_type>
> > > > using __use_branchless_sort = bool_constant<
> > > >   __is_cpp17_contiguous_iterator<_Iter> &&
> > > >   sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
> > > > >;
> > > > ```
> > > > Also note I keep trying to expand `arithmetic` to `scalar`. :)
> > > > Regarding `is_arithmetic` vs `is_scalar`: expanding to `is_scalar` would include pointer types (which are likely to have a custom comparator) so we would have to test whether our changes provide similar improvements when you have a complex custom comparator.
> > > 
> > > Hmm. IOW, you're saying that you expect this optimization to perform well on simple predicates like
> > > ```
> > > int is[1000];
> > > Widget *ps[1000];
> > > std::sort(is, is+1000, std::less<>);  // A
> > > std::sort(ps, ps+1000, std::less<>);  // B
> > > ```
> > > but poorly on complicated predicates like
> > > ```
> > > Widget ws[10];
> > > std::sort(is, is+1000, [&](int i, int j) { return ws[i] < ws[j]; });  // C
> > > std::sort(ps, ps+1000, [](auto *pi, auto *pj) { return *pi < *pj; });  // D
> > > ```
> > > That is, you expect that this optimization, applied across the board, would measurably //speed up// A and B, but measurably //slow down// C and D.  And you expect that "pointer-ness" correlates heavily with "complicated-predicate-ness," so A is relatively much more common than C but B is much //less// common than D.  So, your patch intentionally applies only to A and C, and excludes B and D.  Is that all correct?
> > > I do still think it'd be interesting to find out whether this patch slows down C and D.
> > You are correct. Our patch applies to A and B and we are pretty certain that our changes will slow down C and D, because our branchless version calls the comparator more times in the average case. This is why we have limited ourselves to A and B. Note that our patch not only applies to ints but also has similar performance improvements for floats. @ldionne it would be great to get your thoughts on the patch as to whether we can submit as is? Thanks in advance.
> Let me just try to state this in the most naive way possible to make sure I'm on the same page as you folks:
> 
> Basically, you assume that sorts on arithmetic types over contiguous ranges will not be using complicated comparators, and that you'll save more by avoiding branches and performing more calls to `compare` than reducing the number of calls to `compare` but having more branches. However, **if** we have a contiguous range of arithmetic types **but** the `compare` is extremely expensive, this optimization will actually be a pessimization over the naive approach. Is that correct?
> 
> If that is correct, then basically in an ideal world, what you'd want would be to enable this optimization only when you know that the cost of the comparator is negligible compared to the cost of branching, since that's what you're reducing with your approach. Of course, that is difficult if not impossible to tell. One approach would be to enable this optimization only when the comparator is something trivial like `std::less` and friends, but IDK if that would prevent this optimization in too many cases.
Yes your understanding is correct. After discussing your feedback amongst ourselves, we think that you had a great suggestion to enable our patch for simple comparators.

Note that I've had to enable it for `__less<_Tp>&` because that seems to be the default when nothing is passed in. Also, the comparator seems to be transformed into a reference using `__comp_ref_type` so that's why it's templated on the reference type.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118029/new/

https://reviews.llvm.org/D118029