[libcxx-commits] [PATCH] D118029: Introduce branchless sorting functions for sort3, sort4 and sort5.

Fri Mar 4 07:33:29 PST 2022

marcogelmi marked an inline comment as done.
marcogelmi added inline comments.

================
Comment at: libcxx/include/__algorithm/sort.h:160
+inline _LIBCPP_HIDDEN
+    __enable_if_t<!_VSTD::is_integral< typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>
+    __sort3_internal(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3, _Compare __c) {
----------------
Quuxplusone wrote:
> Quuxplusone wrote:
> > marcogelmi wrote:
> > > philnik wrote:
> > > > marcogelmi wrote:
> > > > > marcogelmi wrote:
> > > > > > Quuxplusone wrote:
> > > > > > > philnik wrote:
> > > > > > > > Is there a reason you restricted this to integrals? Does it pessimize some code? I guessing at least all primitive types should be allowed. I haven't benchmarked, but looking at code generation it seems floating points should also benefit from this change.
> > > > > > > `is_scalar` might be relevant.
> > > > > > this relies on cmovs, on larger data types it could pessimize the code if it compiles to extra copies instead, or not compile at all if it's not trivially copyable.
> > > > > switched to is_arithmetic and added benchmarks here: https://bit.ly/3rgv2r9
> > > > Could we generalize it to something like `sizeof(typename iterator_traits<_RandomAccessIterator>::value_type) <= sizeof(size_t) && is_trivially_copyable_v<typename iterator_traits<_RandomAccessIterator>::value_type>`? I'm assuming here that `sizeof(size_t)` is less than or equal to the word size. This would allow this optimization to be performed on something like `std::optional<int>`. Could you check if that still yields a performance improvement? `std::optional` doesn't have a trivial comparison operator, so it might be interesting.
> > > so I tried testing it on `optional<int>` and the results are underwhelming (pretty much equivalent on intel skylake and actually slower on ARM - see below). 
> > > 
> > > I think this is because the compiler doesn't generate conditional moves, see https://godbolt.org/z/o6648s6f1
> > > 
> > > I wonder if there is a way to ensure it's enabled only for types for which conditional moves will be generated...
> > > 
> > > Results on ARM:
> > > 
> > > ```
> > > name                            old cpu/op  new cpu/op  delta
> > > BM_Sort_optional_Random_1       4.02ns ± 0%  4.01ns ± 1%   -0.36%   (p=0.000 n=99+98)
> > > BM_Sort_optional_Random_3       5.33ns ± 0%  5.46ns ± 0%   +2.42%   (p=0.000 n=94+93)
> > > BM_Sort_optional_Random_4       5.39ns ± 0%  6.17ns ± 0%  +14.45%   (p=0.000 n=99+99)
> > > BM_Sort_optional_Random_5       6.47ns ± 0%  7.35ns ± 0%  +13.54%   (p=0.000 n=99+98)
> > > BM_Sort_optional_Random_16      10.2ns ± 1%  10.8ns ± 2%   +5.74%  (p=0.000 n=95+100)
> > > ```
> > > 
> > FWIW, I think it would be worth pulling out the condition into a named trait, like
> > ```
> > template<class _Tp>
> > using __use_branchless_sort = bool_constant<
> >   sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
> > >;
> > ```
> > All this focus on `value_type` has recently made me paranoid about `vector<bool>` and proxy iterators in general. I see your `__partially_sorted_swap` doing like 13 calls to `operator*`; are we sure that's still going to be efficient when the `_RandomAccessIterator` in question is an arbitrarily complicated Ranges thing? (We don't support `ranges::sort` yet, but it would be unfortunate if we poured all this effort into `std::sort` and then had to re-evaluate the entire thing for `ranges::sort`.)
> > 
> > Do you actually have a use-case for arbitrary random-access iterators? Could we limit this to //contiguous// iterators?
> > ```
> > template<class _Iter, class _Tp = typename iterator_traits<_Iter>::value_type>
> > using __use_branchless_sort = bool_constant<
> >   __is_cpp17_contiguous_iterator<_Iter> &&
> >   sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
> > >;
> > ```
> > Also note I keep trying to expand `arithmetic` to `scalar`. :)
> > Regarding `is_arithmetic` vs `is_scalar`: expanding to `is_scalar` would include pointer types (which are likely to have a custom comparator) so we would have to test whether our changes provide similar improvements when you have a complex custom comparator.
> 
> Hmm. IOW, you're saying that you expect this optimization to perform well on simple predicates like
> ```
> int is[1000];
> Widget *ps[1000];
> std::sort(is, is+1000, std::less<>);  // A
> std::sort(ps, ps+1000, std::less<>);  // B
> ```
> but poorly on complicated predicates like
> ```
> Widget ws[10];
> std::sort(is, is+1000, [&](int i, int j) { return ws[i] < ws[j]; });  // C
> std::sort(ps, ps+1000, [](auto *pi, auto *pj) { return *pi < *pj; });  // D
> ```
> That is, you expect that this optimization, applied across the board, would measurably //speed up// A and B, but measurably //slow down// C and D.  And you expect that "pointer-ness" correlates heavily with "complicated-predicate-ness," so A is relatively much more common than C but B is much //less// common than D.  So, your patch intentionally applies only to A and C, and excludes B and D.  Is that all correct?
> I do still think it'd be interesting to find out whether this patch slows down C and D.
You are correct. Our patch applies to A and B and we are pretty certain that our changes will slow down C and D, because our branchless version calls the comparator more times in the average case. This is why we have limited ourselves to A and B. Note that our patch not only applies to ints but also has similar performance improvements for floats. @ldionne it would be great to get your thoughts on the patch as to whether we can submit as is? Thanks in advance.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118029/new/

https://reviews.llvm.org/D118029