[libcxx-commits] [PATCH] D118029: Introduce branchless sorting functions for sort3, sort4 and sort5.

Thu Feb 24 16:02:09 PST 2022

Quuxplusone added inline comments.

================
Comment at: libcxx/benchmarks/algorithms.bench.cpp:30-31
             V() == ValueType::Pair, std::pair<uint32_t, uint32_t>,
-            std::conditional_t<V() == ValueType::Tuple,
-                               std::tuple<uint32_t, uint64_t, uint32_t>,
-                               std::string> > > >;
+            std::conditional_t< V() == ValueType::Tuple, std::tuple<uint32_t, uint64_t, uint32_t>,
+                                std::conditional_t< V() == ValueType::String, std::string, float> > > > >;

----------------
marcogelmi wrote:
> Quuxplusone wrote:
> > marcogelmi wrote:
> > > Quuxplusone wrote:
> > > > philnik wrote:
> > > > > 
> > > > It might be (long past) time to change this to simply
> > > > ```
> > > > using Types = std::tuple<
> > > >     uint32_t, uint64_t, std::pair<uint32_t, uint64_t>, std::tuple<uint32_t, uint64_t, uint32_t>,
> > > >     std::string, float
> > > > >;
> > > > 
> > > > template<class V>
> > > > using Value = std::tuple_element_t<(int)V(), Types>;
> > > > ```
> > > note that I'm using clang-format and it's changing the format slightly. Also, I had to change to a plain enum
> > For my curiosity, why did you have to change to a plain enum? (All else being equal, we should prefer `enum class`, but if it really technically doesn't work, then I don't imagine it's a big deal.)
> this is just because we need to cast it to `int` to index into the `Types` tuple and that's more tricky with an `enum class`.
Ahhh, I see now. There's a pair of parens missing from my original code:
```
using Value = std::tuple_element_t<(int)V(), Types>;
```
should have been either
```
using Value = std::tuple_element_t<(int)V()(), Types>;  // or, better,
using Value = std::tuple_element_t<(int)V::value, Types>;
```
Use either of those instead, please. (Preferably the latter.)

================
Comment at: libcxx/include/__algorithm/sort.h:160
+inline _LIBCPP_HIDDEN
+    __enable_if_t<!_VSTD::is_integral< typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>
+    __sort3_internal(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3, _Compare __c) {
----------------
Quuxplusone wrote:
> marcogelmi wrote:
> > philnik wrote:
> > > marcogelmi wrote:
> > > > marcogelmi wrote:
> > > > > Quuxplusone wrote:
> > > > > > philnik wrote:
> > > > > > > Is there a reason you restricted this to integrals? Does it pessimize some code? I guessing at least all primitive types should be allowed. I haven't benchmarked, but looking at code generation it seems floating points should also benefit from this change.
> > > > > > `is_scalar` might be relevant.
> > > > > this relies on cmovs, on larger data types it could pessimize the code if it compiles to extra copies instead, or not compile at all if it's not trivially copyable.
> > > > switched to is_arithmetic and added benchmarks here: https://bit.ly/3rgv2r9
> > > Could we generalize it to something like `sizeof(typename iterator_traits<_RandomAccessIterator>::value_type) <= sizeof(size_t) && is_trivially_copyable_v<typename iterator_traits<_RandomAccessIterator>::value_type>`? I'm assuming here that `sizeof(size_t)` is less than or equal to the word size. This would allow this optimization to be performed on something like `std::optional<int>`. Could you check if that still yields a performance improvement? `std::optional` doesn't have a trivial comparison operator, so it might be interesting.
> > so I tried testing it on `optional<int>` and the results are underwhelming (pretty much equivalent on intel skylake and actually slower on ARM - see below). 
> > 
> > I think this is because the compiler doesn't generate conditional moves, see https://godbolt.org/z/o6648s6f1
> > 
> > I wonder if there is a way to ensure it's enabled only for types for which conditional moves will be generated...
> > 
> > Results on ARM:
> > 
> > ```
> > name                            old cpu/op  new cpu/op  delta
> > BM_Sort_optional_Random_1       4.02ns ± 0%  4.01ns ± 1%   -0.36%   (p=0.000 n=99+98)
> > BM_Sort_optional_Random_3       5.33ns ± 0%  5.46ns ± 0%   +2.42%   (p=0.000 n=94+93)
> > BM_Sort_optional_Random_4       5.39ns ± 0%  6.17ns ± 0%  +14.45%   (p=0.000 n=99+99)
> > BM_Sort_optional_Random_5       6.47ns ± 0%  7.35ns ± 0%  +13.54%   (p=0.000 n=99+98)
> > BM_Sort_optional_Random_16      10.2ns ± 1%  10.8ns ± 2%   +5.74%  (p=0.000 n=95+100)
> > ```
> > 
> FWIW, I think it would be worth pulling out the condition into a named trait, like
> ```
> template<class _Tp>
> using __use_branchless_sort = bool_constant<
>   sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
> >;
> ```
> All this focus on `value_type` has recently made me paranoid about `vector<bool>` and proxy iterators in general. I see your `__partially_sorted_swap` doing like 13 calls to `operator*`; are we sure that's still going to be efficient when the `_RandomAccessIterator` in question is an arbitrarily complicated Ranges thing? (We don't support `ranges::sort` yet, but it would be unfortunate if we poured all this effort into `std::sort` and then had to re-evaluate the entire thing for `ranges::sort`.)
> 
> Do you actually have a use-case for arbitrary random-access iterators? Could we limit this to //contiguous// iterators?
> ```
> template<class _Iter, class _Tp = typename iterator_traits<_Iter>::value_type>
> using __use_branchless_sort = bool_constant<
>   __is_cpp17_contiguous_iterator<_Iter> &&
>   sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
> >;
> ```
> Also note I keep trying to expand `arithmetic` to `scalar`. :)
> Regarding `is_arithmetic` vs `is_scalar`: expanding to `is_scalar` would include pointer types (which are likely to have a custom comparator) so we would have to test whether our changes provide similar improvements when you have a complex custom comparator.

Hmm. IOW, you're saying that you expect this optimization to perform well on simple predicates like
```
int is[1000];
Widget *ps[1000];
std::sort(is, is+1000, std::less<>);  // A
std::sort(ps, ps+1000, std::less<>);  // B
```
but poorly on complicated predicates like
```
Widget ws[10];
std::sort(is, is+1000, [&](int i, int j) { return ws[i] < ws[j]; });  // C
std::sort(ps, ps+1000, [](auto *pi, auto *pj) { return *pi < *pj; });  // D
```
That is, you expect that this optimization, applied across the board, would measurably //speed up// A and B, but measurably //slow down// C and D.  And you expect that "pointer-ness" correlates heavily with "complicated-predicate-ness," so A is relatively much more common than C but B is much //less// common than D.  So, your patch intentionally applies only to A and C, and excludes B and D.  Is that all correct?
I do still think it'd be interesting to find out whether this patch slows down C and D.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118029/new/

https://reviews.llvm.org/D118029