[libcxx-commits] [PATCH] D118029: Introduce branchless sorting functions for sort3, sort4 and sort5.
Arthur O'Dwyer via Phabricator via libcxx-commits
libcxx-commits at lists.llvm.org
Thu Feb 24 16:02:09 PST 2022
Quuxplusone added inline comments.
================
Comment at: libcxx/benchmarks/algorithms.bench.cpp:30-31
V() == ValueType::Pair, std::pair<uint32_t, uint32_t>,
- std::conditional_t<V() == ValueType::Tuple,
- std::tuple<uint32_t, uint64_t, uint32_t>,
- std::string> > > >;
+ std::conditional_t< V() == ValueType::Tuple, std::tuple<uint32_t, uint64_t, uint32_t>,
+ std::conditional_t< V() == ValueType::String, std::string, float> > > > >;
----------------
marcogelmi wrote:
> Quuxplusone wrote:
> > marcogelmi wrote:
> > > Quuxplusone wrote:
> > > > philnik wrote:
> > > > >
> > > > It might be (long past) time to change this to simply
> > > > ```
> > > > using Types = std::tuple<
> > > > uint32_t, uint64_t, std::pair<uint32_t, uint64_t>, std::tuple<uint32_t, uint64_t, uint32_t>,
> > > > std::string, float
> > > > >;
> > > >
> > > > template<class V>
> > > > using Value = std::tuple_element_t<(int)V(), Types>;
> > > > ```
> > > note that I'm using clang-format and it's changing the format slightly. Also, I had to change to a plain enum
> > For my curiosity, why did you have to change to a plain enum? (All else being equal, we should prefer `enum class`, but if it really technically doesn't work, then I don't imagine it's a big deal.)
> this is just because we need to cast it to `int` to index into the `Types` tuple and that's more tricky with an `enum class`.
Ahhh, I see now. There's a pair of parens missing from my original code:
```
using Value = std::tuple_element_t<(int)V(), Types>;
```
should have been either
```
using Value = std::tuple_element_t<(int)V()(), Types>; // or, better,
using Value = std::tuple_element_t<(int)V::value, Types>;
```
Use either of those instead, please. (Preferably the latter.)
================
Comment at: libcxx/include/__algorithm/sort.h:160
+inline _LIBCPP_HIDDEN
+ __enable_if_t<!_VSTD::is_integral< typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>
+ __sort3_internal(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3, _Compare __c) {
----------------
Quuxplusone wrote:
> marcogelmi wrote:
> > philnik wrote:
> > > marcogelmi wrote:
> > > > marcogelmi wrote:
> > > > > Quuxplusone wrote:
> > > > > > philnik wrote:
> > > > > > > Is there a reason you restricted this to integrals? Does it pessimize some code? I guessing at least all primitive types should be allowed. I haven't benchmarked, but looking at code generation it seems floating points should also benefit from this change.
> > > > > > `is_scalar` might be relevant.
> > > > > this relies on cmovs, on larger data types it could pessimize the code if it compiles to extra copies instead, or not compile at all if it's not trivially copyable.
> > > > switched to is_arithmetic and added benchmarks here: https://bit.ly/3rgv2r9
> > > Could we generalize it to something like `sizeof(typename iterator_traits<_RandomAccessIterator>::value_type) <= sizeof(size_t) && is_trivially_copyable_v<typename iterator_traits<_RandomAccessIterator>::value_type>`? I'm assuming here that `sizeof(size_t)` is less than or equal to the word size. This would allow this optimization to be performed on something like `std::optional<int>`. Could you check if that still yields a performance improvement? `std::optional` doesn't have a trivial comparison operator, so it might be interesting.
> > so I tried testing it on `optional<int>` and the results are underwhelming (pretty much equivalent on intel skylake and actually slower on ARM - see below).
> >
> > I think this is because the compiler doesn't generate conditional moves, see https://godbolt.org/z/o6648s6f1
> >
> > I wonder if there is a way to ensure it's enabled only for types for which conditional moves will be generated...
> >
> > Results on ARM:
> >
> > ```
> > name old cpu/op new cpu/op delta
> > BM_Sort_optional_Random_1 4.02ns ± 0% 4.01ns ± 1% -0.36% (p=0.000 n=99+98)
> > BM_Sort_optional_Random_3 5.33ns ± 0% 5.46ns ± 0% +2.42% (p=0.000 n=94+93)
> > BM_Sort_optional_Random_4 5.39ns ± 0% 6.17ns ± 0% +14.45% (p=0.000 n=99+99)
> > BM_Sort_optional_Random_5 6.47ns ± 0% 7.35ns ± 0% +13.54% (p=0.000 n=99+98)
> > BM_Sort_optional_Random_16 10.2ns ± 1% 10.8ns ± 2% +5.74% (p=0.000 n=95+100)
> > ```
> >
> FWIW, I think it would be worth pulling out the condition into a named trait, like
> ```
> template<class _Tp>
> using __use_branchless_sort = bool_constant<
> sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
> >;
> ```
> All this focus on `value_type` has recently made me paranoid about `vector<bool>` and proxy iterators in general. I see your `__partially_sorted_swap` doing like 13 calls to `operator*`; are we sure that's still going to be efficient when the `_RandomAccessIterator` in question is an arbitrarily complicated Ranges thing? (We don't support `ranges::sort` yet, but it would be unfortunate if we poured all this effort into `std::sort` and then had to re-evaluate the entire thing for `ranges::sort`.)
>
> Do you actually have a use-case for arbitrary random-access iterators? Could we limit this to //contiguous// iterators?
> ```
> template<class _Iter, class _Tp = typename iterator_traits<_Iter>::value_type>
> using __use_branchless_sort = bool_constant<
> __is_cpp17_contiguous_iterator<_Iter> &&
> sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
> >;
> ```
> Also note I keep trying to expand `arithmetic` to `scalar`. :)
> Regarding `is_arithmetic` vs `is_scalar`: expanding to `is_scalar` would include pointer types (which are likely to have a custom comparator) so we would have to test whether our changes provide similar improvements when you have a complex custom comparator.
Hmm. IOW, you're saying that you expect this optimization to perform well on simple predicates like
```
int is[1000];
Widget *ps[1000];
std::sort(is, is+1000, std::less<>); // A
std::sort(ps, ps+1000, std::less<>); // B
```
but poorly on complicated predicates like
```
Widget ws[10];
std::sort(is, is+1000, [&](int i, int j) { return ws[i] < ws[j]; }); // C
std::sort(ps, ps+1000, [](auto *pi, auto *pj) { return *pi < *pj; }); // D
```
That is, you expect that this optimization, applied across the board, would measurably //speed up// A and B, but measurably //slow down// C and D. And you expect that "pointer-ness" correlates heavily with "complicated-predicate-ness," so A is relatively much more common than C but B is much //less// common than D. So, your patch intentionally applies only to A and C, and excludes B and D. Is that all correct?
I do still think it'd be interesting to find out whether this patch slows down C and D.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D118029/new/
https://reviews.llvm.org/D118029
More information about the libcxx-commits
mailing list