[libcxx-commits] [PATCH] D118029: Introduce branchless sorting functions for sort3, sort4 and sort5.

Arthur O'Dwyer via Phabricator via libcxx-commits libcxx-commits at lists.llvm.org
Fri Feb 11 13:22:52 PST 2022


Quuxplusone added inline comments.


================
Comment at: libcxx/benchmarks/algorithms.bench.cpp:30-31
             V() == ValueType::Pair, std::pair<uint32_t, uint32_t>,
-            std::conditional_t<V() == ValueType::Tuple,
-                               std::tuple<uint32_t, uint64_t, uint32_t>,
-                               std::string> > > >;
+            std::conditional_t< V() == ValueType::Tuple, std::tuple<uint32_t, uint64_t, uint32_t>,
+                                std::conditional_t< V() == ValueType::String, std::string, float> > > > >;
 
----------------
philnik wrote:
> 
It might be (long past) time to change this to simply
```
using Types = std::tuple<
    uint32_t, uint64_t, std::pair<uint32_t, uint64_t>, std::tuple<uint32_t, uint64_t, uint32_t>,
    std::string, float
>;

template<class V>
using Value = std::tuple_element_t<(int)V(), Types>;
```


================
Comment at: libcxx/include/__algorithm/sort.h:187-191
+template <typename _Compare, typename _RandomAccessIterator>
+inline _LIBCPP_HIDE_FROM_ABI
+    __enable_if_t<is_arithmetic<typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>
+    __sort5_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,
+                             _RandomAccessIterator __x4, _RandomAccessIterator __x5, _Compare __c) {
----------------
marcogelmi wrote:
> Quuxplusone wrote:
> > Nit: It's more libc++ style to use `enable_if_t<>* = nullptr` than return type SFINAE; I'd say the advantages are readability and that return type SFINAE greatly increases the length of the mangled name while an extra `void*` type parameter doesn't so much.
> > Drive-by `s/typename/class/` and whitespace.
> > Likewise below.
> > 
> I'm not entirely sure why, but this seems to be failing for the c++03 build (it works fine for the others):, e.g. https://reviews.llvm.org/harbormaster/unit/view/2520185/
> 
> ```
> candidate template ignored: substitution failure [with _Compare = std::__debug_less<NonConstArgCmp>, _RandomAccessIterator = int *]: non-type template argument does not refer to any declaration
> inline _LIBCPP_HIDE_FROM_ABI void __sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2,
> ```
> 
> 
Yuck, I see that on Godbolt (C++03 is unhappy with giving a non-type template parameter a default value). So I guess the return type SFINAE is fine.


================
Comment at: libcxx/include/__algorithm/sort.h:152
+inline _LIBCPP_HIDDEN
+    __enable_if_t<_VSTD::is_integral< typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>
+    __sort3_internal(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3, _Compare __c) {
----------------
marcogelmi wrote:
> Quuxplusone wrote:
> > philnik wrote:
> > > Ditto below
> > And vice versa, `_VSTD::__sort3_internal`. All function calls (that take arguments of user-defined types, anyway) need qualification specifically so that you don't get ADL. Type/variable/concept/... names don't need qualification because they never do ADL.
> > Also, almost certainly `_VSTD::__sort3_internal<_Compare>`, right? We should have a test for "no unnecessary copies of predicates" that catches this, and if it's not failing on this PR then it needs a bit of investigation why not.
> sorry what line of code is this referring to? I thought I used _VSTD::__sort3_internal<_Compare> everywhere I'm calling it.
> sorry what line of code is this referring to?

My best guess is that I misread line 153
```
    __sort3_maybe_branchless(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3,
```
as a function call. :) No action required.


================
Comment at: libcxx/include/__algorithm/sort.h:160
+inline _LIBCPP_HIDDEN
+    __enable_if_t<!_VSTD::is_integral< typename iterator_traits<_RandomAccessIterator>::value_type>::value, void>
+    __sort3_internal(_RandomAccessIterator __x1, _RandomAccessIterator __x2, _RandomAccessIterator __x3, _Compare __c) {
----------------
marcogelmi wrote:
> philnik wrote:
> > marcogelmi wrote:
> > > marcogelmi wrote:
> > > > Quuxplusone wrote:
> > > > > philnik wrote:
> > > > > > Is there a reason you restricted this to integrals? Does it pessimize some code? I guessing at least all primitive types should be allowed. I haven't benchmarked, but looking at code generation it seems floating points should also benefit from this change.
> > > > > `is_scalar` might be relevant.
> > > > this relies on cmovs, on larger data types it could pessimize the code if it compiles to extra copies instead, or not compile at all if it's not trivially copyable.
> > > switched to is_arithmetic and added benchmarks here: https://bit.ly/3rgv2r9
> > Could we generalize it to something like `sizeof(typename iterator_traits<_RandomAccessIterator>::value_type) <= sizeof(size_t) && is_trivially_copyable_v<typename iterator_traits<_RandomAccessIterator>::value_type>`? I'm assuming here that `sizeof(size_t)` is less than or equal to the word size. This would allow this optimization to be performed on something like `std::optional<int>`. Could you check if that still yields a performance improvement? `std::optional` doesn't have a trivial comparison operator, so it might be interesting.
> so I tried testing it on `optional<int>` and the results are underwhelming (pretty much equivalent on intel skylake and actually slower on ARM - see below). 
> 
> I think this is because the compiler doesn't generate conditional moves, see https://godbolt.org/z/o6648s6f1
> 
> I wonder if there is a way to ensure it's enabled only for types for which conditional moves will be generated...
> 
> Results on ARM:
> 
> ```
> name                            old cpu/op  new cpu/op  delta
> BM_Sort_optional_Random_1       4.02ns ± 0%  4.01ns ± 1%   -0.36%   (p=0.000 n=99+98)
> BM_Sort_optional_Random_3       5.33ns ± 0%  5.46ns ± 0%   +2.42%   (p=0.000 n=94+93)
> BM_Sort_optional_Random_4       5.39ns ± 0%  6.17ns ± 0%  +14.45%   (p=0.000 n=99+99)
> BM_Sort_optional_Random_5       6.47ns ± 0%  7.35ns ± 0%  +13.54%   (p=0.000 n=99+98)
> BM_Sort_optional_Random_16      10.2ns ± 1%  10.8ns ± 2%   +5.74%  (p=0.000 n=95+100)
> ```
> 
FWIW, I think it would be worth pulling out the condition into a named trait, like
```
template<class _Tp>
using __use_branchless_sort = bool_constant<
  sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
>;
```
All this focus on `value_type` has recently made me paranoid about `vector<bool>` and proxy iterators in general. I see your `__partially_sorted_swap` doing like 13 calls to `operator*`; are we sure that's still going to be efficient when the `_RandomAccessIterator` in question is an arbitrarily complicated Ranges thing? (We don't support `ranges::sort` yet, but it would be unfortunate if we poured all this effort into `std::sort` and then had to re-evaluate the entire thing for `ranges::sort`.)

Do you actually have a use-case for arbitrary random-access iterators? Could we limit this to //contiguous// iterators?
```
template<class _Iter, class _Tp = typename iterator_traits<_Iter>::value_type>
using __use_branchless_sort = bool_constant<
  __is_cpp17_contiguous_iterator<_Iter> &&
  sizeof(_Tp) <= sizeof(void*) && is_scalar<_Tp>::value
>;
```
Also note I keep trying to expand `arithmetic` to `scalar`. :)


================
Comment at: libcxx/test/libcxx/algorithms/robust_against_copying_comparators.pass.cpp:177
+    (void)std::set_union(first, mid, mid, last, first2, Less<T>(&copies)); assert(copies == 0);
+    (void)std::sort(first, last, Less<T>(&copies)); assert(copies == 0);
+    (void)std::sort_heap(first, last, Less<T>(&copies)); assert(copies == 0);
----------------
IIUC, what you've done here is just add a call to `std::sort` with `value_type=int`, which triggers your new thing, specifically because your new thing is constrained to `arithmetic` value_types instead of (as I keep trying to drive-by it into) `scalar` value_types. (Pointer types are scalar but not arithmetic.)

Is this the only change that was needed in order to trigger your comparator-copying codepath?
If so, I'd slightly rather see you just expand your new thing to include `scalar` types!

I don't really object to the idea of templatizing the whole thing on `T` and testing with different `T`s; but it feels like a lot of churn if the //only// culprit was `arithmetic/scalar`, and also I'm skeptical that that //was// the only culprit here.

I had thought that the problem was more about the shape of the input data — I mean, notice that `is_sorted(first, last)` is already true going into this call.

In particular, if we now know that there are different codepaths taken for arrays of length 3, 4, 5, then shouldn't we be testing here
```
(void)std::sort(first, first+3, Less(&copies)); assert(copies == 0);
(void)std::sort(first, first+4, Less(&copies)); assert(copies == 0);
(void)std::sort(first, first+5, Less(&copies)); assert(copies == 0);
(void)std::sort(first, last, Less(&copies)); assert(copies == 0);
```
?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118029/new/

https://reviews.llvm.org/D118029



More information about the libcxx-commits mailing list