[libcxx-commits] [PATCH] D151717: [libc++][PSTL] Add a GCD backend

Louis Dionne via Phabricator via libcxx-commits libcxx-commits at lists.llvm.org
Thu Jun 15 14:02:15 PDT 2023


philnik created this revision.
Herald added a reviewer: ldionne.
Herald added a project: All.
philnik updated this revision to Diff 527088.
philnik marked 2 inline comments as done.
philnik added a comment.
philnik updated this revision to Diff 527581.
philnik marked 5 inline comments as done.
h-vetinari added a reviewer: nadiasvertex.
philnik updated this revision to Diff 529678.
philnik marked 3 inline comments as done.
philnik updated this revision to Diff 531780.
ldionne published this revision for review.
Herald added a project: libc++.
Herald added a subscriber: libcxx-commits.
Herald added a reviewer: libc++.

Address comments


philnik added a comment.

Address comments


h-vetinari added a comment.

Previous effort at this, for cross-reference: https://reviews.llvm.org/D120186


philnik added a comment.

Address comments


philnik added a comment.

Rebased



================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:33-34
+
+_LIBCPP_FUNC_VIS void
+__dispatch_apply(size_t __chunk_count, void* __context, void (*__func)(void* __context, size_t __chunk)) noexcept;
+
----------------



================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:54
+_LIBCPP_HIDE_FROM_ABI __chunk_partitions
+__partition_chunks(_RandomAccessIterator __first, _RandomAccessIterator __last) {
+  const auto __requested_chunk_size = __default_chunk_size;
----------------
I would like us to specifically test the various functions we use in the backend, because otherwise we don't have any testing for them. So for example, let's create a libc++-specific test that checks the implementation of `__partition_chunks`.
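
A rough sketch of what that could look like (the test location, the namespace qualification, and the exact covering invariant are assumptions on my part):

```
// Hypothetical libc++-internal test, e.g. somewhere under libcxx/test/libcxx/:
// checks that the chunks produced by __partition_chunks cover the input
// range exactly once.
#include <__algorithm/pstl_backends/cpu_backends/libdispatch.h>
#include <cassert>
#include <cstddef>
#include <vector>

int main(int, char**) {
  for (std::size_t n : {1u, 999u, 1024u, 100'000u}) {
    std::vector<int> vec(n);
    // Namespace assumed -- adjust to wherever __partition_chunks ends up.
    auto parts = std::__libdispatch::__partition_chunks(vec.begin(), vec.end());
    // Assumed invariant: one first chunk plus (count - 1) regular chunks
    // together cover exactly n elements.
    assert(parts.__first_chunk_size_ + (parts.__chunk_count_ - 1) * parts.__chunk_size_ == n);
  }
  return 0;
}
```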


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:89-90
+_LIBCPP_HIDE_FROM_ABI void __parallel_for(_RandomAccessIterator __first, _RandomAccessIterator __last, _Functor __f) {
+  // Create an executor over the range that will execute f in
+  // chunks over the range.
+
----------------
This comment doesn't seem relevant anymore; we don't "create an executor".


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:120
+          typename _LeafMerge>
+_LIBCPP_HIDE_FROM_ABI void __parallel_merge_body(
+    std::size_t __size_x,
----------------
There is some non-trivial logic in here that we could get wrong. For example, if we were not to use `upper_bound` below to figure out the end of the `__xm` range to merge, we could get correctness issues -- this is clearly not only about performance.

We could either test this by adding GCD-specific tests for the backend, or by adding `std/` tests for e.g. `std::merge(std::execution::par, ...)` that are specially crafted to trigger these issues. I don't care very strongly which one we go for, but we should have something, because as it stands all of our tests would pass even if the logic in here were incorrect (our tests always use inputs smaller than the chunk size).

@philnik mentioned some sort of "fuzzing" test during live review, and I think I agree this would make sense.
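
As a starting point, a `std/` test along these lines would at least exercise the parallel splitting path (the input size is a guess at "comfortably larger than the chunk size"):

```
// Hypothetical std/ test: merge two large sorted ranges with
// std::execution::par and compare against the sequential result.
#include <algorithm>
#include <cassert>
#include <execution>
#include <random>
#include <vector>

int main(int, char**) {
  std::mt19937 gen(42);
  std::uniform_int_distribution<int> dist(0, 1000);

  std::vector<int> x(500'000), y(500'000); // much larger than any plausible chunk size
  for (auto& e : x) e = dist(gen);
  for (auto& e : y) e = dist(gen);
  std::sort(x.begin(), x.end());
  std::sort(y.begin(), y.end());

  std::vector<int> par_result(x.size() + y.size());
  std::vector<int> seq_result(x.size() + y.size());
  std::merge(std::execution::par, x.begin(), x.end(), y.begin(), y.end(), par_result.begin());
  std::merge(x.begin(), x.end(), y.begin(), y.end(), seq_result.begin());
  assert(par_result == seq_result);
  return 0;
}
```

A "fuzzing" variant could then loop this over random sizes and seeds.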


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:121-122
+_LIBCPP_HIDE_FROM_ABI void __parallel_merge_body(
+    std::size_t __size_x,
+    std::size_t __size_y,
+    _RandomAccessIterator1 __first1,
----------------
I don't think we need to pass the `__size_foo` variables around; we can compute them from the iterators instead, since we have random access iterators.


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:135-136
+
+  _RandomAccessIterator1 __xm;
+  _RandomAccessIterator2 __ym;
+
----------------
Maybe give these more descriptive names? Especially if we drop `__size_x` and `__size_y`, `x` and `y` will lose their meaning.


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:142-143
+  } else {
+    __xm = __first1 + (__size_x / 2);
+    __ym = std::lower_bound(__first2, __last2, *__xm, __comp);
+  }
----------------
Is there any reason why we use `lower_bound` here?

@nadiasvertex Would you happen to remember why you did it that way in the original GCD backend implementation?


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:148
+
+  __libdispatch::__libdispatch_invoke_parallel(
+      [__first1, __xm, __first2, __ym, __result, __comp, &__leaf_merge]() {
----------------
This implementation is quite interesting. At a high level, we basically figure out how to chunk up the ranges as we're going, and we spawn tasks two-by-two every time. My understanding is that this kind of "tree of computation" pattern is not what libdispatch excels at -- instead I think it is better to give it a "finalized" number of tasks you want to execute all at once.

If that is correct, then going for a different algorithm where we instead figure out the chunking upfront *and then* dispatch it all at once could be a win. One way to do this could be to accumulate the work items (i.e. the arguments you pass to `__leaf_merge` above) in a `std::vector<_WorkItem>`, and then call `dispatch_apply_f` on that. We're allowed to allocate in these algorithms, so that should be an acceptable approach. Since the number of work items is roughly `n / __default_chunk_size`, i.e. linear in the input size, it could become quite large, perhaps making it important to call `dispatch_apply_f` only once. Unfortunately, it is also difficult to determine the number of work items up front, so we would likely need to allocate within our `std::vector` almost all the time (i.e. tricks like `llvm::SmallVector` likely don't apply here).

I think the first step to resolve this comment is to figure out whether the premise that libdispatch doesn't handle those trees well is true or not. If not, then the current approach is probably the right one.
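
For concreteness, here is a rough sketch of the flattened approach (`_MergeRange` and `__collect_merge_ranges` are hypothetical helpers, and I'm using the callable-taking `__dispatch_apply` the way it's used elsewhere in this patch):

```
template <class _Iter1, class _Iter2, class _OutIter, class _Compare, class _LeafMerge>
_LIBCPP_HIDE_FROM_ABI void
__parallel_merge_flat(_Iter1 __first1, _Iter1 __last1, _Iter2 __first2, _Iter2 __last2,
                      _OutIter __result, _Compare __comp, _LeafMerge __leaf_merge) {
  struct _MergeRange {
    _Iter1 __f1, __l1;
    _Iter2 __f2, __l2;
    _OutIter __out;
  };

  // Split recursively exactly like __parallel_merge_body does today (median
  // of one range, binary search into the other), but collect the leaves into
  // a vector instead of spawning tasks two-by-two. Hypothetical helper.
  std::vector<_MergeRange> __items;
  __collect_merge_ranges(__first1, __last1, __first2, __last2, __result, __comp, __items);

  // Hand all the work items to libdispatch in one go.
  __libdispatch::__dispatch_apply(__items.size(), [&](std::size_t __i) {
    _MergeRange& __r = __items[__i];
    __leaf_merge(__r.__f1, __r.__l1, __r.__f2, __r.__l2, __r.__out, __comp);
  });
}
```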


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:191-192
+
+  if (__values == nullptr)
+    std::__throw_pstl_bad_alloc();
+
----------------
We need a test for this case -- i.e. we want to make sure that we throw an exception (and the right one) if we fail to allocate memory in the implementation of the GCD backend.
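
One way to write such a test could be to replace the global nothrow aligned `operator new` so it can be made to fail on demand (a sketch; `__pstl_bad_alloc` is the exception type added in this patch, and I'm assuming the backend's allocation goes through this overload and that the type is reachable from `<new>`):

```
#include <atomic>
#include <cassert>
#include <execution>
#include <functional>
#include <new>
#include <numeric>
#include <vector>

static std::atomic<bool> fail_aligned_nothrow_new{false};

// Replace the allocation function the backend presumably uses (nothrow, aligned).
void* operator new(std::size_t size, std::align_val_t align, const std::nothrow_t&) noexcept {
  if (fail_aligned_nothrow_new)
    return nullptr;
  try {
    return ::operator new(size, align);
  } catch (...) {
    return nullptr;
  }
}

int main(int, char**) {
  std::vector<int> v(100'000, 1); // large enough that the backend actually allocates
  fail_aligned_nothrow_new = true;
  bool threw = false;
  try {
    (void)std::transform_reduce(std::execution::par, v.begin(), v.end(), 0, std::plus<>{},
                                [](int x) { return x; });
  } catch (const std::__pstl_bad_alloc&) {
    threw = true;
  }
  fail_aligned_nothrow_new = false;
  assert(threw);
  return 0;
}
```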


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:162
+          typename _LeafMerge>
+_LIBCPP_HIDE_FROM_ABI void __parallel_merge(
+    _RandomAccessIterator1 __xs,
----------------
If the only difference between `__parallel_merge` and `__parallel_merge_body` is that we're passing the size, then `__parallel_merge` could be implemented as a simple call to `__parallel_merge_body(__xe - __xs, __ye - __ys)`.
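
i.e. something like this (a sketch; I'm assuming the parameters of `__parallel_merge_body` after the two sizes mirror those of `__parallel_merge`):

```
template <typename _RandomAccessIterator1, typename _RandomAccessIterator2,
          typename _RandomAccessIterator3, typename _Compare, typename _LeafMerge>
_LIBCPP_HIDE_FROM_ABI void __parallel_merge(
    _RandomAccessIterator1 __xs, _RandomAccessIterator1 __xe,
    _RandomAccessIterator2 __ys, _RandomAccessIterator2 __ye,
    _RandomAccessIterator3 __result, _Compare __comp, _LeafMerge __leaf_merge) {
  // Compute the sizes from the random access iterators and forward everything.
  __parallel_merge_body(__xe - __xs, __ye - __ys, __xs, __xe, __ys, __ye,
                        __result, __comp, __leaf_merge);
}
```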


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:211-212
+  auto __partitions = __libdispatch::__partition_chunks(__first, __last);
+  auto __deleter = [&](char* __ptr) { std::destroy_n(reinterpret_cast<_Value*>(__ptr), __partitions.__chunk_count_); };
+  unique_ptr<char, decltype(__deleter)> __values(
+      ::operator new(__partitions.__chunk_count_, align_val_t{alignof(_Value)}, nothrow), __deleter);
----------------
I think we should introduce a helper class for this. It seems like something we might have other uses for. Something like:

```
template <class _Tp, class _Alloc = std::allocator<_Tp>>
struct __uninitialized_buffer {
  explicit __uninitialized_buffer(size_t __n);
  _Tp* __get();
  ~__uninitialized_buffer(); // destroys the elements and frees the memory
};
```
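
A minimal sketch of how that could be implemented (dropping the allocator parameter and using the throwing aligned `operator new` for simplicity; note that the destructor assumes every element was constructed, which ties into the exception-safety discussion further down):

```
template <class _Tp>
struct __uninitialized_buffer {
  explicit __uninitialized_buffer(size_t __n)
      : __data_(static_cast<_Tp*>(::operator new(__n * sizeof(_Tp), align_val_t{alignof(_Tp)}))),
        __size_(__n) {}

  __uninitialized_buffer(const __uninitialized_buffer&)            = delete;
  __uninitialized_buffer& operator=(const __uninitialized_buffer&) = delete;

  _Tp* __get() const { return __data_; }

  // Destroys the elements and frees the memory. Assumes all __size_ elements
  // were actually constructed by the user of the buffer.
  ~__uninitialized_buffer() {
    std::destroy_n(__data_, __size_);
    ::operator delete(__data_, align_val_t{alignof(_Tp)});
  }

private:
  _Tp* __data_;
  size_t __size_;
};
```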


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:218
+
+  auto __guard = std::__make_exception_guard([]() { std::terminate(); });
+  __libdispatch::__dispatch_apply(__partitions.__chunk_count_, [&](size_t __chunk) {
----------------
This totally deserves a comment explaining why we're terminating here, when normally we always terminate from `terminate_on_exception`. Actually, we could also use `terminate_on_exception` here instead of `exception_guard`.


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:43
+struct __chunk_partitions {
+  size_t __chunk_count_;
+  size_t __chunk_size_;
----------------
If it does.


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:54-56
+  auto& __chunk_size          = __partitions.__chunk_size_;
+  auto& __first_chunk_size    = __partitions.__first_chunk_size_;
+  auto& __n_chunks            = __partitions.__chunk_count_;
----------------
This seems a bit convoluted to me. Either set `__partitions.__x` directly or use local variables and then return `__chunk_partitions{_vars...}`?
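
i.e. roughly this shape (the partitioning arithmetic below is purely illustrative, and the member order is assumed from the struct definition in this patch):

```
template <class _RandomAccessIterator>
_LIBCPP_HIDE_FROM_ABI __chunk_partitions
__partition_chunks(_RandomAccessIterator __first, _RandomAccessIterator __last) {
  const auto __n = static_cast<size_t>(__last - __first);
  // Illustrative arithmetic only -- the point is the shape: plain locals,
  // then a single aggregate-initialized return.
  const size_t __n_chunks         = __n < __default_chunk_size ? 1 : __n / __default_chunk_size;
  const size_t __chunk_size       = __n / __n_chunks;
  const size_t __first_chunk_size = __chunk_size + __n % __n_chunks;
  return __chunk_partitions{__n_chunks, __chunk_size, __first_chunk_size};
}
```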


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:59
+
+  const _Size __first_chunk_size_ = __last - __first;
+  if (__first_chunk_size_ < __requested_chunk_size) {
----------------
This variable name is super confusing with the above!

Also I would probably name it something like `__n`, which makes it clear that this is the total number of elements in the range.


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:107
+template <class _Func1, class _Func2>
+_LIBCPP_HIDE_FROM_ABI void __parallel_invoke(_Func1&& __f1, _Func2&& __f2) {
+  __libdispatch::__dispatch_apply(2, [&](size_t __index) {
----------------
I would rename this to avoid `__parallel` in the name since it makes it look like a CPU backend customization point, when it's not.


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:212
+  auto __partitions = __libdispatch::__partition_chunks(__first, __last);
+  unique_ptr<_Value> __values(::operator new(__partitions.__chunk_count_, align_val_t{alignof(_Value)}, nothrow));
+  if (__values == nullptr)
----------------
Otherwise this code doesn't work as-is.

We also discussed using `unique_ptr<optional<_Value>[]>` which would destroy objects correctly in case an exception is thrown while constructing values below. This potentially has a non-negligible memory overhead. No conclusion yet, but I think I would like to either make this code locally correct or have a nice way of expressing the fact that we don't care about running the destructors of `_Value` in case of an exception because we call terminate anyway.


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:48
+template <class _RandomAccessIterator>
+__chunk_partitions __partition_chunks(_RandomAccessIterator __first, _RandomAccessIterator __last) {
+  using _Size = __iter_diff_t<_RandomAccessIterator>;
----------------
`HIDE_FROM_ABI` here and below


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:104-112
+template <class _Func1, class _Func2>
+void __parallel_invoke(_Func1&& __f1, _Func2&& __f2) {
+  __libdispatch::__dispatch_apply(2, [&](size_t __index) {
+    if (__index == 0)
+      __f1();
+    else
+      __f2();
----------------
If this isn't part of our API right now, I wouldn't provide it. I'd add it once we need it.


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/transform_reduce.h:166
           std::move(__init),
-          std::move(__reduce),
-          [=](_ForwardIterator __brick_first, _ForwardIterator __brick_last, _Tp __brick_init) {
+          __reduce,
+          [=](auto __brick_first, auto __brick_last, _Tp __brick_init) {
----------------
Can we add a test that fails here? This is a double-move issue that was present in the code before this patch. I think it would make sense to split this one off.
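
One way to write such a test: give the reduction functor state that is visibly invalidated by a move, so any use-after-move trips an assertion (a sketch; whether it actually fires depends on where the implementation moves the functor):

```
#include <cassert>
#include <execution>
#include <memory>
#include <numeric>
#include <vector>

struct TrackedPlus {
  // Non-null in a valid functor; a moved-from shared_ptr is guaranteed null.
  std::shared_ptr<int> token = std::make_shared<int>(0);
  int operator()(int a, int b) const {
    assert(token != nullptr && "reduction functor invoked after being moved from");
    return a + b;
  }
};

int main(int, char**) {
  std::vector<int> v(100'000, 1); // large enough to take the parallel path
  int result = std::transform_reduce(std::execution::par, v.begin(), v.end(), 0,
                                     TrackedPlus{}, [](int x) { return x; });
  assert(result == 100'000);
  return 0;
}
```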


================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/transform_reduce.h:167
+          __reduce,
+          [=](auto __brick_first, auto __brick_last, _Tp __brick_init) {
             return std::__pstl_transform_reduce<__remove_parallel_policy_t<_ExecutionPolicy>>(
----------------
Let's capture by name since this was *so* not obvious.


================
Comment at: libcxx/src/new.cpp:51-53
+const char* __pstl_bad_alloc::what() const noexcept {
+  return "PSTL memory allocation failed";
+}
----------------
This should be in `libc++experimental.a` for now instead.


================
Comment at: libcxx/src/pstl/libdispatch.cpp:1
+//===----------------------------------------------------------------------===//
+//
----------------
We need to figure out how we're going to test the backend. Tests should be added in this patch.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D151717

Files:
  libcxx/CMakeLists.txt
  libcxx/include/CMakeLists.txt
  libcxx/include/__algorithm/pstl_backend.h
  libcxx/include/__algorithm/pstl_backends/cpu_backends/backend.h
  libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h
  libcxx/include/__algorithm/pstl_backends/cpu_backends/transform_reduce.h
  libcxx/include/__config_site.in
  libcxx/include/__utility/terminate_on_exception.h
  libcxx/include/new
  libcxx/src/CMakeLists.txt
  libcxx/src/new.cpp
  libcxx/src/pstl/libdispatch.cpp
