[libcxx-commits] [PATCH] D151717: [libc++][PSTL] Add a GCD backend
Louis Dionne via Phabricator via libcxx-commits
libcxx-commits at lists.llvm.org
Tue Jun 27 12:04:21 PDT 2023
ldionne added inline comments.
================
Comment at: libcxx/include/__algorithm/pstl_backends/cpu_backends/libdispatch.h:158-172
+#ifndef _LIBCPP_HAS_NO_EXCEPTIONS
+ try {
+#endif
+ auto __ranges = __libdispatch::__calculate_merge_ranges(__first1, __last1, __first2, __last2, __result, __comp);
+
+ // __dispatch_apply is noexcept, so we don't convert any user exceptions
+ __libdispatch::__dispatch_apply(__ranges.size(), [&](size_t __index) {
----------------
ldionne wrote:
> IMO this captures better the fact that we are only catching exceptions in `__libdispatch::__calculate_merge_ranges`:
>
> ```
> pmr::vector<...> __ranges;
> try {
>   __ranges = __calculate(...);
> } catch (...) {
>   throw __pstl_bad_alloc();
> }
>
> __libdispatch::__dispatch_apply(...);
> ```
>
> This makes it more obvious that we let `__dispatch_apply` terminate if an exception is thrown.
>
> Edit: This might not work if the allocator doesn't propagate.
```
pmr::vector<...> __ranges = [&] {
  try {
    return __libdispatch::__calculate_merge_ranges(__first1, __last1, __first2, __last2, __result, __comp);
  } catch (...) {
    throw __pstl_bad_alloc();
  }
}();
```
Still debating whether that's overkill, but it works around the allocator propagation issue.
After consideration, though, I think we should leave the code as-is.
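For reference, here is my understanding of the propagation issue in standalone form (illustrative code only, not part of the patch):
```
#include <memory_resource>
#include <vector>

std::pmr::monotonic_buffer_resource __res;

std::pmr::vector<int> __make() { return std::pmr::vector<int>({1, 2, 3}, &__res); }

void __two_step() {
  std::pmr::vector<int> __v; // uses get_default_resource()
  __v = __make();            // polymorphic_allocator doesn't propagate on assignment and the
                             // allocators compare unequal, so the buffer can't be adopted:
                             // the elements are re-allocated from the default resource
}

void __one_step() {
  std::pmr::vector<int> __v = __make(); // guaranteed elision: __v keeps using &__res
}
```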
================
Comment at: libcxx/include/__utility/terminate_on_exception.h:30-31
return __func();
+ } catch (const __pstl_bad_alloc&) {
+ throw;
} catch (...) {
----------------
I'm not sure this is correct as-is. Let's say we have a predicate passed to `find_if` that happens to use `__pstl_merge` in its implementation, and let's say the "setup" code for `__pstl_merge` fails to allocate and throws `__pstl_bad_alloc`. It will be caught by `find_if`'s `__terminate_on_exception` wrapper and be rethrown, right? If so, then we'd be incorrectly propagating the exception out of `find_if`.
Instead, I think we might want to use `__terminate_on_exception` in the backend implementation *only* around user code (not setup code), so we'd remove it from the various functions that take `__cpu_backend_tag`. It would also not need to handle `__pstl_bad_alloc` anymore because it would never wrap "setup" code.
This should be done in a separate patch, and we'll likely need some tests as well.
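Roughly what I have in mind, as a sketch only (reusing the names from this patch; the exact shape would be worked out in that follow-up):
```
// Setup code is *not* wrapped: if __calculate_merge_ranges fails to allocate,
// the resulting __pstl_bad_alloc propagates to the caller unchanged.
auto __ranges = __libdispatch::__calculate_merge_ranges(__first1, __last1, __first2, __last2, __result, __comp);

// Only user code is wrapped: if an exception is thrown by __comp, we terminate
// instead of letting it escape the noexcept __dispatch_apply callback.
std::__terminate_on_exception([&] {
  __libdispatch::__dispatch_apply(__ranges.size(), [&](size_t __index) {
    // ... merge the __index-th range using __comp ...
  });
});
```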
================
Comment at: libcxx/src/CMakeLists.txt:318-319
+if (LIBCXX_PSTL_CPU_BACKEND STREQUAL "libdispatch")
+ set(LIBCXX_EXPERIMENTAL_SOURCES ${LIBCXX_EXPERIMENTAL_SOURCES}
+ pstl/libdispatch.cpp)
+endif()
----------------
================
Comment at: libcxx/src/pstl/libdispatch.cpp:11
+#include <__config>
+#include <dispatch/dispatch.h>
+#include <memory_resource>
----------------
We might want to move this system header below the library's own includes; I think that's what we usually do.
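i.e. something like this (going only by the includes visible in this hunk):
```
#include <__config>
#include <memory_resource>

#include <dispatch/dispatch.h>
```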
================
Comment at: libcxx/src/pstl/libdispatch.cpp:28-29
+
+__chunk_partitions __partition_chunks(ptrdiff_t element_count) {
+ const auto requested_chunk_size = __default_chunk_size / thread::hardware_concurrency();
+
----------------
Let's say we want 3x to 100x as many "tasks" as we have cores. Let's say for now that we always want 50x as many, just to pick something. Then we could also do:
```
const auto target_number_of_chunks = thread::hardware_concurrency() * 50;
const auto chunk_size = element_count / target_number_of_chunks; // roughly
```
Then we could also add some logic like not having chunks smaller than X elements or something like that. Or we could make the 50x scale from 3x to 100x based on the number of elements we're processing.
At the end of the day we're pulling those numbers out of thin air, but we might be closer to libdispatch guidelines with something like the above than by basing the calculation on `__default_chunk_size` (which itself is 100% thin air).
You mentioned we probably want to have a logarithmic growth that tops off at (say) 100x the number of cores, and starts "somewhere". I think I agree.
Another observation is that we should probably err on the side of spawning more tasks rather than fewer. At the end of the day, the user did request the parallel version of the algorithm. If they use `std::for_each(std::execution::par)` with a vector of 10 elements, I would argue the user expects us to spawn some tasks, not to say "oh, 10 is really small, let me serialize everything". I would even go as far as to say that we might want to always spawn at least as many tasks as we have cores; in the worst case those tasks are really trivial, the scheduling overhead beats the benefit of running concurrently, and the user made a poor decision to try to parallelize that part of their code.
---------------------------------------
In summary, I would suggest this scheme:
We always spawn at least `min(element_count, thread::hardware_concurrency())` tasks.
When `element_count > thread::hardware_concurrency()`, we increase logarithmically as a function of `element_count` with an asymptote located at roughly `100 * thread::hardware_concurrency()`.
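To make that concrete, something along these lines (the name, the `log2` curve and the constants are purely illustrative and would need tuning):
```
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>

std::ptrdiff_t __number_of_chunks(std::ptrdiff_t __element_count) {
  const auto __cores = std::max<std::ptrdiff_t>(1, std::thread::hardware_concurrency());

  if (__element_count <= __cores)
    return __element_count; // always at least min(element_count, cores) tasks

  // Grow logarithmically with the element count; note that a plain log2 never
  // actually reaches the 100x cap for realistic sizes, so the exact curve is
  // still up for debate.
  const auto __growth = std::log2(static_cast<double>(__element_count) / static_cast<double>(__cores));
  return std::min<std::ptrdiff_t>(100 * __cores,
                                  std::max<std::ptrdiff_t>(__cores, static_cast<std::ptrdiff_t>(__cores * __growth)));
}
```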
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D151717/new/
https://reviews.llvm.org/D151717