[libcxx-commits] [PATCH] D151717: [libc++][PSTL] Add a GCD backend
Louis Dionne via Phabricator via libcxx-commits
libcxx-commits at lists.llvm.org
Thu Jun 29 10:46:19 PDT 2023
ldionne added inline comments.
================
Comment at: libcxx/src/pstl/libdispatch.cpp:28-29
+
+__chunk_partitions __partition_chunks(ptrdiff_t element_count) {
+ const auto requested_chunk_size = __default_chunk_size / thread::hardware_concurrency();
+
----------------
ldionne wrote:
> Let's say we want 3x to 100x as many "tasks" as we have cores. Let's say for now that we always want 50x as many, just to pick something. Then we could also do:
>
> ```
> const auto target_number_of_chunks = thread::hardware_concurrency() * 50;
> const auto chunk_size = element_count / target_number_of_chunks; // roughly
> ```
>
> Then we could also add some logic like not having chunks smaller than X elements or something like that. Or we could make the 50x scale from 3x to 100x based on the number of elements we're processing.
>
> At the end of the day we're pulling those numbers out of thin air, but we might be closer to libdispatch guidelines with something like the above than by basing the calculation on `__default_chunk_size` (which itself is 100% thin air).
>
> You mentioned we probably want to have a logarithmic growth that tops off at (say) 100x the number of cores, and starts "somewhere". I think I agree.
>
> Another observation is that we should probably err in favour of spawning more tasks rather than fewer. At the end of the day, the user did request the parallel version of the algorithm. If they use `std::for_each(std::execution::par)` with a vector of 10 elements, I would argue the user expects us to spawn some tasks, not to say "oh 10 is really small, let me serialize everything". I would even go as far as to say that we might want to always spawn at least as many tasks as we have cores; in the worst case those tasks are really trivial, the scheduling overhead beats the benefit of running concurrently, and the user made a poor decision to try and parallelize that part of their code.
>
> ---------------------------------------
>
> In summary, I would suggest this scheme:
>
> We spawn at least `min(element_count, thread::hardware_concurrency())` tasks always.
> When `element_count > thread::hardware_concurrency()`, we increase logarithmically as a function of `element_count` with an asymptote located at roughly `100 * thread::hardware_concurrency()`.
>
For small numbers of elements (`< cores`), we could create one task for each element. That's the only way to nicely handle the case where processing each element is really heavy and we really want to parallelize. Also it should be rare that users use `std::execution::par` to sort a vector of 3 ints. If they do, we have no way to tell and I'd argue it's not unreasonable for us to spawn 3 tasks.
For medium numbers of elements (say `< 500`): `cores + ((n-cores) / cores)`. This gives us a smooth transition from the previous size and then it basically grows linearly with `n`.
For larger numbers: Assuming 8 cores, solving `log(1.01, sqrt(n)) = 800` (equivalently `100.499 * log(sqrt(n)) = 800`, according to Wolfram Alpha) requires `n` to be roughly 8.2 million elements. That means we'd create `100 * 8` tasks at 8.2 million elements, and beyond that it grows really slowly. That doesn't seem unreasonable.
This is not very scientific, I'm afraid, but this seems to be somewhat reasonable while playing around with a couple of values.
```
// hardware_concurrency() may return 0, so clamp to at least 1.
ptrdiff_t cores = std::max<ptrdiff_t>(1, thread::hardware_concurrency());
auto small  = [](ptrdiff_t n) { return n; };
auto medium = [&](ptrdiff_t n) { return cores + ((n - cores) / cores); };
// Approximation of `log(1.01, sqrt(n))` (log base 1.01), which seemed to be
// reasonable for `n` larger than 500 and tops out at ~800 tasks for
// n ~ 8 million.
auto large  = [](ptrdiff_t n) { return static_cast<ptrdiff_t>(100.499 * std::log(std::sqrt(double(n)))); };
if (n < cores) return small(n);
else if (n < 500) return medium(n);
else return std::min(medium(n), large(n)); // provide a "smooth" transition
```
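As a sanity check on the constants (just arithmetic, not from the review itself): the `100.499` factor is `1 / ln(1.01)`, and the 8.2 million crossover follows from it:

```latex
\log_{1.01}\sqrt{n} = \frac{\ln\sqrt{n}}{\ln 1.01} \approx \frac{\ln\sqrt{n}}{0.00995} \approx 100.499\,\ln\sqrt{n}

100.499 \cdot \ln\sqrt{n} = 800 \;\Rightarrow\; \ln n \approx 15.92 \;\Rightarrow\; n \approx e^{15.92} \approx 8.2\times 10^{6}
```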
The above assumes that we have 8 cores, but everything kind of needs to be adjusted for different numbers of cores. I suggest we start with this just to unblock the patch, but leave a big fat comment explaining this needs to be revisited. We can take a look with the dispatch folks, or if anyone observing this review has an idea, please feel free to chime in.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D151717/new/
https://reviews.llvm.org/D151717
More information about the libcxx-commits mailing list