[PATCH] [Core] Add parallel infrastructure to lld.

Michael Spencer bigcheesegs at gmail.com
Tue Apr 9 17:38:57 PDT 2013



================
Comment at: include/lld/Core/Parallel.h:68-69
@@ +67,4 @@
+public:
+  ThreadPoolExecutor(std::size_t threadCount =
+                         std::thread::hardware_concurrency())
+      : _stop(false), _done(threadCount) {
----------------
Sean Silva wrote:
> Michael Spencer wrote:
> > Shankar Kalpathi Easwaran wrote:
> > > I think size_t is pretty big for a thread count.
> > I'll change it to the return type of hardware_concurrency (unsigned).
> The core count is probably not a good choice for the default, since then you risk getting blocked waiting on IO.
> 
> Since LLD uses memory-mapped IO, there's no way to do it "async" since the IO will be incurred when you take a major fault (which is transparent to the program). So, for tasks that require IO, you really need as many threads as parallel IO's that you are going to be doing. Another alternative is careful use of madvise(), but I don't think LLD is currently introspective enough about its access patterns to be able to effectively do that at this point.
> 
> For compute-intensive tasks though, core count is probably fine. So I would discourage having this take a default. Only the client of the API really knows how much parallelism they need, and they should specify it explicitly.
This is for the default Executor; there's really no better choice to make. If a specific client finds that it needs more, it can create its own ThreadPoolExecutor with some other thread count. Removing the default here would just move it over to getDefaultExecutor().

================
Comment at: include/lld/Core/Parallel.h:194-196
@@ +193,5 @@
+
+template <class RandomAccessIterator, class Comp>
+void parallel_quick_sort(RandomAccessIterator start, RandomAccessIterator end,
+                         const Comp &comp, TaskGroup &tg, size_t depth = 0) {
+  // Do a sequential sort for small inputs.
----------------
Sean Silva wrote:
> Have you benchmarked this? I'm curious what the cutoff is before this starts beating std::sort. (especially libcxx's std::sort). I wouldn't be surprised if the cutoff is *very* large.
I benchmarked this with ParallelTest.cpp with an array taking up 1 GiB (that's as high as MSVC will go, even in x64 mode :(). This was the best cutoff for that specific case. I expect it may vary widely depending on how expensive your comparator is; PPL actually takes this as a parameter to parallel_sort.

================
Comment at: include/lld/Core/Parallel.h:198
@@ +197,3 @@
+  // Do a sequential sort for small inputs.
+  if (std::distance(start, end) < detail::minParallelSize || depth > maxDepth) {
+    std::sort(start, end, comp);
----------------
Sean Silva wrote:
> This algorithm can go quadratic. You should fall back to std::sort if the depth becomes larger than logarithmic in the length of the range to be sorted to ensure n*log(n). Also, it would make sense to fall back to std::sort if the depth gets greater than log2(coreCount) to avoid increasing the parallelism beyond the number of cores.
Good idea on the max depth to use. As for the max task count, I found that limiting the max splits to the number of cores was slower, as chunks take differing times to complete. Smaller chunks balance better.

================
Comment at: include/lld/Core/Parallel.h:213-219
@@ +212,9 @@
+
+  // Recurse.
+  tg.spawn([=, &tg] {
+    parallel_quick_sort(start, pivot, comp, tg, depth + 1);
+  });
+  tg.spawn([=, &tg] {
+    parallel_quick_sort(pivot + 1, end, comp, tg, depth + 1);
+  });
+}
----------------
Sean Silva wrote:
> You can "tail call" into one of these.
yep.


http://llvm-reviews.chandlerc.com/D649



More information about the llvm-commits mailing list