[PATCH] D113857: [llvm-reduce] Add parallel chunk processing.

Florian Hahn via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Nov 17 04:26:33 PST 2021


fhahn marked an inline comment as done.
fhahn added inline comments.


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:39-40
+    "max-chunk-threads",
+    cl::desc("Maximum number of threads to use to process chunks. Set to 1 to "
+             "disables parallelism. Maximum capped to 32."),
+    cl::init(1));
----------------
Meinersbur wrote:
> Why stop me from using 128 jobs on a 128 core machine?
> 
> Could warn about diminishing returns because of reduced results being discarded, i.e. exploiting SMT is not that useful.
> 
> [suggestion] Use `-j` (for "jobs") shortcut for consistency with build tools such as ninja and make.
> Why stop me from using 128 jobs on a 128 core machine?

My original intention was to avoid wasting resources in cases where we run a lot of parallel tasks but the first job already reduces the chunk, so the work done for the remaining tasks is discarded.

I adjusted the task management to schedule a number of initial tasks close to the number of threads and then queue new jobs as earlier ones finish, as you suggested. So it may be less of an issue now and the restriction can probably be dropped.

> [suggestion] Use -j (for "jobs") shortcut for consistency with build tools such as ninja and make.

Done!
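
For reference, this is roughly what the option declaration looks like with the `-j` spelling (a hypothetical sketch; the exact option name, description and default in the patch may differ):

```
#include "llvm/Support/CommandLine.h"

// Hypothetical sketch of the option, not the exact patch code: -j picks the
// number of parallel jobs, with 1 meaning fully sequential behaviour.
static llvm::cl::opt<unsigned> NumJobs(
    "j", llvm::cl::desc("Maximum number of threads to use to process chunks. "
                        "Set to 1 to disable parallelism."),
    llvm::cl::init(1));
```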


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:227
+      // there are at least a few chunks to process.
+      if (MaxChunkThreads > 1 && WorkLeft > 4) {
+        NumTasks = std::min(WorkLeft, unsigned(MaxChunkThreads));
----------------
Meinersbur wrote:
> Why require at least 5 parallel jobs? Synchronization overhead?
Yes, synchronization overhead is the reason, but a smaller limit may work just as well. In the current version the threshold is just 2 tasks.


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:228
+      if (MaxChunkThreads > 1 && WorkLeft > 4) {
+        NumTasks = std::min(WorkLeft, unsigned(MaxChunkThreads));
+
----------------
Meinersbur wrote:
> `NumTasks` is originally assigned `MaxChunkThreads * 2` but here it is overwritten again with a different value?
This is now gone.


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:236
+        // corresponding Result element.
+        for (unsigned J = 0; J < NumTasks; ++J) {
+          Results[J] = "";
----------------
Meinersbur wrote:
> Why not leave the number of tasks up to `ThreadPoolStrategy::apply_thread_strategy`?
> 
> Even if `--max-chunk-threads` is larger, the current approach is limited to what `ThreadPoolStrategy::compute_thread_count()` chooses.
> 
> [suggestion] Use `ThreadPool::async` `MaxChunkThreads` times and wait for the first one in the queue. If that one finishes, add another job using `ThreadPool::async` until either one successfully reduces or we are out of chunks still considered interesting.
> Why not leave the number of tasks up to ThreadPoolStrategy::apply_thread_strategy?

Do you know how to do this? It seems like the thread pool constructor expects a strategy to be passed in.

> [suggestion] Use ThreadPool::async MaxChunkThreads times and wait for the first one in the queue. If that one finishes, add another job using ThreadPool::async until either one successfully reduces or we are out of chunks still considered interesting.

That sounds like a good idea, thanks! I updated the code to work along those lines, unless I missed something. I also added a shared variable to indicate whether *any* task already reduced something, to avoid adding new jobs in that case.

This approach is slightly faster than the original patch for reducing `llvm/test/tools/llvm-reduce/operands-skip.ll` in parallel.
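
To make the scheduling a bit more concrete, here is a rough stand-alone sketch of the loop I ended up with (simplified, with placeholder names like `NumJobs` and `CheckOneChunk`; the actual patch differs in the details):

```
#include "llvm/Support/ThreadPool.h"
#include "llvm/Support/Threading.h"
#include <atomic>
#include <deque>
#include <future>

// Sketch only: schedule up to NumJobs checks, wait for the oldest one, and
// keep queueing more work only while no task has reduced anything yet.
static void processChunksInParallel(unsigned NumJobs, unsigned NumChunks,
                                    bool (*CheckOneChunk)(unsigned)) {
  llvm::ThreadPool Pool(llvm::hardware_concurrency(NumJobs));
  std::atomic<bool> AnyReduced(false);
  std::deque<std::shared_future<void>> Futures;

  unsigned NextChunk = 0;
  auto ScheduleNext = [&]() {
    unsigned Idx = NextChunk++;
    Futures.push_back(Pool.async([&AnyReduced, CheckOneChunk, Idx]() {
      if (CheckOneChunk(Idx))
        AnyReduced = true;
    }));
  };

  // Fill the queue with one task per job initially.
  while (NextChunk < NumChunks && Futures.size() < NumJobs)
    ScheduleNext();

  // Wait for tasks in FIFO order; only add new ones while nothing has been
  // reduced, so we stop handing out work once a reduction has been found.
  while (!Futures.empty()) {
    Futures.front().wait();
    Futures.pop_front();
    if (!AnyReduced && NextChunk < NumChunks)
      ScheduleNext();
  }
}
```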


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:249
+
+            auto Res = CheckChunk(*(I + J), std::move(CloneMMM));
+            if (Res) {
----------------
Meinersbur wrote:
> We have `Result` and `Res` in the same scope. Better names?
Thanks, I updated it to `ChunkResult`.


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:278
+        if (!Result)
+          I += NumTasks - 1;
+      } else {
----------------
Meinersbur wrote:
> There are two places where `I` is incremented and far apart. Suggestion:
> ```
> for (auto I = ChunksStillConsideredInteresting.rbegin(),
>               E = ChunksStillConsideredInteresting.rend();
>          I != E; ) {
>   bool UseThreadPool = MaxChunkThreads > 1 && WorkLeft > 4;
>   int WorkItemsInThisJob = UseThreadPool ? WorkLeft : 1;
>   ...
>   I += WorkItemsInThisJob;
> ```
> 
Thanks, I added a new `NumChunksProcessed`, which is set to 1 initially and then updated in the loop where we process the parallel results.
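
For illustration, a minimal stand-alone sketch of that control flow (the container and names here are placeholders, not the actual llvm-reduce code); the iterator is now advanced in exactly one place:

```
#include <algorithm>
#include <vector>

// Sketch only: advance the reverse iterator by however many chunks the
// current iteration actually processed, sequentially or in parallel.
static void walkChunks(std::vector<int> &Chunks, unsigned NumJobs) {
  for (auto I = Chunks.rbegin(), E = Chunks.rend(); I != E;) {
    unsigned WorkLeft = static_cast<unsigned>(E - I);
    unsigned NumChunksProcessed = 1;
    if (NumJobs > 1 && WorkLeft > 2) {
      // Parallel path: check up to NumJobs chunks and record how many of
      // them were consumed before a successful reduction (here: all of them).
      NumChunksProcessed = std::min(WorkLeft, NumJobs);
    } else {
      // Sequential path: check the single chunk *I.
    }
    I += NumChunksProcessed;
  }
}
```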


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113857/new/

https://reviews.llvm.org/D113857


