[PATCH] D113857: [llvm-reduce] Add parallel chunk processing.

Michael Kruse via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Nov 17 10:55:31 PST 2021


Meinersbur added inline comments.


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:42-46
+static cl::opt<unsigned int> NumJobs(
+    "j",
+    cl::desc("Maximum number of threads to use to process chunks. Set to 1 to "
+             "disables parallelism. Maximum capped to 32."),
+    cl::init(1));
----------------
Remove the option unless LLVM is compiled with LLVM_ENABLE_THREADS? We will not get any parallelism otherwise.
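For illustration, one way this could look (a sketch only, assuming the option keeps the name from this patch):
```
#include "llvm/Config/llvm-config.h" // provides LLVM_ENABLE_THREADS

#if LLVM_ENABLE_THREADS
static cl::opt<unsigned> NumJobs(
    "j",
    cl::desc("Maximum number of threads to use to process chunks. Set to 1 "
             "to disable parallelism."),
    cl::init(1));
#else
// Without LLVM_ENABLE_THREADS every task runs on the calling thread anyway,
// so do not pretend that -j does something.
static constexpr unsigned NumJobs = 1;
#endif
```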


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:266
+
+    SmallVector<SmallString<0>> Results(NumJobs * 3, {});
+    for (auto I = ChunksStillConsideredInteresting.rbegin(),
----------------
[serious] `Results` is never resized to cover the last `NumChunksProcessed` entries, i.e. we get a buffer overflow. Instead, we could use a container that never invalidates its elements, such as `std::deque`, or use a circular buffer.

Alternatively, you could change `TaskQueue` to `std::deque<std::pair<std::shared_future<void>, ResultTy>>` to keep each task and its result in the same queue.
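For illustration (a sketch only; I'm using `SmallString<0>` as a stand-in for whatever the task actually produces):
```
// Each entry pairs the future to wait on with the storage its task writes
// into. std::deque never invalidates references to existing elements on
// push_back/pop_front, so a worker's pointer into its element stays valid.
std::deque<std::pair<std::shared_future<void>, SmallString<0>>> TaskQueue;
```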


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:306
+        //  * we have not reached the end of the chunk.
+        while (!TaskQueue.empty()) {
+          auto &F = TaskQueue.front();
----------------
This waits until the task queue is empty. Could we bail out earlier, as soon as a reduced version is available, and just forget the jobs that have not yet finished while scheduling the next ones?


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:317
+            {
+              std::unique_lock<std::mutex> Lock(AnyReducedMutex);
+              NotReduced = !AnyReduced;
----------------
For just a bool, maybe use `std::atomic<bool>` instead?
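For illustration (sketch):
```
std::atomic<bool> AnyReduced{false};

// In the worker, after a successful reduction:
AnyReduced.store(true, std::memory_order_relaxed);

// When deciding whether to schedule another speculative task:
bool NotReduced = !AnyReduced.load(std::memory_order_relaxed);
```
Relaxed ordering should be enough here, since the flag only gates whether additional speculative tasks get queued.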


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:39-40
+    "max-chunk-threads",
+    cl::desc("Maximum number of threads to use to process chunks. Set to 1 to "
+             "disables parallelism. Maximum capped to 32."),
+    cl::init(1));
----------------
fhahn wrote:
> Meinersbur wrote:
> > Why stop me from using 128 jobs on a 128 core machine?
> > 
> > Could warn about diminishing returns because of reduced results being discarded, i.e. exploiting SMT is not that useful.
> > 
> > [suggestion] Use `-j` (for "jobs") shortcut for consistency with build tools such as ninja and make.
> > Why stop me from using 128 jobs on a 128 core machine?
> 
> My original intention was to avoid wasting resources in cases where we run a lot of parallel tasks, but only the first job already reduced the chunk.
> 
> I adjusted the task management now to schedule a number of initial tasks closer to the number of threads and then queue new jobs as you suggested. So maybe it is less of an issue now and the restriction can be dropped.
> 
> > [suggestion] Use -j (for "jobs") shortcut for consistency with build tools such as ninja and make.
> 
> Done!
I still think it's the user's right to shoot themselves in the foot if they want to (it would be different if the number of jobs were determined by a heuristic). A maximum could be suggested in the description/documentation, together with an explanation why. A hard limit blocks legitimate use cases.


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:236
+        // corresponding Result element.
+        for (unsigned J = 0; J < NumTasks; ++J) {
+          Results[J] = "";
----------------
fhahn wrote:
> Meinersbur wrote:
> > Why not leave the number of tasks up to `ThreadPoolStrategy::apply_thread_strategy`?
> > 
> > Even if `--max-chunk-threads` is larger, the current approach is limited to what `ThreadPoolStrategy::compute_thread_count()` chooses.
> > 
> > [suggestion] Use `ThreadPool::async` `MaxChunkThreads` times and wait for the first one in the queue. If that one finishes, add another job using `ThreadPool::async` until either one successfully reduces or we are out of chunks still considered interesting.
> > Why not leave the number of tasks up to ThreadPoolStrategy::apply_thread_strategy?
> 
> Do you know how to do this? It seems like the thread pool constructor expects a strategy to be passed in.
> 
> > [suggestion] Use ThreadPool::async MaxChunkThreads times and wait for the first one in the queue. If that one finishes, add another job using ThreadPool::async until either one successfully reduces or we are out of chunks still considered interesting.
> 
> That sounds like a good idea, thanks! I updated the code to work along those lines, unless I missed something. I also added a shared variable to indicate whether *any* task already reduced something, to avoid adding new jobs in that case.
> 
> This approach is slightly faster than the original patch for reducing `llvm/test/tools/llvm-reduce/operands-skip.ll` in parallel.
> That sounds like a good idea, thanks! I updated the code to work along those lines, unless I missed something.

Nice! Thanks!

>>Why not leave the number of tasks up to ThreadPoolStrategy::apply_thread_strategy?
> Do you know how to do this? It seems like the thread pool constructor expects a strategy to be passed in.

I thought we could then override `apply_thread_strategy` to either set `NumThreads` or, if no number of threads is specified, call the inherited method. But it's non-virtual, and your approach of `hardware_concurrency(NumJobs)` does about the same thing.

To get the number of effective jobs, we could call `compute_thread_count` on the `ThreadPoolStrategy`. This still applies when the count is set explicitly via `hardware_concurrency(NumJobs)`, since `compute_thread_count` caps it at the number of hardware threads.
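I.e. something like this (sketch; `EffectiveNumJobs` is just a name I made up):
```
// The strategy decides the effective thread count; query it before creating
// the pool so the rest of the code can size its bookkeeping accordingly.
ThreadPoolStrategy S = hardware_concurrency(NumJobs);
unsigned EffectiveNumJobs = S.compute_thread_count();
ThreadPool ChunkThreadPool(S);
```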

Currently `-j<n>` requires the user to specify the number of jobs, but if we wanted a heuristic, we could set `UseHyperThreads = false` without setting `ThreadsRequested`. This heuristically determined number of jobs is what we might cap at 32. Since I am not sure this would be a good heuristic, we may keep requiring the user to specify the number of jobs.

I'd remove the `+2` for `NumInitialTasks`: with `hardware_concurrency(NumJobs)`, no additional threads will be created for those two extra jobs anyway.


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:278
+        if (!Result)
+          I += NumTasks - 1;
+      } else {
----------------
fhahn wrote:
> Meinersbur wrote:
> > There are two places where `I` is incremented and far apart. Suggestion:
> > ```
> > for (auto I = ChunksStillConsideredInteresting.rbegin(),
> >               E = ChunksStillConsideredInteresting.rend();
> >          I != E; ) {
> >   bool UseThreadPool = MaxChunkThreads > 1 && WorkLeft > 4;
> >   int WorkItemsInThisJob = UseThreadPool ? WorkLeft : 1;
> >   ...
> >   I += WorkItemsInThisJob;
> > ```
> > 
> Thanks, I added a new `NumChunksProcessed`, which is set to 1 initially and then in the loop where we process the parallel results.
The updated patch still increments `I` both in the for-statement parentheses and in the body. As a reader of the code, I find this very unexpected. Since with the task queue the `while` loop will always process all work items (or break on finding a reduction), there is no advantage in having them in the same loop anymore. Updated suggestion:
```
if (NumJobs > 1 && ChunksStillConsideredInteresting.size() > 1) {
  auto I = ChunksStillConsideredInteresting.begin();
  bool AnyReduced = false;
  do {
    if (!TaskQueue.empty()) {
      // Wait for the oldest task; if it found a reduction, we are done.
      auto TaskAndResult = TaskQueue.pop_front_val();
      TaskAndResult.first.wait();
      if (TaskAndResult.second)
        return TaskAndResult.second; // Found a result.
    }
    // Refill the queue (this also schedules the initial jobs) unless a
    // reduction has already been found or we are out of chunks.
    while (!AnyReduced && TaskQueue.size() < NumJobs &&
           I != ChunksStillConsideredInteresting.end()) {
      TaskQueue.push_back(ChunkThreadPool.async(...));
      ++I;
    }
  } while (!TaskQueue.empty());
} else {
  for (Chunk &ChunkToCheckForUninterestingness :
       reverse(ChunksStillConsideredInteresting))
    std::unique_ptr<ReducerWorkItem> Result = CheckChunk(...);
}
```

This also uses the same code for adding the initial jobs and re-filling the queue.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113857/new/

https://reviews.llvm.org/D113857


