[PATCH] D113857: [llvm-reduce] Add parallel chunk processing.

Florian Hahn via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Nov 18 13:49:58 PST 2021


fhahn added inline comments.


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:42-46
+static cl::opt<unsigned int> NumJobs(
+    "j",
+    cl::desc("Maximum number of threads to use to process chunks. Set to 1 to "
+             "disable parallelism. Maximum capped to 32."),
+    cl::init(1));
----------------
Meinersbur wrote:
> Remove the option unless LLVM is compiled with LLVM_ENABLE_THREADS? We will not get any parallelism otherwise.
Updated! If `LLVM_ENABLE_THREADS` is not set, `NumJobs` is defined as `unsigned` with a value of 1.
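
For reference, a minimal sketch of the conditional definition described above (the exact guard and placement in Delta.cpp are assumptions, not the literal patch):

```cpp
#include "llvm/Config/llvm-config.h"
#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Sketch only: expose -j as a cl::opt when threading is available, otherwise
// fall back to a plain unsigned fixed to 1, as described above.
#if LLVM_ENABLE_THREADS
static cl::opt<unsigned> NumJobs(
    "j",
    cl::desc("Maximum number of threads to use to process chunks. Set to 1 to "
             "disable parallelism."),
    cl::init(1));
#else
static unsigned NumJobs = 1;
#endif
```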


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:306
+        //  * we have not reached the end of the chunk.
+        while (!TaskQueue.empty()) {
+          auto &F = TaskQueue.front();
----------------
Meinersbur wrote:
> fhahn wrote:
> > Meinersbur wrote:
> > > This waits until the task queue is empty. Could we leave earlier as soon as a reduced version is available, and just forget the jobs not yet finished while scheduling the next ones?
> > The current approach waits for each task, until it finds the first one that leads to a reduction. 
> > 
> > 
> > It intentionally does not try to stop once any job reduces a chunk, because if we pick any reduced chunk we might get different results than during serial reduction.
> Assume the current content of TaskQueue
> 
> 0: job finished, no valid reduction
> 1: job still running
> 2: job finished, valid reduction result
> 3: job still running
> 
> I understood that the intention of AnyReduced is to not queue additional jobs (assuming `NumJobs > 4`) since we know that when hitting 2, we will have a reduced result. We still need to wait for job 1 which may also compute a valid reduction.
> I think this is clever. I had `std::future<ResultTy>` in mind, as you proposed yourself, until I discovered that `llvm::ThreadPool` only supports a void return type :-(. However, to avoid submitting more jobs after we know that a result will be available, I think an `AnyReduced` flag would still be useful.
> 
> My comment was regarding index 3 in the queue. Assuming job 1 finishes and job 2 becomes the front of the queue, do we still need to wait for job 3 before submitting jobs based on the new intermediate result? It may cause issues with who becomes responsible for freeing resources, so I am not sure it's feasible.
I updated the code to communicate the result via the `shared_future`. This depends on D114183 now.

> My comment was regarding index 3 in the queue. Assuming job 1 finishes and job 2 becomes the front of the queue, do we still need to wait for job 3 before submitting jobs based on the new intermediate result? It may cause issues with who becomes responsible for freeing resources, so I am not sure it's feasible.

Oh I see, you are thinking about already submitting jobs based on the updated state? At the moment I am not really sure what kind of issues that would bring and whether it is feasible. I think it would be good to try to keep the initial version as simple as possible.
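
Roughly, the queue handling described here could look like the following sketch (it assumes D114183's `ThreadPool::async` hands the result back through a `std::shared_future`; the result type and helper name are placeholders, not the actual patch code):

```cpp
#include "llvm/ADT/SmallString.h"
#include <deque>
#include <future>

// Placeholder result type, e.g. a serialized reduced module; empty means the
// chunk could not be reduced.
using ResultTy = llvm::SmallString<0>;

// Wait for tasks strictly in submission order and return the first successful
// reduction, so the accepted chunk matches what serial reduction would pick.
static ResultTy takeFirstReduction(
    std::deque<std::shared_future<ResultTy>> &TaskQueue) {
  while (!TaskQueue.empty()) {
    std::shared_future<ResultTy> Task = TaskQueue.front();
    TaskQueue.pop_front();
    Task.wait();
    if (!Task.get().empty())
      return Task.get();
  }
  return {};
}
```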


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:317
+            {
+              std::unique_lock<std::mutex> Lock(AnyReducedMutex);
+              NotReduced = !AnyReduced;
----------------
Meinersbur wrote:
> For just a bool, maybe use `std::atomic<bool>` instead?
Done, thanks!
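
A minimal sketch of that change (the flag name comes from the patch context; the helper functions are illustrative only):

```cpp
#include <atomic>

// A single flag shared between tasks only needs to be atomic; the previous
// mutex-protected bool and its std::unique_lock can be dropped.
static std::atomic<bool> AnyReduced{false};

static void noteReductionFound() { AnyReduced.store(true); }

static bool shouldScheduleMoreTasks() {
  // Replaces the former locked read of the plain bool.
  return !AnyReduced.load();
}
```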


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:39-40
+    "max-chunk-threads",
+    cl::desc("Maximum number of threads to use to process chunks. Set to 1 to "
+             "disable parallelism. Maximum capped to 32."),
+    cl::init(1));
----------------
Meinersbur wrote:
> fhahn wrote:
> > Meinersbur wrote:
> > > Why stop me from using 128 jobs on a 128 core machine?
> > > 
> > > Could warn about diminishing returns because of reduced results being discarded, i.e. exploiting SMT is not that useful.
> > > 
> > > [suggestion] Use `-j` (for "jobs") shortcut for consistency with build tools such as ninja and make.
> > > Why stop me from using 128 jobs on a 128 core machine?
> > 
> > My original intention was to avoid wasting resources in cases where we run a lot of parallel tasks, but only the first job already reduced the chunk.
> > 
> > I adjusted the task management now to schedule a number of initial tasks closer to the number of threads and then queue new jobs as you suggested. So maybe it is less of an issue now and the restriction can be dropped.
> > 
> > > [suggestion] Use -j (for "jobs") shortcut for consistency with build tools such as ninja and make.
> > 
> > Done!
> I still think it's the user's right to shoot themselves in the foot if they want to (it would be different if the number of jobs were determined by a heuristic). A maximum could be suggested in the description/documentation, also explaining why. A hard limit blocks legitimate use cases.
Sounds good, I dropped the limit.


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:236
+        // corresponding Result element.
+        for (unsigned J = 0; J < NumTasks; ++J) {
+          Results[J] = "";
----------------
Meinersbur wrote:
> fhahn wrote:
> > Meinersbur wrote:
> > > Why not leave the number of tasks up to `ThreadPoolStrategy::apply_thread_strategy`?
> > > 
> > > Even if `--max-chunk-threads` is larger, the current approach is limited to what `ThreadPoolStrategy::compute_thread_count()` chooses.
> > > 
> > > [suggestion] Use `ThreadPool::async` `MaxChunkThreads` times and wait for the first one in the queue. If that one finishes, add another job using `ThreadPool::async` until either one successfully reduces or we are out of chunks still considered interesting.
> > > Why not leave the number of tasks up to ThreadPoolStrategy::apply_thread_strategy?
> > 
> > Do you know how to do this? It seems like the thread pool constructor expects a strategy to be passed in.
> > 
> > > [suggestion] Use ThreadPool::async MaxChunkThreads times and wait for the first one in the queue. If that one finishes, add another job using ThreadPool::async until either one successfully reduces or we are out of chunks still considered interesting.
> > 
> > That sounds like a good idea, thanks! I updated the code to work along those lines, unless I missed something. I also added a shared variable to indicate whether *any* task already reduced something, to avoid adding new jobs in that case.
> > 
> > This approach is slightly faster than the original patch for reducing `llvm/test/tools/llvm-reduce/operands-skip.ll` in parallel.
> > That sounds like a good idea, thanks! I updated the code to work along those lines, unless I missed something.
> 
> Nice! Thanks!
> 
> >>Why not leave the number of tasks up to ThreadPoolStrategy::apply_thread_strategy?
> > Do you know how to do this? It seems like the thread pool constructor expects a strategy to be passed in.
> 
> I thought we could then override `apply_thread_strategy` to set either `NumThreads` or, if no number of threads is specified, call the inherited methods. But it's not virtual, and your approach of `hardware_concurrency(NumJobs)` does about the same thing.
> 
> To get the number of effective jobs, we could call `compute_thread_count` of `ThreadPoolStrategy`. This still applies when the count is given explicitly via `hardware_concurrency(NumJobs)`, since `compute_thread_count` caps it at the number of hardware threads.
> 
> Currently `-j<n>` requires the user to specify a number of jobs, but if we want to have a heuristic, we could set `UseHyperThreads = false` without setting `ThreadsRequested`. This heuristically determined number of jobs is what we might cap at 32. Since I am not sure this is a good heuristic, we may keep requiring the user to specify the number of jobs.
> 
> I'd remove the `+2` for `NumInitialTasks`, since with `hardware_concurrency(NumJobs)` no additional threads will be created for those two extra jobs.
> Since I am not sure this is a good heuristic, we may keep requiring the user to specify the number of jobs.

I think for now requiring a user specified value is the easiest.

> I'd remove the +2 for NumInitialTasks as by hardware_concurrency(NumJobs), there will not be more threads created for those two additional jobs.

The main intention was to have a few more jobs ready to fill threads when jobs exit early, but I think it could do more harm than good. I removed it.
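
For illustration, a sketch of the pool setup discussed above (the wrapper function and variable names are hypothetical; only `hardware_concurrency(NumJobs)` and `compute_thread_count()` come from the discussion):

```cpp
#include "llvm/Support/ThreadPool.h"
#include "llvm/Support/Threading.h"

static void setUpChunkThreadPool(unsigned NumJobs) {
  // The strategy caps the requested job count at the available hardware
  // threads, so the effective number of parallel jobs can be queried up front.
  llvm::ThreadPoolStrategy Strategy = llvm::hardware_concurrency(NumJobs);
  unsigned EffectiveJobs = Strategy.compute_thread_count();
  llvm::ThreadPool ChunkThreadPool(Strategy);
  (void)EffectiveJobs; // schedule at most this many tasks at a time
}
```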


================
Comment at: llvm/tools/llvm-reduce/deltas/Delta.cpp:278
+        if (!Result)
+          I += NumTasks - 1;
+      } else {
----------------
Meinersbur wrote:
> fhahn wrote:
> > Meinersbur wrote:
> > > There are two places where `I` is incremented and far apart. Suggestion:
> > > ```
> > > for (auto I = ChunksStillConsideredInteresting.rbegin(),
> > >               E = ChunksStillConsideredInteresting.rend();
> > >          I != E; ) {
> > >   bool UseThreadPool = MaxChunkThreads > 1 && WorkLeft > 4;
> > >   int WorkItemsInThisJob = UseThreadPool ? WorkLeft : 1;
> > >   ...
> > >   I += WorkItemsInThisJob;
> > > ```
> > > 
> > Thanks, I added a new `NumChunksProcessed`, which is set to 1 initially and then in the loop where we process the parallel results.
> The updated patch still increments `I` in the for-statement parens as well as in the body. As a reader of the code, I find this very unexpected. Since, with the task queue, the `while` loop will always process all work items (or break on finding a reduction), there is no advantage in having them in the same loop anymore. Updated suggestion:
> ```
> if (NumJobs > 1 && ChunksStillConsideredInteresting.size() > 1) {
>   auto I = ChunksStillConsideredInteresting.begin();
>   bool AnyReduced = false;
>   do {
>     if (!TaskQueue.empty()) {
>       auto TaskAndResult = TaskQueue.pop_front_val();
>       TaskAndResult.first.wait();
>       if (TaskAndResult.second)
>         return TaskAndResult.second; // Found a result
>     }
>     while (!AnyReduced && TaskQueue.size() < NumJobs &&
>            I != ChunksStillConsideredInteresting.end()) {
>       TaskQueue.push_back(ChunkThreadPool.async(...));
>       ++I;
>     }
>   } while (!TaskQueue.empty());
> } else {
>   for (Chunk &ChunkToCheckForUninterestingness : reverse(ChunksStillConsideredInteresting))
>       std::unique_ptr<ReducerWorkItem> Result = CheckChunk(...);
> }
> ```
> 
> This also uses the same code for adding the initial jobs and re-filling the queue.
> This also uses the same code for adding the initial jobs and re-filling the queue.

Ah, thanks for elaborating! I have not updated the code so far because I am not yet sure how best to include the processing of the found result. When splitting it into 2 separate loops, I am not sure how best to continue processing the next chunk after the reduced one.

I could merge the 2 loops in the current implementation (the one that schedules the initial jobs and the one that then processes the queue), though, if you think that's easier to read.




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113857/new/

https://reviews.llvm.org/D113857


