[PATCH] D123225: [ThreadPool] add ability to group tasks into separate groups

Mon Apr 18 10:29:46 PDT 2022

llunak marked an inline comment as done.
llunak added inline comments.

================
Comment at: llvm/include/llvm/Support/ThreadPool.h:94
+  /// It is possible to recursively wait even inside a task, if the group is
+  /// different. There may be only one active wait() call for a given group.
+  /// It is possible to add new tasks while blocking on this call, if those
----------------
mehdi_amini wrote:
> llunak wrote:
> > mehdi_amini wrote:
> > > Do we have a way to assert on this?
> > Not without extra variables added just for detecting that. Is that something that would be done in LLVM code?
> > 
> Yes it is common done, we guard such code in header within `#if LLVM_ENABLE_ABI_BREAKING_CHECKS`, you can grep the code base for examples (here is one: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/Analysis/LoopInfo.h#L83-L87 )
I see. I've added an assert for self-wait, but I actually see no good reason to enforce that there can be only one wait() for a group, so I'll instead remove that comment.

================
Comment at: llvm/include/llvm/Support/ThreadPool.h:106
   bool isWorkerThread() const;

 private:
----------------
mehdi_amini wrote:
> llunak wrote:
> > clayborg wrote:
> > > mehdi_amini wrote:
> > > > I think you need more documentation for the groups, at minima:
> > > > - in particular ownership model / lifetime expectation for the `TaskGroup` objects. 
> > > > - the existing APIs that don't take group should like be updated to be documented with respect to groups (is there a concept of a "default group" represented by the nullptr?).
> > > > 
> > > That would be good. Maybe add headerdoc before the TaskGroup class. If we end up making the TaskGroup constructor take a reference to the ThreadPool, we should mention that the ThreadPool must outlive the TaskGroup as far as lifetime goes. More documentation or examples would be nice for:
> > > - Making a TaskGroup 1 and then during work for TaskGroup 1 creating a TaskGroup 2 and then waiting on that within a worker thread
> > > - details on if you can add to a group while work is ongoing for that group
> > > 
> > I'm confused about this part about documentation, because I think all of this is either documented or obvious. Did you miss those or are they not as obvious as I find them to be? It seems to me that you expect this change to be more complex than it is - it's just an ability to tag tasks with a TaskGroup pointer and then wait on tasks with a specific tag. The only somewhat complex thing here is making sure wait() called from within a task doesn't not deadlock, and that's not an API thing.
> > 
> > > - in particular ownership model / lifetime expectation for the `TaskGroup` objects.
> > 
> > The TaskGroup objects are passed by reference, so I don't see how there could be any ownership. And TaskGroup objects are groups, so lifetime of TaskGroup objects and lifetime of groups are the same thing.
> > 
> > > - the existing APIs that don't take group should like be updated to be documented with respect to groups (is there a concept of a "default group" represented by the nullptr?).
> > 
> > Tasks without a group are tasks without a group. It really matters only for wait(), which already says that it waits for all threads.
> > 
> > > - Making a TaskGroup 1 and then during work for TaskGroup 1 creating a TaskGroup 2 and then waiting on that within a worker thread
> > 
> > This is done by simply doing it, there's nothing special about it from the API point of view, and wait() documentation says that this is possible. What more documentation would you need?
> > 
> > > - details on if you can add to a group while work is ongoing for that group
> > 
> > The description of wait() already says that no. What other details do you need?
> > 
> > I'm confused about this part about documentation, because I think all of this is either documented or obvious. Did you miss those or are they not as obvious as I find them to be?
> 
> I think there is always a natural tendency to find things "obvious" when we design some thing with a mental model and other assumptions in mind. But someone coming there new with fresh eyes won't find things as obvious.
> 
> 
> > in particular ownership model / lifetime expectation for the TaskGroup objects.
> > The TaskGroup objects are passed by reference, so I don't see how there could be any ownership. And TaskGroup objects are groups, so lifetime of TaskGroup objects and lifetime of groups are the same thing.
> 
> Right but that has implication on the lifetime: for example you should destroy a group while there are still tasks in flight using this group. I think this particular point isn't a concern anymore now that you call wait() in the ThreadPoolTaskGroup destructor.
> 
> > Making a TaskGroup 1 and then during work for TaskGroup 1 creating a TaskGroup 2 and then waiting on that within a worker thread
> > This is done by simply doing it, there's nothing special about it from the API point of view, and wait() documentation says that this is possible. What more documentation would you need?
> 
> Waiting within a worker thread wasn't allowed until now: it is again more of a question of forming a mental model about the system. 
> Adding this concept of groups (which I welcome: it is solving a major limitation of the pool) is really making the existing model of a simple ThreadPool no longer as "simple", hence why I'm may seem overly cautious :)
> 
> Some things can be surprising as well for someone who does not really know or think about the implementation details but something like this:
> 
> ```
> group1.async([&]() {
>   auto f = group2.async([&]() {
>     return 1;
>   });
>   f.wait(); // May deadlock
> })
> ```
> 
> but:
> 
> 
> ```
> group1.async([&]() {
>   auto f = group2.async([&]() {
>     return 1;
>   });
>   group2.wait();  // yield current thread to run tasks from group2 if needed
>   f.wait(); // May not deadlock
> })
> ```
> 
> Not that I have a great suggestion on how to express the system and its invariant concisely, but this isn't easy to figure all of this out without reading carefully the implementation right now.
> (and to be fair, even the "simple" pre-existing model isn't carefully documented in the API)
> 
I think the concept is still fairly simple, as long as one doesn't forget that waiting for oneself will deadlock. But I have added some comments saying this explicitly, I hope that's enough.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D123225/new/

https://reviews.llvm.org/D123225