[PATCH] D148728: [Support][Parallel] Add sequential mode to TaskGroup::spawn().

Alexey Lapshin via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Apr 24 08:32:29 PDT 2023


avl added a comment.

In D148728#4292033 <https://reviews.llvm.org/D148728#4292033>, @andrewng wrote:

> I think it would be worthwhile adding a test that shows that sequential tasks are executed in the order of submission.

It looks like `TaskGroupSequentialFor` already does this?
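For reference, a minimal sketch of such an ordering check (hypothetical code, modeled on the existing unit tests in `ParallelTest.cpp` and assuming the `Sequential` flag that this patch adds to `spawn()`):

```
#include "llvm/Support/Parallel.h"
#include <vector>

// Sequential tasks should observe submission order; no locking around
// Results is needed because sequential tasks must never overlap.
static bool sequentialTasksRunInSubmissionOrder() {
  std::vector<int> Results;
  {
    llvm::parallel::TaskGroup TG;
    for (int I = 0; I < 100; ++I)
      TG.spawn([I, &Results]() { Results.push_back(I); },
               /*Sequential=*/true);
  } // The TaskGroup destructor waits for all spawned tasks.
  if (Results.size() != 100)
    return false;
  for (int I = 0; I < 100; ++I)
    if (Results[I] != I)
      return false;
  return true;
}
```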



================
Comment at: llvm/lib/Support/Parallel.cpp:16
 #include <future>
-#include <stack>
+#include <queue>
 #include <thread>
----------------
andrewng wrote:
> avl wrote:
> > avl wrote:
> > > andrewng wrote:
> > > > Unfortunately, at least in my testing on Windows, this change causes a drop in performance of around 3-4%. I believe that's why `std::stack` is used.
> > > 
> > > will check it.
> > > Unfortunately, at least in my testing on Windows, this change causes a drop in performance of around 3-4%. I believe that's why `std::stack` is used.
> > 
> > Yes, there is a small performance degradation when `std::queue` is used instead of `std::stack`,
> > but it only becomes noticeable when a very large number of tasks is created.
> > For 100000000 tasks, `std::queue` is about 8 seconds slower:
> > 
> > ```
> > llvm::parallel::TaskGroup tg;
> > for (size_t Idx = 0; Idx < 100000000; Idx++)
> >   tg.spawn([]() {});
> > 
> >                          Time(sec)  Memory(kB)
> > queue TaskGroup            121.28      441568
> > stack TaskGroup            113.61      471820
> > ```
> > 
> > Generally, it is a bad idea to create that many tasks.
> > `parallelFor` has an optimization that avoids creating a large number of tasks.
> > There is no performance difference if `parallelFor` is used:
> > 
> > ```
> > parallelFor(0, 100000000, [](size_t) {});
> > 
> >                          Time(sec)  Memory(kB)
> > queue parallelFor            0.03        9128
> > stack parallelFor            0.03        9128
> > ```
> > 
> > So using `std::queue` is probably OK.
> > 
> > Or do you have a test case where using a queue is still a bad idea?
> 
> I used my test case from D147493, which is a UE4-based link, using an LLVM 16 `libc++` self-hosted release build with the `rpmalloc` allocator on Windows 10 (AMD Ryzen 3900X 12C/24T, Samsung 970 EVO NVMe SSD, 64GB RAM). This link shows the 3-4% drop in performance, which I think is significant enough to justify consideration.
> 
> One solution would be to use `std::deque` and add "work" to the front or back dependent on `Sequential`, e.g.
> ```
> void add(std::function<void()> F, bool Sequential = false) override {
>   {
>     std::lock_guard<std::mutex> Lock(Mutex);
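>     // Sequential tasks go to the front of their own deque; parallel tasks
>     // go to the back of theirs. Workers always pop from the back, so
>     // sequential tasks come out FIFO and parallel tasks stay LIFO.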
>     if (Sequential)
>       WorkQueueSequential.emplace_front(std::move(F));
>     else
>       WorkQueue.emplace_back(std::move(F));
>   }
>   Cond.notify_one();
> }
> ```
> 
> ```
> void work(ThreadPoolStrategy S, unsigned ThreadID) {
>   ...
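>     // Popping from the back yields submission order for the sequential
>     // deque (tasks were pushed to the front) and LIFO for the parallel one.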
>     auto &Queue = Sequential ? WorkQueueSequential : WorkQueue;
>     auto Task = std::move(Queue.back());
>     Queue.pop_back();
>   ...
> }
> ```
> This change seems to restore the performance of my test link. What do you think?
> 
> TBH, I think we need to be a little bit careful with any further development of these `parallel` methods. IIRC, they were added to support basic and simple parallelism in LLD. I don't think the intention was ever for them to be general parallel algorithms for all general use cases, e.g. nested parallel for. I think my main concern is that making them more general will likely add threading overhead which in turn may impact the effectiveness and performance in LLD.
> 
This is OK for me, will implement it.
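
For the record, here is a fleshed-out sketch of how I understand that suggestion, including the `SequentialQueueIsLocked` handshake discussed below (names follow the snippets above; the actual patch may differ):

```
void work(ThreadPoolStrategy S, unsigned ThreadID) {
  while (true) {
    std::unique_lock<std::mutex> Lock(Mutex);
    // Wake up when stopping, when parallel work is available, or when
    // sequential work is available and no thread is already running a
    // sequential task.
    Cond.wait(Lock, [&] {
      return Stop || !WorkQueue.empty() ||
             (!WorkQueueSequential.empty() && !SequentialQueueIsLocked);
    });
    if (Stop)
      break;
    bool Sequential =
        !WorkQueueSequential.empty() && !SequentialQueueIsLocked;
    if (Sequential)
      SequentialQueueIsLocked = true; // Block other sequential tasks.
    auto &Queue = Sequential ? WorkQueueSequential : WorkQueue;
    auto Task = std::move(Queue.back());
    Queue.pop_back();
    Lock.unlock();
    Task(); // Other threads may run *parallel* tasks concurrently here.
    if (Sequential) {
      std::lock_guard<std::mutex> Guard(Mutex);
      SequentialQueueIsLocked = false; // Allow the next sequential task.
      Cond.notify_all();
    }
  }
}
```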

By the way, since LLD is sensitive to the change from a stack to a queue, it might be beneficial to replace some of its for loops with `parallelFor` at some point, e.g.:


```
for (InputSectionBase *s : f->getSections()) {
  ...
}

for (Partition &part : partitions) {
  ...
}
```
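
For example, the first loop might become something like the following (a hypothetical sketch; this is only valid if the per-section work is actually independent across iterations, which would need to be verified):

```
// Hypothetical: assumes no cross-iteration dependencies.
ArrayRef<InputSectionBase *> Sections = f->getSections();
parallelFor(0, Sections.size(), [&](size_t I) {
  InputSectionBase *S = Sections[I];
  // ... per-section work on S ...
});
```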
> TBH, I think we need to be a little bit careful with any further development of these `parallel` methods. IIRC, they were added to support basic and simple parallelism in LLD. I don't think the intention was ever for them to be general parallel algorithms for all general use cases, e.g. nested parallel for. I think my main concern is that making them more general will likely add threading overhead which in turn may impact the effectiveness and performance in LLD.
> 

Yep. My intention is not to support nested parallel for in the `parallel` methods themselves, but to allow these methods to be used together with other solutions. For example, a nested parallel for can be built from the existing `ThreadPool` and the basic `parallel` methods:


```
// Each ThreadPool task runs its own parallelFor over the passes.
std::function<void()> Fn = [&]() {
  parallelFor(0, Passes.size(), [&](size_t I) {
    // ... run pass I ...
  });
};

ThreadPool Pool;
for (size_t I = 0; I < Files.size(); ++I)
  Pool.async(Fn);
Pool.wait();
```

In that case, the `parallel` methods just work as simple parallel methods.


================
Comment at: llvm/lib/Support/Parallel.cpp:133
+        Lock.unlock();
+        Task();
+        SequentialQueueIsLocked = false;
----------------
andrewng wrote:
> avl wrote:
> > avl wrote:
> > > avl wrote:
> > > > andrewng wrote:
> > > > > MaskRay wrote:
> > > > > > After `Lock.unlock()`, another thread may grab the lock and run a task concurrently.
> > > > > Yes, this can happen but I believe the key point is that it can only be a thread with a different "index". However, one subtlety is that "sequential" tasks will run on different threads and thus use a different "index".
> > > After `Lock.unlock()`, `SequentialQueueIsLocked` is still true, so another thread cannot start a new sequential task until `SequentialQueueIsLocked` becomes false, which only happens when the running sequential task finishes. Thus two sequential tasks can never execute in parallel.
> > > Yes, this can happen but I believe the key point is that it can only be a thread with a different "index". However, one subtlety is that "sequential" tasks will run on different threads and thus use a different "index".
> > 
> > Yes, "sequential" tasks may be run on different threads and use a different "index". I supposed that this is not a problem.
> > Yes, "sequential" tasks may be run on different threads and use a different "index". I supposed that this is not a problem.
> 
> I don't think it is a problem as far as correctness goes, but it might be somewhat unexpected.
Will document that behavior.
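
A possible wording for that documentation (just a suggestion, not the actual patch text):

```
/// Spawn a task. Tasks spawned with Sequential == true are guaranteed to
/// execute one at a time, in submission order, relative to each other.
/// Note that consecutive sequential tasks may still run on different
/// threads, so they may observe different thread indices.
void spawn(std::function<void()> F, bool Sequential = false);
```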


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148728/new/

https://reviews.llvm.org/D148728


