[PATCH] D36607: [Support/Parallel] - Do not spawn thread for single/last task.

Wed Aug 16 04:24:27 PDT 2017

davide added a comment.

In https://reviews.llvm.org/D36607#843163, @grimar wrote:

> In https://reviews.llvm.org/D36607#840916, @ruiu wrote:
>
> > I still wonder if a communication between threads can make something that slow. Where is the exact location where you observe the slowdown? Until we understand the exact reason, we cannot exclude the possibility that this change is hiding a real issue.
>
>
> It is ThreadPoolExecutor::work().
>  (https://github.com/llvm-mirror/llvm/blob/master/lib/Support/Parallel.cpp#L102)
>
> Currently it has the following implementation:
>
>   void work() {
>     while (true) {
>       std::unique_lock<std::mutex> Lock(Mutex);
>       Cond.wait(Lock, [&] { return Stop || !WorkStack.empty(); });
>       if (Stop)
>         break;
>       auto Task = WorkStack.top();
>       WorkStack.pop();
>       Lock.unlock();
>       Task();
>     }
>     Done.dec();
>   }
>   
>
> If I modify it slightly to avoid waiting on the condition variable (and use busy-waiting instead), then it works much faster for me:
>
>   void work() {
>     while (true) {
>       std::unique_lock<std::mutex> Lock(Mutex);
>       if (!WorkStack.empty()) {
>         auto Task = WorkStack.top();
>         WorkStack.pop();
>         Lock.unlock();
>         Task();
>       }
>       if (Stop)
>         break;
>     }
>     Done.dec();
>   }
>
>
> Which makes me think that the sync overhead is significant in this case.

Adaptive mutexes (e.g. for pthread_*) work on this principle. The kernel maintains a shared page, and threads contending on the lock spin for as long as the lock owner is still running, going to sleep only once the owner is descheduled.
This works because, in general, the cost of a context switch outweighs that of wasting a few cycles spinning (under the assumption that the critical section is small).
An approximation of this scheme would be to spin for a certain number of cycles, decrementing a variable each time, and to go to sleep when the variable reaches zero.
This is, FWIW, what FreeBSD pthread mutexes do.
A different (and possibly more effective) proposal for `Parallel` would be to explore work-stealing algorithms for queueing.
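
For reference, a rough sketch of what work-stealing queueing means: each worker owns a deque, takes its own newest task from the back, and, when it runs dry, steals the oldest task from the front of another worker's deque. The class `WorkStealingQueues` and its methods are hypothetical, and one mutex per deque is used for simplicity where production schedulers typically use lock-free deques.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch: none of these names exist in llvm/Support/Parallel.
class WorkStealingQueues {
public:
  using Task = std::function<void()>;

  explicit WorkStealingQueues(std::size_t WorkerCount) : Queues(WorkerCount) {}

  // A worker pushes new work onto the back of its own deque.
  void push(std::size_t Worker, Task T) {
    PerWorker &Q = Queues[Worker];
    std::lock_guard<std::mutex> Lock(Q.Mutex);
    Q.Tasks.push_back(std::move(T));
  }

  // Fetches work for `Worker`: its own newest task if it has one,
  // otherwise the oldest task of the first non-empty victim.
  bool take(std::size_t Worker, Task &Out) {
    if (popBack(Worker, Out))
      return true;
    for (std::size_t I = 1; I < Queues.size(); ++I)
      if (popFront((Worker + I) % Queues.size(), Out))
        return true;
    return false;
  }

private:
  struct PerWorker {
    std::mutex Mutex;
    std::deque<Task> Tasks;
  };

  // Owner side: LIFO, which tends to keep recently created work local.
  bool popBack(std::size_t Worker, Task &Out) {
    PerWorker &Q = Queues[Worker];
    std::lock_guard<std::mutex> Lock(Q.Mutex);
    if (Q.Tasks.empty())
      return false;
    Out = std::move(Q.Tasks.back());
    Q.Tasks.pop_back();
    return true;
  }

  // Thief side: FIFO, taken from the opposite end to the owner.
  bool popFront(std::size_t Victim, Task &Out) {
    PerWorker &Q = Queues[Victim];
    std::lock_guard<std::mutex> Lock(Q.Mutex);
    if (Q.Tasks.empty())
      return false;
    Out = std::move(Q.Tasks.front());
    Q.Tasks.pop_front();
    return true;
  }

  std::vector<PerWorker> Queues;
};
```

In a scheme like this, workers only touch a shared lock when they steal, so the single Mutex/Cond pair stops being the point every thread contends on. Whether that helps for this particular workload would need to be measured.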

https://reviews.llvm.org/D36607