[Openmp-commits] [PATCH] D107656: [OpenMP] Use events and taskyield in target nowait task to unblock host threads

Ye Luo via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Sat Aug 7 12:53:48 PDT 2021

ye-luo added a comment.

In D107656#2932820 <https://reviews.llvm.org/D107656#2932820>, @protze.joachim wrote:

> Why do you want to use taskyield?

Right now, there is a performance issue: a target task blocks the host thread while waiting for the device to complete. I want the target task to be suspended after kernel launch so that the host thread can continue to make progress on other tasks. This patch makes that work well in my use cases.
On NVIDIA with the existing implementation, host threads spin in cuStreamSynchronize regardless of whether hidden helper tasks are used.
Such a synchronization call may be replaced with other, smarter schemes, but that doesn't change the fact that the target task blocks a thread, whether it is an OpenMP thread or a hidden helper thread.
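
As a rough illustration of the use case (not code from this patch), the pattern I care about looks like the sketch below. The function name `overlap_demo` and its parameters are hypothetical. The target region is deferred with `nowait`, and an independent host task is created right after it. Whether the host thread really gets to the host task before the device finishes depends on how the runtime waits for the device, which is the issue described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: doubles n values in a deferred target region while an
   independent host task sums a second array. Falls back to the host when no
   device is available; the pragmas are ignored without OpenMP support.
   Returns the host-side sum. */
long overlap_demo(double *dev_data, size_t n, const long *host_data, size_t m)
{
    long host_sum = 0;

    assert(dev_data != NULL && host_data != NULL);

    #pragma omp parallel
    #pragma omp single
    {
        /* Deferred target task: the encountering thread does not have to
           wait here for the device to finish. */
        #pragma omp target teams distribute parallel for nowait \
            map(tofrom: dev_data[0:n])
        for (size_t i = 0; i < n; ++i)
            dev_data[i] *= 2.0;

        /* Independent host work that could overlap with the device work. */
        #pragma omp task shared(host_sum)
        for (size_t j = 0; j < m; ++j)
            host_sum += host_data[j];

        /* Wait for both the target task and the host task. */
        #pragma omp taskwait
    }
    return host_sum;
}
```

If the thread executing the target task sits in a synchronization call until the kernel completes, that thread cannot pick up the host task in the meantime.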

> The semantics of taskyield are weird and not useful in so many cases.

Please elaborate on why they are weird. Are there any logic holes in my implementation?
I never claimed it is one method for all cases, and it is added only as an option.
If taskyield can be called inside a regular task, is there any reason not to allow it inside a target task?

> I think, it would make much more sense to adopt the notion of detached tasks instead and call omp_fulfill_event to complete the hidden helper task once the device is done.

That is an optimization to the hidden helper task. I'm happy to see it implemented. In my understanding, implementing the whole target task as a detached task doesn't resolve the issue of the task blocking a thread. You may rely on the OS to switch threads to gain something, since these are hidden helper threads. You may also suffer from thread over-subscription when regular OpenMP threads already occupy all the cores. There are many things that can be discussed on this topic, but I would like to take helper tasks out of my equation and put them aside.

IMO, an efficient implementation of "target nowait" seems to require breaking up its operation, and the break needs to happen after enqueuing kernels and transfers and before the remaining operations, such as decrementing reference counts and freeing memory.
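
A minimal sketch of that split, under loud assumptions: `fake_event`, `fake_event_done` and `split_target_task` are hypothetical names, and the device event is simulated by a poll counter so the sketch runs anywhere. It is not the code in this patch; it only shows where the yield point sits between the two phases.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device event; it reports completion after a
   fixed number of polls. */
typedef struct {
    int polls_left;
} fake_event;

static bool fake_event_done(fake_event *ev)
{
    if (ev->polls_left > 0) {
        --ev->polls_left;
        return false;
    }
    return true;
}

/* Returns how many times the task yielded before the event completed. */
int split_target_task(int polls_until_done)
{
    assert(polls_until_done >= 0);

    /* Phase 1 (simulated): enqueue transfers and kernels, record an event. */
    fake_event ev = { polls_until_done };
    int yields = 0;

    /* Instead of blocking in a synchronization call, poll the event and let
       the thread run other tasks in between. */
    while (!fake_event_done(&ev)) {
        #pragma omp taskyield
        ++yields;
    }

    /* Phase 2 (simulated): decrement reference counts, free device memory. */
    return yields;
}
```

The point of the design is that only phase 2 depends on device completion, so the thread is free to do other work between the two phases.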

I desperately need a working implementation of target nowait for my app. I have one now, and my work can be unblocked.
Hidden helper tasks present functionality issues for me, and I don't have any answer about their performance.
Please keep improving hidden helper tasks so that we can compare and gain a better understanding.
I would be happy with one scheme that fits all, but I don't think there is one right now, and that is why we are exploring several schemes.


