[Openmp-commits] [PATCH] D107656: [OpenMP] Use events and taskyield in target nowait task to unblock host threads

Joachim Protze via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Sun Aug 8 09:35:02 PDT 2021

protze.joachim added a comment.

In D107656#2933162 <https://reviews.llvm.org/D107656#2933162>, @ye-luo wrote:

>>> I think there is another issue, when there is no available task in the queue at the point of yield. The target task will still behave blocking.
>> In such case, you introduce busy waiting by polling on taskyield as long as target is not ready. Since the hidden tasks are pinned to the same cores as application threads, this will impact the performance of host threads. (Reject reason one)
> 1. using hidden task or regular task is largely orthogonal to what we discussed here. Using hidden tasks is not a "must" for having efficient target nowait.
> 2. the current implementation calling cuStreamSynchronize already blocks the application thread. My implementation allows not being blocked.

Blocking a thread does not mean "keeping the thread busy and eating all the core's cycles".

>>> In short, this implementation has limitations. However, it is not a big concern to me as my use pattern doesn't suffer much from these issues.
>> Please add a mockup of your use pattern as a test case, so that we can review and understand your use pattern.
>> IMHO, an implementation, where significant drawbacks can be expected should not go into mainline libomptarget just for experimenting with the performance.
> I need it for production. The current "target nowait" has not been workable as expected.

I completely agree with the statement that "target nowait" is not implemented in libomptarget. I just disagree with the way you suggest fixing the implementation.

>>> I also agree that detached tasks has advantages. Details needs can only be sorted out when the implementation is done.
>>> For example, in my understand, the target task needs to be broken into parts. The initial parts can be turned into detached tasks. The finalization parts needs to be a separate task depends on the detached task. Also some API changes is needed to pass the event token between libomp and libomptarget.
>>> So this is quite involving and some dedicated person need to work on this and it needs time.
>> I'm not sure what you mean with finalization. The only case, where I think a target task might need to get split into pieces is for mapping data from the device (not sure whether the internal signalling model allows to initiate the memory movement just after kernel offloading).
>> If such splitting would be needed, we could limit the initial detach implementation to only support target regions without mapping at the end of the region. The application can always accomplish this requirement by splitting the mapping into separate directives:
>>   #pragma omp target enter data map(to:A) depend(inout:A) nowait
>>   #pragma omp target depend(inout:A) nowait
>>   #pragma omp target exit data map(from:A) depend(inout:A) nowait
> you need to decrease refcount and free memory if the count is 0 after the completion of all the asynchronous operations. If you can take care of that in the design, it is better to avoid asking extra work from users.
> Second. splitting in the way you suggested requires dependency resolution on the host at least right now. The added latency is a huge loss in performance.

I didn't suggest that this should be a permanent solution; it might be an intermediate step until splitting the task into parts is implemented.
I think @tianshilei1992 already has the code in place to handle these dependencies on the device.
Also, for the code example you posted, there will be no freeing of data at the end of the target region.

By mapping all data to the device, you assume that all data fits on the device at the same time. If you removed your enter/exit data directives for (de)allocation and relied on the target region's mapping to move the data, you would still not be able to process larger chunks of data: because of the stacking nature of taskyield, no data will be moved off the device before all target regions have finished.

>>> Right now my implementation using taskyield seems need very limited change and people can choose to opt-in to see if some performance can be gained.
>>> As long as it doesn't contain functionality bugs like deadlock, I'd like to take advantage of this feature and move my application forward to prove OpenMP asynchronous offload works in real application.
>>> My main job is on application and I had panic for years because of no decent "target nowait" support in LLVM. So get things moving is quite crucial.
>> This seems like a change to address a very limited use case without explaining what the pattern of the use case actually is. We should discuss this in one of the upcoming calls.
> The test code in the description is a distilled version of the app. I have slides and we can discuss them.

Thanks for pointing me to the link. The code convinced me even more that taskyield is not the right solution, even for your code example.

I also think that, without ignoring the task scheduling constraint, your code will only be able to schedule one task from the taskloop during your taskyield, and a nested taskyield cannot schedule a task:
When you reach the taskyield, you have a tied task from the taskloop scheduled in the barrier of the single region (or in the taskgroup of the task executing the taskloop).
The target task is untied and does not count, so you can schedule another task from the taskloop. Now, when you reach the taskyield in that task, you have a tied task scheduled in the outer taskyield, and none of the tasks from the taskloop can be scheduled.
You might mitigate this limitation by adding untied to the taskloop, but at the cost of untied tasks, and you would still run into the stack problem of taskyield.


