[Openmp-commits] [PATCH] D107656: [OpenMP] Use events and taskyield in target nowait task to unblock host threads

Ye Luo via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Sun Aug 8 03:07:26 PDT 2021


ye-luo added a comment.

In D107656#2933103 <https://reviews.llvm.org/D107656#2933103>, @protze.joachim wrote:

> In D107656#2932956 <https://reviews.llvm.org/D107656#2932956>, @ye-luo wrote:
>
>> In D107656#2932908 <https://reviews.llvm.org/D107656#2932908>, @protze.joachim wrote:
>>
>>> Regarding the weird nature of taskyield I refer to https://link.springer.com/chapter/10.1007%2F978-3-319-98521-3_1
>>> Not everything in the paper is applicable to your situation. The most dangerous point I see here is that taskyield, if not used with care, will effectively build a recursive call stack, so that a task that called taskyield can only proceed once all recursively called tasks have finished.
>>
>> I don't have access to the paper, but I do understand the "recursive call stack" case. It can cause performance issues, and it appears to be a property of the taskyield implementation in LLVM libomp. So this is real.
>
> Lmgtfy: http://montblanc-project.eu/wp-content/uploads/2018/10/The-impact-of-taskyield-on.pdf

Thanks. It is consistent with my understanding of the "stack" implementation of taskyield.

>> I think there is another issue: when there is no available task in the queue at the point of yield, the target task still blocks.
>
> In such a case, you introduce busy waiting by polling on taskyield as long as the target is not ready. Since the hidden tasks are pinned to the same cores as the application threads, this will impact the performance of host threads. (Reject reason one)

1. Using hidden tasks or regular tasks is largely orthogonal to what we discussed here; neither is a "must" for efficient target nowait.
2. The current implementation, which calls cuStreamSynchronize, already blocks the application thread. My implementation avoids that blocking.

>> In short, this implementation has limitations. However, they are not a big concern to me, as my use pattern doesn't suffer much from these issues.
>
> Please add a mockup of your use pattern as a test case, so that we can review and understand it.
> IMHO, an implementation where significant drawbacks can be expected should not go into mainline libomptarget just for performance experiments.

I need it for production. The current "target nowait" has not worked as expected.

>> I also agree that detached tasks have advantages. The details can only be sorted out once the implementation is done.
>> For example, in my understanding, the target task needs to be broken into parts: the initial part can be turned into a detached task, and the finalization part needs to be a separate task that depends on the detached task. Some API changes are also needed to pass the event token between libomp and libomptarget.
>> So this is quite involved; a dedicated person needs to work on it, and it will take time.
>
> I'm not sure what you mean by finalization. The only case where I think a target task might need to be split into pieces is mapping data from the device (I am not sure whether the internal signalling model allows initiating the memory movement right after kernel offloading).
> If such splitting were needed, we could limit the initial detach implementation to support only target regions without mapping at the end of the region. The application can always meet this requirement by splitting the mapping into separate directives:
>
>   #pragma omp target enter data map(to:A) depend(inout:A) nowait
>   #pragma omp target depend(inout:A) nowait
>   #pragma omp target exit data map(from:A) depend(inout:A) nowait

First, you need to decrease the reference count and free the device memory if the count reaches 0 after all the asynchronous operations complete. If the design can take care of that, it is better than asking users to do extra work.
Second, splitting in the way you suggested requires dependency resolution on the host, at least right now, and the added latency is a huge loss in performance.

>> Right now my implementation using taskyield seems to need very limited changes, and people can opt in to see whether any performance can be gained.
>> As long as it doesn't contain functional bugs like deadlocks, I'd like to take advantage of this feature and move my application forward to prove that OpenMP asynchronous offload works in a real application.
>> My main job is on the application side, and I have been in a panic for years because of the lack of decent "target nowait" support in LLVM, so getting things moving is crucial.
>
> This seems like a change that addresses a very limited use case without explaining what the pattern of that use case actually is. We should discuss this in one of the upcoming calls.

The test code in the description is a distilled version of the app. I have slides, and we can discuss them.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D107656/new/

https://reviews.llvm.org/D107656


