[Openmp-commits] [PATCH] D104418: [OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement

Shilei Tian via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Wed Jun 30 19:58:23 PDT 2021


tianshilei1992 added a comment.

In D104418#2851339 <https://reviews.llvm.org/D104418#2851339>, @gValarini wrote:

> In D104418#2850619 <https://reviews.llvm.org/D104418#2850619>, @tianshilei1992 wrote:
>
>> The case you mentioned is not a problem. Say we have two queues for the two target regions. The contents of each queue are:
>> Q1: data movement of A, notification of event E1, wait on E2, kernel issue.
>> Q2: data movement of B, notification of event E2, wait on E1, kernel issue.
>> You can see that the notification of each event always comes before the corresponding wait. No matter what the execution order is, it will not deadlock.
>
> Uhm, I think I misunderstood how the events would interact with the queues. But now I have a doubt about whether `__tgt_rtl_wait_event` should be allowed to be blocking. Let me explain using my first question at #2850555 <https://reviews.llvm.org/D104418#2850555>.
>
>> In D104418#2850555 <https://reviews.llvm.org/D104418#2850555>, @gValarini wrote:
>>
>>> 1. Can a plugin implementation defer any command execution until the agnostic layer calls the device synchronization function (`__tgt_rtl_synchronize`)? **If yes, then the above problem can occur since no synchronization call is done.**
>>
>> I'm not sure I follow. The plugin will never defer any operation. If the target supports async operations, the plugin just issues the operation, and that's it. Whether the device itself defers the operation or not is transparent to the plugin.
>
> Some of the async APIs that I am most used to (like the C++ async library and the non-blocking MPI functions) have the restriction that one should never consider any computation/communication to be done (or even in flight) until the respective synchronization functions are properly called. This is needed because a functional (but not optimized) implementation of those async APIs could defer their execution until such a synchronization point. If I remember correctly, this restriction is even present in the CUDA streams specification, but please don't quote me on that since I am not a CUDA expert.
>
> If libomptarget follows such principles, `__tgt_rtl_wait_event` should not be allowed to be blocking, since that would imply that the "notification of event E#" you described could be assumed to be completed before `__tgt_rtl_synchronize` is called, which would be a little strange for an async API. This behavior is what led me to my second question at #2850555 <https://reviews.llvm.org/D104418#2850555> regarding possible indirect synchronizations done by `__tgt_rtl_wait_event`.
>
>> In D104418#2850555 <https://reviews.llvm.org/D104418#2850555>, @gValarini wrote:
>>
>>> 2. Should a plugin implementation indirectly synchronize the queue where an event was created when a call to `__tgt_rtl_wait_event` is made? **If yes, then the problem above won't happen, but this should be clarified in the documentation of the API.**
>>
>> Creating an event just needs to "take a snapshot" of the current queue. It doesn't need to synchronize. We just need to make sure that:
>>
>> - The event will be notified only when all operations in the "snapshot" are finished, i.e., all operations enqueued before it are finished.
>> - If any operation depends on the event, it will not be executed until the event is fulfilled/notified.
>
> Do these async principles make sense for libomptarget? What about restricting `__tgt_rtl_wait_event` to be only non-blocking and thus always be inserted in the async queue?

Actually, the function naming here is kind of confusing, although I inherited the CUDA terminology. One of the most confusing things is that "wait for an event" doesn't really mean "waiting", at least from the host's perspective. Let me explain so it becomes much clearer.

For CUDA, event wait and event synchronize are different. If you look at the CUDA documentation (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g7840e3984799941a61839de40413d1d9), the signature of event wait is `cudaStreamWaitEvent ( cudaStream_t stream, cudaEvent_t event, unsigned int flags )`. It basically inserts the event into the stream and then returns, so it is non-blocking. The GPU scheduler makes sure that all operations enqueued to the stream afterwards will not be executed until the event is fulfilled. Event synchronization is different. As shown here (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html#group__CUDART__EVENT_1g949aa42b30ae9e622f6ba0787129ff22), `cudaEventSynchronize` only accepts the event, so it simply blocks host execution until the event is fulfilled.
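
To make the difference concrete, here is a minimal host-side sketch in C++ against the CUDA runtime API, using the two-queue scenario from earlier in this thread. This is not libomptarget or plugin code; the names (`twoQueueExample`, `Q1`, `Q2`, `E1`, `E2`, the buffer arguments) are made up for illustration.

  // Sketch: non-blocking cudaStreamWaitEvent vs. blocking cudaEventSynchronize.
  #include <cuda_runtime.h>
  #include <cstddef>

  void twoQueueExample(void *DstA, void *SrcA, void *DstB, void *SrcB,
                       size_t Size) {
    cudaStream_t Q1, Q2; // the two queues for the two target regions
    cudaStreamCreate(&Q1);
    cudaStreamCreate(&Q2);

    cudaEvent_t E1, E2;
    cudaEventCreate(&E1);
    cudaEventCreate(&E2);

    // Q1: data movement of A, then notification of E1.
    cudaMemcpyAsync(DstA, SrcA, Size, cudaMemcpyHostToDevice, Q1);
    cudaEventRecord(E1, Q1);

    // Q2: data movement of B, then notification of E2.
    cudaMemcpyAsync(DstB, SrcB, Size, cudaMemcpyHostToDevice, Q2);
    cudaEventRecord(E2, Q2);

    // Non-blocking: both calls only enqueue the dependency and return
    // immediately; the host thread is never blocked here.
    cudaStreamWaitEvent(Q1, E2, 0); // later work on Q1 waits for E2
    cudaStreamWaitEvent(Q2, E1, 0); // later work on Q2 waits for E1

    // The kernels for the two target regions would be launched here.

    // Blocking: the host thread does not continue until each event is fulfilled.
    cudaEventSynchronize(E1);
    cudaEventSynchronize(E2);

    cudaEventDestroy(E1);
    cudaEventDestroy(E2);
    cudaStreamDestroy(Q1);
    cudaStreamDestroy(Q2);
  }

Note that the two `cudaStreamWaitEvent` calls return immediately even though each stream ends up waiting on the other's event, which is why the cross-wait pattern described above cannot block or deadlock the host thread.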

So whether `__tgt_rtl_wait_event` is blocking or non-blocking is implementation dependent. For example, a target that doesn't support the whole event mechanism we mentioned before, or that doesn't support async operations at all, can implement the event wait as a blocking operation in the following way (see the sketch after this list):

- An event is an atomic integer, initialized to 0;
- When the event is fulfilled (or notified), the integer is simply set to 1;
- Waiting for an event is simply a loop that spins until the integer becomes 1.
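
For illustration, here is a minimal sketch of that fallback in plain C++. Nothing here comes from an existing plugin; `SimpleEvent`, `notifyEvent`, and `waitEvent` are hypothetical names.

  #include <atomic>

  // Hypothetical event type: an atomic integer initialized to 0.
  struct SimpleEvent {
    std::atomic<int> Fulfilled{0}; // 0 = pending, 1 = fulfilled
  };

  // Called once every operation recorded before the event has finished.
  void notifyEvent(SimpleEvent &E) {
    E.Fulfilled.store(1, std::memory_order_release);
  }

  // Blocking wait: spin until the event is fulfilled.
  void waitEvent(SimpleEvent &E) {
    while (E.Fulfilled.load(std::memory_order_acquire) != 1) {
      // Busy-wait; a real implementation would likely yield or back off here.
    }
  }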


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D104418/new/

https://reviews.llvm.org/D104418


