[Openmp-commits] [PATCH] D104418: [OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement

Wed Jun 30 14:28:31 PDT 2021

gValarini added a comment.

In D104418#2850619 <https://reviews.llvm.org/D104418#2850619>, @tianshilei1992 wrote:

> The case you mentioned is not a problem. Say we have two queues for the two target regions. So the contents of each queue is:
> Q1: data movement of A, notification of event E1, wait on E2, kernel issue.
> Q2: data movement of B, notification of event E2, wait on E1, kernel issue.
> You can see the notification of the two events always before the wait. No matter what execution orders they will be, it will not dead lock.

Uhm, I think I misunderstood how the events would interact with the queues. But now I have a doubt if `__tgt_rtl_wait_event` should be allowed to be blocking. Let me explain with my first question at #2850555 <https://reviews.llvm.org/D104418#2850555>.

> In D104418#2850555 <https://reviews.llvm.org/D104418#2850555>, @gValarini wrote:
>
>> 1. Can a plugin implementation defer any command execution until the agnostic layer calls the device synchronization function (`__tgt_rtl_synchronize`)? **If yes, then the above problem can occur since no synchronization call is done.**
>
> I'm not sure I follow. Plugin will never defer any operation. If the target supports async operations, the plugin just issues the operation, and that's it. As for whether the device will defer or not, it's plugin transparent.

Some of the async APIs that I am most used to (like the C++ async library and the non-blocking MPI functions) have the restriction that one should never consider any computation/communication to be done (or even in-flight) until their respective synchronization functions are properly called. This is needed considering that a functional (but not optimized) implementation of those async APIs would be to defer their execution until such synchronization point. If I regard it correctly, this restriction is even present in the CUDA streams specification, but please don't quote me on that since I am not a CUDA expert.

If libomptarget follows such principles, `__tgt_rtl_wait_event` should not be allowed to be blocking, since that would imply that the "notification of event E#" that you described could be assumed to be completed before `__tgt_rtl_synchronize` is called, which would be a little bit strange considering the async API. This behavior was what even led me to my second question at #2850555 <https://reviews.llvm.org/D104418#2850555> regarding possibles indirect synchronizations done by `__tgt_rtl_wait_event`.

> In D104418#2850555 <https://reviews.llvm.org/D104418#2850555>, @gValarini wrote:
>
>> 2. Should a plugin implementation indirectly synchronize the queue where an event was created when a call to `__tgt_rtl_wait_event` is made? **If yes, then the problem above won't happen, but this should be clarified in the documentation of the API.**
>
> Creating an event just needs to "take a snapshot" of current queue. It doesn't need to synchronize. We just need to make sure that:
>
> - The event will be notified only if all operations in the "snapshot" are finished, or say all operations before it are finished.
> - If any operation depends on the event, it will not be executed until the event is fulfilled/notified.

Do these async principles make sense for libomptarget? What about restricting `__tgt_rtl_wait_event` to be only non-blocking and thus always be inserted in the async queue?

By the way, thanks for the clarifications, the event API is much clear to me now.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D104418/new/

https://reviews.llvm.org/D104418