[Openmp-commits] [PATCH] D132005: Add non-blocking support for target nowait regions

Wed Aug 17 20:40:56 PDT 2022

ye-luo added a comment.

In D132005#3730910 <https://reviews.llvm.org/D132005#3730910>, @gValarini wrote:

> In D132005#3730450 <https://reviews.llvm.org/D132005#3730450>, @ye-luo wrote:
>
>> Right now the synchronization is based on streams. Have you thought about synchronizing with a CUDA event and returning the stream to the pool early?
>
> I have not thought about that at the moment, but that could be a nice optimization. Since the CUDA plugin currently maintains a resizable pool of streams for each device with an initial size of 32, I thought that for a first implementation this could be enough.
>
> CUDA events have the same API as streams for non-blocking synchronization using cudaEventQuery <https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html#group__CUDART__EVENT_1g2bf738909b4a059023537eaa29d8a5b7>, so we could store a single event (`completionEvent`) per `AsyncInfo` and use that when synchronizing with `SyncType::NON_BLOCKING`. I have one question though: does querying a CUDA event for completion synchronize all the operations prior to the event on the stream? Or must another thread on the host synchronize the stream? If synchronizing on the events alone is enough, using them would be much simpler.

On second thought, let us do stream sync for now.
NVIDIA: if we sync with an event and return the stream to the pool early, two tasks may get serialized if they end up on the same stream.
AMD: at the HSA level, there are only signals (events).
Level Zero: it depends on which type of command list is being used.
So it seems libomptarget should be flexible and let the plugin decide which mechanism to use.
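
To make the last point concrete, here is a rough sketch of what "let the plugin decide" could look like. Every name in it (`Plugin`, `FakeQueue`, `StreamSyncPlugin`, `SignalSyncPlugin`, `queryCompletion`) is made up for illustration and is not the libomptarget interface; only `AsyncInfo` and `SyncType::NON_BLOCKING` are borrowed from the discussion above. Device work is faked with a counter so the snippet builds without any vendor runtime.

```cpp
#include <cassert>

// Hypothetical sketch: none of these types mirror libomptarget's real API.
enum class SyncType { BLOCKING, NON_BLOCKING };

// Stand-in for queued device work: it "finishes" after a fixed number of polls.
struct FakeQueue {
  int PollsLeft;
  bool poll() {
    if (PollsLeft > 0)
      --PollsLeft;
    return PollsLeft == 0;
  }
};

// Simplified stand-in for the per-task async handle.
struct AsyncInfo {
  FakeQueue Work;
};

// The generic layer only talks to this interface; the mechanism is hidden.
struct Plugin {
  virtual ~Plugin() = default;
  virtual const char *mechanism() const = 0;
  // Non-blocking completion check; each plugin picks how to implement it.
  virtual bool queryCompletion(AsyncInfo &AI) = 0;

  // Returns true once the work attached to AI has completed.
  bool synchronize(AsyncInfo &AI, SyncType Type) {
    if (Type == SyncType::NON_BLOCKING)
      return queryCompletion(AI); // single poll, may return false
    while (!queryCompletion(AI)) {
      // Blocking mode: keep polling until the work is done.
    }
    return true;
  }
};

// Models a backend that polls the stream and holds it until the task is done.
struct StreamSyncPlugin : Plugin {
  int StreamsInUse = 1;
  const char *mechanism() const override { return "stream"; }
  bool queryCompletion(AsyncInfo &AI) override {
    bool Done = AI.Work.poll();
    if (Done && StreamsInUse > 0)
      --StreamsInUse; // stream goes back to the pool only after completion
    return Done;
  }
};

// Models a backend where a signal (event) is the only completion primitive.
struct SignalSyncPlugin : Plugin {
  const char *mechanism() const override { return "signal"; }
  bool queryCompletion(AsyncInfo &AI) override { return AI.Work.poll(); }
};
```

The caller would only ever invoke `synchronize(AI, SyncType::NON_BLOCKING)` and re-enqueue the task when it returns false, so switching a backend from stream-based to event-based synchronization later would not touch the generic code.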

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D132005/new/

https://reviews.llvm.org/D132005