[Openmp-commits] [PATCH] D132005: [OpenMP] Add non-blocking support for target nowait regions

Tue Oct 18 09:14:31 PDT 2022

gValarini added a comment.

Added some more comments about the new execution flow and the thread id problem.

================
Comment at: openmp/libomptarget/include/device.h:441
+  /// succeeds/fails. Must be called multiple times until AsyncInfo is
+  /// completed and AsyncInfo.isDone() returns true.
+  int32_t queryAsync(AsyncInfoTy &AsyncInfo);
----------------
jdoerfert wrote:
> gValarini wrote:
> > jdoerfert wrote:
> > > Reading this I don't what this returns. SUCCESS if it completed and FAIL otherwise? Or FAIL only if something failed? Must be called multiple times is also unclear to me, I doubt that we should put that sentence here.
> > > 
> > > Update: I think I now understand the latter part but if I'm right we should change the interface.
> > > So, queryAsync is supposed to be called before isDone to make sure isDone returns the right value, correct? If so, we should not expose queryAsync to the user as there doesn't seem to be a reason to call it otherwise. Arguably, calling it doesn't provide information, just a state change, thus a secondary query is necessary.
> > I should probably update the documentation of this function to reflect the new one added to `omptargetplugin.h:__tgt_rtl_query_async`. That state the following:
> > 
> > > Queries for the completion of asynchronous operations. Instead of blocking the calling thread as __tgt_rtl_synchronize, the progress of the operations stored in AsyncInfo->Queue is queried in a non-blocking manner, partially advancing their execution. If all operations are completed, AsyncInfo->Queue is set to nullptr. If there are still pending operations, AsyncInfo->Queue is kept as a valid queue. In any case of success (i.e., successful query with/without completing all operations), return zero. Otherwise, return an error code.
> > > 
> > 
> > Thus, `queryAsync` (which calls `__tgt_rtl_query_async`), is a non-blocking version of `synchronize`. That means that we must call it multiple times until all operations are completed and the plugin invalidates the queue inside `AsyncInfo`. Here, `AsyncInfoTy::isDone` is just a helper function that indicates if the device side operations are completed or not based on said queue. We need to externalize `queryAsync` to its user `AsyncInfoTy` so it can call the non-blocking implementation.
> > 
> > Considering your comment, what do you think of making things more explicit by adding a flag pointer argument to `queryAsync` (and thus to __tgt_rtl_query_async) that returns true if all operations are completed and false otherwise?
> > 
> > ```
> >   int32_t queryAsync(AsyncInfoTy &AsyncInfo, bool &IsCompleted);
> > ```
> My point is: queryAsync is useless if not followed by an isDone, right?
> Why do we expose it in the first place? We should merge the two functions, potentially keeping isDone around, but at least avoiding the implicit dependence that you have to call one before the other for any meaningful use case. The updated interface basically does this. It merges the "isDone" query into this function, allowing users to call isDone standalone and this function standalone while getting meaningful results each time.
Uhm, I think I got your point. I'll update `AsyncInfoTy::isDone` so it can be called standalone without a prior call to `AsyncInfoTy::synchronize` (which calls the device `DeviceTy::queryAsync`). This indeed makes the interface better.

I was a little bit confused when you said `DeviceTy::queryAsync` should not be exposed, but now I got it.

================
Comment at: openmp/libomptarget/include/omptarget.h:217
   /// \returns OFFLOAD_FAIL or OFFLOAD_SUCCESS appropriately.
-  int synchronize();
+  int synchronize(SyncType SyncType = SyncType::BLOCKING);

----------------
gValarini wrote:
> gValarini wrote:
> > jdoerfert wrote:
> > > Nit: Rename argument to avoid shadowing.
> > > 
> > > Make the version that takes the sync type, and probably the sync type, private. IsDone can call the private version, users only the blocking one.
> > Thanks, I am renaming the type name to `SyncTypeTy` to reflect the other ones.
> > 
> > Regarding the second comment, I don´t quite understand what you mean with:
> > 
> > > IsDone can call the private version, users only the blocking one.
> > 
> > `isDone` only checks if the operations inside an `AsyncInfoTy` instance are completed or not, it does not call any plugin function at all. Are you suggesting that we move all the non-blocking synchronization code into `isDone`? If so, this means we would have some code duplication regarding the post-processing functions due to two separate synchronization paths, but if you think that is better I can do it.
> > 
> > Nit: Rename argument to avoid shadowing.
> > 
> > Make the version that takes the sync type, and probably the sync type, private. IsDone can call the private version, users only the blocking one.
> 
> 
Just a note, now I got the correct idea: we should make `isDone` a callable as a standalone function!

================
Comment at: openmp/libomptarget/src/private.h:227
+
+    delete AsyncInfo;
+    __kmpc_omp_set_target_async_handle(ExecThreadID, NULL);
----------------
jdoerfert wrote:
> gValarini wrote:
> > jdoerfert wrote:
> > > Should this be guarded by IsNew?
> > Nope, `IsNew` indicates that a new task-attached `AsyncInfo` has just been allocated, but not that we should deallocate it or not. The variable is primarily used to indicate that we must dispatch new operations to the new handle. Maybe I should rename it to just `ShouldDispatch`. Deallocation is always done when `AsyncInfo->isDone()` returns `true`, which is previously checked.
> I'm worried here that we might not delete the AsyncInfo or delete it multiple times. Are you saying there is exactly one TaskAsyncInfoTy that will own the AsyncInfor object at any given time? If not, how do we avoid double free?  
When executing a target nowait region, the actual owner of the `AsyncInfo` is the task itself. The structure is allocated when the task first executes and calls any of the libomptarget functions (`IsNew` is true) and it is deallocated when all the device-side operations are completed (`AsyncInfo::isDone` returns `true`).

Here, `TaskAsyncInfoTy` is just a wrapper around a task-owned `AsyncInfoTy` (stored inside the OpenMP task data) to mainly automate the allocation and deallocation logic. But following the OpenMP execution flow, since a task is owned and executed by only a single thread at any given time, only one `TaskAsyncInfoTy` will be managing the task-owned `AsyncInfoTy` object. This should avoid any double frees, but I understand this could be a weak assumption. If that is enough I could add documentation stating it, but probably having some code checks for that would be best. Maybe assertions at the task deallocation function ensuring no valid `AsyncInfoTy` address is left?

================
Comment at: openmp/libomptarget/src/private.h:240
+  /// synchronization is needed.
+  bool shouldDispatch() { return !IsNew; }
+};
----------------
jdoerfert wrote:
> gValarini wrote:
> > jdoerfert wrote:
> > > I don't understand what we want/need this dispatch idea for. It seems to skip operations but I don't understand how we would not forget about them and go back. 
> > Here is the main idea: 
> > 
> > # The first time a target nowait (target inside an OpenMP task) is executed, a new `AsyncInfo` handle is created and stored in the OpenMP taskdata structured of said task. Since this is the first time we are executing it (which is detected by the `IsNew` variable), we dispatch the device side operations and populate the post-processing function array by calling the proper `omptarget.cpp` functions (e.g., `targetDataBegin`, `targetDataEnd`, ...).
> > # Afterward, if the device operations are still pending, the OpenMP task is re-enqueued for execution.
> > # Later, when the task is re-executed, the same outline function called at step 1 will be called again. Here we can recover the `AsyncInfo` handle from the OpenMP taskdata and just synchronize it. Since this time the handle is not new, we know the operations were already dispatched previously and we should not dispatch them again.
> > 
> > I have a [[ https://docs.google.com/presentation/d/1yFaIanlR4JccYFhrKe_30k17LgjvyvC6DkOgc1Hum5c/edit?usp=sharing | presentation]] that explains it on slides 19-24, but I believe I am failing to describe that in the code. I'll try to come up with some better documentation for this dispatch/synchronize idea.
> Ok, that makes more sense. Now to help (even) me understand this, why do we need to call the functions from step 1 in step 3? We seem to use the "Dispatch" argument to skip most of what they do (the target data part of a targetDataXXX) anyway, no?
> 
That happens because the dispatch and synchronization logic are placed in the same interface function. The first call to that function done by a task dispatches the operations, while the subsequent calls try to do the non-blocking synchronization.

Maybe a better way of doing it would be to add a new interface function with the sole purpose of executing said synchronization. This way, when a task is re-executed, it calls this new function to only do the synchronization instead of the previous outline function. What do you think? This can better split the dispatch and synchronization code.

================
Comment at: openmp/runtime/src/kmp_tasking.cpp:5157
+  void *handle = NULL;
+  kmp_info_t *thread = __kmp_thread_from_gtid(gtid);
+  kmp_taskdata_t *taskdata = thread->th.th_current_task;
----------------
tianshilei1992 wrote:
> gValarini wrote:
> > gValarini wrote:
> > > tianshilei1992 wrote:
> > > > `libomptarget` (for now) doesn't require the thread must be an OpenMP thread. Using `libomp`'s gtid generally breaks. Either we add the requirement, which needs to be discussed further, or there it needs an alternative method to implement that. If `libomp` is executed on another fresh thread, a new root will be created.
> > > Uhm, I did not know about that. Although I think such a requirement makes sense, it may be out of the scope of this patch. 
> > > 
> > > What we could do is check if the current thread is registered inside `libomp` somehow, falling back to the current execution path that does not depend on the task team information. Do you know we can use `__kmpc_global_thread_num`'s return value to verify that? Maybe assert that returned `GTID` is valid and within a well-known range (e.g., [0, NUM_REGISTERED_OMP_THREADS]).
> > > 
> > > Just a note, NUM_REGISTERED_OMP_THREADS is not a valid variable. I just don't know where, or even if, such information is stored. Do you know where can I find this?
> > @tianshilei1992 any comments on this?
> IIRC some return values, except those real thread ids, are for specific purposes. That's one of the reason that I didn't use negative thread id for hidden helper thread. I don't know for sure if there is a value designed to say it is not an OpenMP managed thread. We can probably add one.
That would be perfect. I know we have `KMP_GTID_DNE` (value of `-2`) that represents the return value of a non-existent thread ID. My only problem is: do you know if `__kmpc_global_thread_num` returns that when called from a non-OpenMP thread? I'll do some local checking on that!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D132005/new/

https://reviews.llvm.org/D132005