[Openmp-commits] [PATCH] D132005: [OpenMP] Add non-blocking support for target nowait regions

Wed Sep 14 13:19:06 PDT 2022

gValarini marked 5 inline comments as done.
gValarini added inline comments.

================
Comment at: openmp/libomptarget/include/omptarget.h:194
+  using PostProcFuncTy = std::function<int()>;
+  llvm::SmallVector<PostProcFuncTy> PostProcessingFunctions;
+
----------------
jdoerfert wrote:
> gValarini wrote:
> > jdoerfert wrote:
> > > Drive by: I don't believe we want such a generic interface. The postprocessing task should just be a fixed function, not multiple unknown at compile time.
> > Just adding some context to why it was done this way:
> > 
> >   - Lambdas can easily store any local data needed by the post-processing procedures. This allows them to be generated locally by the normal code flow (which may improve code maintenance) and then stored in the lambdas implicit structure (solving any lifetime problems).
> >   - Having a function vector allows us to easily compose post-processing procedures. This is the case for `target nowait` regions, that should run both the `targetDataEnd` and `processDataAfter` post-processing functions.
> >   - If in the future we need to generate more asynchronous operations inside the post-processing functions, this can be easily done by pushing more lambdas to the vector.
> > 
> > With all that in mind, I know `std::function`s may lead to an additional heap allocation and one more level of indirection due to type-erasure, prohibiting some code opts. If that is not desirable despite the presented context, I can change how the post-processing procedures are stored/called, but I really would like to keep some of the points described above, especially the lifetime correctness of the post-processing data. Do you have something in mind that could help me with that? 
> > 
> > Maybe defining a different struct for each post-processing function and storing it in the `AsyncInfoTy` as a `variant` could be enough. What do you think?
> I would have assumed if we have 2 potential callbacks, let's say static functions `F` and `G` we would have two members, both void*.
> 
> ```
> void *PayloadForF = nullptr;
> void *PayloadForG = nullptr;
> ```
> 
> and if they are not null we call `F` and `G` respectively passing the payload.
> 
> I'm fine with keeping it for now this way, we can see if there is a need to change it.
> 
Okey, for now, I'll keep it like this then.

================
Comment at: openmp/libomptarget/include/omptargetplugin.h:160
+// asynchronous operations, marking the internal queue as nullptr when
+// completed.
+int32_t __tgt_rtl_synchronize_async(int32_t ID, __tgt_async_info *AsyncInfo);
----------------
jdoerfert wrote:
> Describe the return value, it's not clear what would be returned if it does or doesn't synchronize.
Good point. I have updated both the documentation and the function name to better reflect what this new interface should do. Do you think it is more clear now?

================
Comment at: openmp/libomptarget/src/interface.cpp:143
+    Rc = AsyncInfo->synchronize(SyncType);
+  handleTargetOutcome(Rc == OFFLOAD_SUCCESS, Loc);
+
----------------
jdoerfert wrote:
> I assume we can outline some of this as it is probably the same as in the sync version, right? Let's avoid duplication as much as possible. Same below.
Yep, you are correct. I have created two new "launchers" that can unify most of the code paths for the execution and data-related functions for the normal and nowait cases. Since all data-related interface entries practically have the same signature, a single launcher is enough for them all.

A new class called `TaskAsyncInfoTy` also unifies the code related to managing the task async handle when executing target nowait regions.

What do you think about this new code structure?

================
Comment at: openmp/libomptarget/src/interface.cpp:282
+                          ArgTypes, ArgNames, ArgMappers, *AsyncInfo);
+  }
+  if (Rc == OFFLOAD_SUCCESS)
----------------
jdoerfert wrote:
> So, when do we call this but don't want to actually do `targetDataUpdate`? I am confused. Same above.
The goal is to dispatch the device side operations (i.e., call the functions in the `omptarget.cpp` file) only when a new async handle is created. If we detect that the task already contains an async handle, that means that the device side operations were already dispatched and we should only try to synchronize it in a non-blocking manner.

The new `TaskAsyncInfoTy` class has a function called `shouldDispatch` that now encapsulates this detection logic with proper documentation. Do you think it is more clear now? Should we add a comment to each call site as well?

================
Comment at: openmp/libomptarget/src/omptarget.cpp:698

+static int
+postProcessingTargetDataEnd(DeviceTy *Device,
----------------
jdoerfert wrote:
> Documentation, plz.
Done. Could you check if I missed anything?

================
Comment at: openmp/libomptarget/src/omptarget.cpp:700
+postProcessingTargetDataEnd(DeviceTy *Device,
+                            SmallVector<PostProcessingInfo> PostProcessingPtrs,
+                            void * FromMapperBase) {
----------------
tianshilei1992 wrote:
> nit: a vector of `PostProcessingInfo` called `PostProcessingPtrs` is confusing.
Ops. Thanks, I have updated the var name.

================
Comment at: openmp/libomptarget/src/omptarget.cpp:1205
                           const map_var_info_t HstPtrName = nullptr)
-        : Index(Index), HstPtrBegin(reinterpret_cast<const char *>(HstPtr)),
+        : Index(Index), HstPtrBegin(reinterpret_cast<char *>(HstPtr)),
           HstPtrEnd(HstPtrBegin + Size), AlignedSize(Size + Size % Alignment),
----------------
jdoerfert wrote:
> Why are the const problematic?
In summary, we want to be able to move `PrivateArgumentManagerTy` instances into the post-processing lambdas, so their lifetime is automatically managed by them.

The problem is with how `llvm::SmallVector` implements its move constructor. Unfortunately, it is implemented as a move-assignment instead of a proper constructor, meaning it cannot be generated for structs with const members. If we replace `FirstPrivateArgInfo` with a `std::vector`, the problem does not happen because the STL properly implements a move constructor for vectors.

Since I think we do not want to use `std::vector` anymore, I just removed the const from the members, since they are not even accessible outside the `PrivateArgumentManagerTy` class.

What do you think of this approach?

================
Comment at: openmp/libomptarget/src/omptarget.cpp:1517
+        return Ret;
+      });

----------------
jdoerfert wrote:
> FWIW, `mutable` is really not my favorite way of handling things.
`mutable` was added because we need to call a non-const member function of `PrivateArgumentManager` (i.e., `free`).

I know that makes the lambda a function with an internal state since, multiple calls to it will generate different results, but I don´t know of another approach to it.

Maybe use `call_once` (IMHO, a little bit overkill) or remove the lambdas altogether and use another approach to store the post-processing functions and their payload. What do you think?

================
Comment at: openmp/runtime/src/kmp_tasking.cpp:1145
+      KMP_ATOMIC_DEC(&__kmp_unexecuted_hidden_helper_tasks);
+    }
   }
----------------
jdoerfert wrote:
> @tianshilei1992 you need to look at these changes.
Any comments on whether we can move the `__kmp_unexecuted_hidden_helper_tasks` decrement to this place?

================
Comment at: openmp/runtime/src/kmp_tasking.cpp:5157
+  void *handle = NULL;
+  kmp_info_t *thread = __kmp_thread_from_gtid(gtid);
+  kmp_taskdata_t *taskdata = thread->th.th_current_task;
----------------
tianshilei1992 wrote:
> `libomptarget` (for now) doesn't require the thread must be an OpenMP thread. Using `libomp`'s gtid generally breaks. Either we add the requirement, which needs to be discussed further, or there it needs an alternative method to implement that. If `libomp` is executed on another fresh thread, a new root will be created.
Uhm, I did not know about that. Although I think such a requirement makes sense, it may be out of the scope of this patch. 

What we could do is check if the current thread is registered inside `libomp` somehow, falling back to the current execution path that does not depend on the task team information. Do you know we can use `__kmpc_global_thread_num`'s return value to verify that? Maybe assert that returned `GTID` is valid and within a well-known range (e.g., [0, NUM_REGISTERED_OMP_THREADS]).

Just a note, NUM_REGISTERED_OMP_THREADS is not a valid variable. I just don't know where, or even if, such information is stored. Do you know where can I find this?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D132005/new/

https://reviews.llvm.org/D132005