[Openmp-commits] [PATCH] D154312: [Libomptarget] Begin implementing support for RPC services

Joseph Huber via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Mon Jul 3 13:32:59 PDT 2023


jhuber6 added inline comments.


================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:533
+      return Plugin::success();
+    }
+
----------------
jdoerfert wrote:
> jhuber6 wrote:
> > jdoerfert wrote:
> > > This I don't get. We wait for 8k units, then we run the server once, then we wait again if the kernel is not done. Why isn't this part of the loop below with a timeout that adjusts based on the RPC needs?
> > I wasn't sure how to mesh it with the existing logic, since we will presumably have streams that didn't come from kernel launches, and we have an active-timeout exception here where I guess we query at a higher rate than in the blocking case? I just worked around it, but I could try to integrate it. AMDGPU was much more difficult here than CUDA because of the indirection.
> I'm not convinced the thread starting the thing should run the server, but there is merit to having this option. However, the timeout still seems wrong to me. The fact that we have two locations we wait on (finish and RPC) is also not great.
Yeah, the alternative is to run a dedicated server thread, but I asked Shilei and he said that this should be fine. When running a dedicated thread I ran into problems, e.g. a synchronous kernel can exit without waiting for any async RPC calls to finish (exit, free, and maybe an async print variant?). Secondly, when exiting, it was tough to tell whether the thread was alive or who should clean things up.
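
For illustration only, a minimal sketch of the single-loop alternative discussed above, assuming made-up names (SignalDoneWithin, RPCServerTy::RunServer, and the timeout parameters) rather than the actual AMDGPU RTL interfaces:

  #include <cstdint>
  #include <functional>

  // Placeholder stand-in for the plugin's RPC server handle.
  struct RPCServerTy {
    std::function<bool()> RunServer; // Runs one server iteration; false on error.
  };

  // Hypothetical single wait loop: poll the completion signal with a short
  // timeout while an RPC server is attached, servicing device requests
  // (malloc, print, ...) between polls instead of waiting in two places.
  inline bool waitForStream(const std::function<bool(uint64_t)> &SignalDoneWithin,
                            RPCServerTy *RPCServer, uint64_t ActiveTimeoutUS,
                            uint64_t BlockingTimeoutUS) {
    const uint64_t TimeoutUS = RPCServer ? ActiveTimeoutUS : BlockingTimeoutUS;
    while (!SignalDoneWithin(TimeoutUS)) {
      if (RPCServer && !RPCServer->RunServer())
        return false;
    }
    return true;
  }

The point of the sketch is that the RPC polling rate and the kernel-completion wait share one timeout decision instead of two separate wait sites.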


================
Comment at: openmp/libomptarget/plugins-nextgen/common/PluginInterface/RPC.cpp:69-104
+  // Register a custom opcode handler to perform plugin specific allocation.
+  // FIXME: We need to make sure this uses asynchronous allocations on CUDA.
+  auto MallocHandler = [](rpc_port_t Port, void *Data) {
+    rpc_recv_and_send(
+        Port,
+        [](rpc_buffer_t *Buffer, void *Data) {
+          plugin::GenericDeviceTy &Device =
----------------
jdoerfert wrote:
> jhuber6 wrote:
> > jdoerfert wrote:
> > > Can the GenericDevice register these handlers?
> > > A vector of them somewhere, or just let it call a register callback.
> > > I would not make these special and define them here.
> > > Move all into the GenericDevice and it will call an Impl method for the subclasses to register more.
> > So this will probably only be necessary for things like malloc / memcpy. I figured that, since those are already provided by the generic device, we should be able to hide that complexity in here, right next to the implementation. It didn't feel like it warranted an abstraction.
> But this takes away the extension point and splits the callbacks. We might want to use different mallocs per kernel, and having all callbacks in "one place" seems like a good idea as well.
I figured that this portion of the malloc would be pretty straightforward, since I think its usage should be more similar to setting the breakpoint, i.e. we only do the RPC path if we run out of some preallocated buffer space.
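
A minimal sketch of the extension point being discussed, assuming hypothetical names (RPCServerStub, GenericDeviceSketch, registerRPCHandlersImpl, and the placeholder opcode) rather than the actual PluginInterface declarations:

  #include <cstdint>
  #include <functional>
  #include <utility>
  #include <vector>

  // Stand-in for the real callback-registration interface of the RPC library.
  struct RPCServerStub {
    void registerCallback(uint16_t Opcode, std::function<void(void *)> Handler) {
      Callbacks.emplace_back(Opcode, std::move(Handler));
    }
    std::vector<std::pair<uint16_t, std::function<void(void *)>>> Callbacks;
  };

  // Hypothetical shape of the extension point: the generic device keeps every
  // opcode handler in one list and calls an Impl hook so each plugin (AMDGPU,
  // CUDA) can append its own before everything is handed to the server.
  struct GenericDeviceSketch {
    using HandlerTy = std::pair<uint16_t, std::function<void(void *)>>;

    void registerRPCHandlers(RPCServerStub &Server) {
      // A generic malloc handler backed by the device allocator could live
      // here; CUDA might later swap in an async-allocation variant.
      Handlers.push_back({/*placeholder opcode*/ 0x100, [](void *) {}});
      registerRPCHandlersImpl(Handlers);
      for (auto &[Opcode, Handler] : Handlers)
        Server.registerCallback(Opcode, Handler);
    }

    // Subclasses override this to register plugin-specific opcodes.
    virtual void registerRPCHandlersImpl(std::vector<HandlerTy> &) {}
    virtual ~GenericDeviceSketch() = default;

  private:
    std::vector<HandlerTy> Handlers;
  };

The design choice under debate is only where the handlers live: defined next to the generic malloc implementation, or collected in the generic device so plugins can add or replace them.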


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154312/new/

https://reviews.llvm.org/D154312


