[Openmp-commits] [openmp] [Libomptarget] Fix Nvidia offloading hanging on dataRetrieve using RPC (PR #66817)

Tue Sep 26 14:02:16 PDT 2023

https://github.com/jhuber6 updated https://github.com/llvm/llvm-project/pull/66817

>From bc82ee4dca9513923392118b6bc788bbb953a7e2 Mon Sep 17 00:00:00 2001
From: Joseph Huber <jhuber6 at vols.utk.edu>
Date: Tue, 19 Sep 2023 15:42:29 -0500
Subject: [PATCH] [Libomptarget] Fix Nvidia offloading hanging on dataRetrieve
 using RPC

Summary:
The RPC server is responsible for providing host services from the GPU.
Generally, the client running on the GPU will spin in place until the
host checks the server. Inside the runtime, we elected to have the user
thread do this checking while it would be otherwise waiting for the
kernel to finish. However, for Nvidia this caused problems when
offloading to a target region that requires a copy back.

This is caused by the implementation of `dataRetrieve` on Nvidia. We
initialize an asynchronous copy-back on the same stream that the kernel
is running on. This creates an implicit sync on the kernel to finish
before we issue the D2H copy, which we then wait on. This implicit sync
happens inside of the CUDA runtime. This is problematic when running the
RPC server because we need someone to check the RPC server. If no one
checks the RPC server then the kernel will never finish, meaning that
the memcpy will never be issued and the program hangs. This patch adds
an explicit check for unfinished work on the stream and waits for it to
complete.
---
 .../plugins-nextgen/cuda/src/rtl.cpp          | 13 ++++++++++-
 openmp/libomptarget/test/libc/fwrite.c        | 22 +++++++++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)
 create mode 100644 openmp/libomptarget/test/libc/fwrite.c

diff --git a/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
index 44b8d349033c0ff..70b412a6e4cd893 100644
--- a/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/cuda/src/rtl.cpp
@@ -576,7 +576,18 @@ struct CUDADeviceTy : public GenericDeviceTy {
     if (auto Err = getStream(AsyncInfoWrapper, Stream))
       return Err;
 
-    CUresult Res = cuMemcpyDtoHAsync(HstPtr, (CUdeviceptr)TgtPtr, Size, Stream);
+    // If there is already pending work on the stream it could be waiting for
+    // someone to check the RPC server.
+    CUresult Res = cuStreamQuery(Stream);
+    if (auto RPCServer = getRPCServer()) {
+      while (Res == CUDA_ERROR_NOT_READY) {
+        Res = cuStreamQuery(Stream);
+        if (auto Err = RPCServer->runServer(*this))
+          return Err;
+      }
+    }
+
+    Res = cuMemcpyDtoHAsync(HstPtr, (CUdeviceptr)TgtPtr, Size, Stream);
     return Plugin::check(Res, "Error in cuMemcpyDtoHAsync: %s");
   }
 
diff --git a/openmp/libomptarget/test/libc/fwrite.c b/openmp/libomptarget/test/libc/fwrite.c
new file mode 100644
index 000000000000000..7ffb449fee6672e
--- /dev/null
+++ b/openmp/libomptarget/test/libc/fwrite.c
@@ -0,0 +1,22 @@
+// RUN: %libomptarget-compile-run-and-check-generic
+
+// REQUIRES: libc
+
+#include <assert.h>
+#include <stdio.h>
+
+#pragma omp declare target to(stdout)
+
+int main() {
+  int r = 0;
+// CHECK: PASS
+#pragma omp target map(from : r)
+  { r = fwrite("PASS\n", 1, sizeof("PASS\n") - 1, stdout); }
+  assert(r == sizeof("PASS\n") - 1 && "Incorrect number of bytes written");
+
+// CHECK: PASS
+#pragma omp target map(from : r) nowait
+  { r = fwrite("PASS\n", 1, 5, stdout); }
+#pragma omp taskwait
+  assert(r == sizeof("PASS\n") - 1 && "Incorrect number of bytes written");
+}