[Openmp-commits] [PATCH] D14253: [OpenMP] Initial implementation of OpenMP offloading library - libomptarget plugins.

Jonas Hahnfeld via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Mon Dec 5 00:51:47 PST 2016


Hahnfeld added inline comments.


================
Comment at: libomptarget/CMakeLists.txt:107
+  # Build offloading plugins and device RTLs if they are available.
+  add_subdirectory(plugins)
+  add_subdirectory(deviceRTLs)
----------------
Only this line should be in the diff


================
Comment at: libomptarget/plugins/CMakeLists.txt:40
+      install(TARGETS "omptarget.rtl.${tmachine_libname}" 
+        LIBRARY DESTINATION "lib")
+        
----------------
This should also take `LLVM_LIBDIR_SUFFIX` into account
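Roughly something like this (just a sketch, reusing `tmachine_libname` from the patch):

```lang=cmake
# Sketch: honor LLVM_LIBDIR_SUFFIX so the plugin ends up in e.g. lib64/.
install(TARGETS "omptarget.rtl.${tmachine_libname}"
  LIBRARY DESTINATION "lib${LLVM_LIBDIR_SUFFIX}")
```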


================
Comment at: libomptarget/plugins/cuda/CMakeLists.txt:22
+    
+      if(CMAKE_BUILD_TYPE MATCHES Debug)
+          add_definitions(-DCUDA_ERROR_REPORT)
----------------
Use `LIBOMPTARGET_CMAKE_BUILD_TYPE` here to match against the lowercased build type?
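For illustration (sketch only, assuming `LIBOMPTARGET_CMAKE_BUILD_TYPE` is the lowercased copy of `CMAKE_BUILD_TYPE` set in the top-level CMakeLists.txt):

```lang=cmake
# Sketch: match against the lowercased build type instead of CMAKE_BUILD_TYPE.
if(LIBOMPTARGET_CMAKE_BUILD_TYPE MATCHES debug)
  add_definitions(-DCUDA_ERROR_REPORT)
endif()
```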


================
Comment at: libomptarget/plugins/cuda/CMakeLists.txt:31
+      # Install plugin under the lib destination folder.
+      install(TARGETS omptarget.rtl.cuda LIBRARY DESTINATION "lib")
+      
----------------
Same as above: `LLVM_LIBDIR_SUFFIX` should be taken into account here, too.


================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:593-600
+  err = cuLaunchKernel(KernelInfo->Func, blocksPerGrid, 1, 1,
+                       cudaThreadsPerBlock, 1, 1, nshared, 0, &args[0], 0);
+  if (err != CUDA_SUCCESS) {
+    DP("Device kernel launching failed!\n");
+    CUDA_ERR_STRING(err);
+    assert(err == CUDA_SUCCESS && "Unable to launch target execution!");
+    return OFFLOAD_FAIL;
----------------
grokos wrote:
> Hahnfeld wrote:
> > So this launch is asynchronous and it gets synced when fetching data from the device.
> > But what would happen if there is a `target` region immediately following another one that is not yet synced? And can you later on figure out which kernel to wait for?
> All calls are inserted into the default CUDA stream. Kernels and memory copies in the default stream are executed in order, i.e. if we enqueue two kernels back to back, the latter will be executed after the former has completed. So in this version of the plugin (which does not yet support asynchronous execution and memory copies), there is an implicit sync between kernel launches.
Okay, so I now understand that this works technically. However, this may not be very intuitive for the user:

```lang=C
#pragma omp target data map(...)
{
  double start = omp_get_wtime();
  #pragma omp target
  {
    // ...
  }
  // no data transfer back
  double runtime = omp_get_wtime() - start;
}
```

`runtime` will then only measure the time needed to launch the kernel, even though the `target` region is synchronous. (I fully understand that there will be a sync before data is copied back, but that's not my point.)
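Just to illustrate what I mean: one option would be an explicit sync right after the launch, so control only returns to the host once the kernel has finished (a sketch based on the quoted code, not a request to implement it exactly like this):

```lang=C
err = cuLaunchKernel(KernelInfo->Func, blocksPerGrid, 1, 1,
                     cudaThreadsPerBlock, 1, 1, nshared, 0, &args[0], 0);
if (err != CUDA_SUCCESS) {
  DP("Device kernel launching failed!\n");
  CUDA_ERR_STRING(err);
  return OFFLOAD_FAIL;
}

// Block until the kernel has finished so the host-side target region
// really behaves synchronously (at the cost of losing overlap).
err = cuCtxSynchronize();
if (err != CUDA_SUCCESS) {
  DP("Kernel execution failed!\n");
  CUDA_ERR_STRING(err);
  return OFFLOAD_FAIL;
}
```

Whether that cost is acceptable, or whether the plugin should instead track pending kernels per device once asynchronous execution is supported, is a separate discussion.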


================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:25-31
+#ifndef TARGET_NAME
+#define TARGET_NAME Generic - 64bit
+#endif
+
+#define GETNAME2(name) #name
+#define GETNAME(name) GETNAME2(name)
+#define DP(...) DEBUGP("Target " GETNAME(TARGET_NAME) " RTL", __VA_ARGS__)
----------------
Can `TARGET_NAME` be hardcoded to `CUDA`?


================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:46-52
+/// Account the memory allocated per device.
+struct AllocMemEntryTy {
+  int64_t TotalSize;
+  std::vector<std::pair<void *, int64_t>> Ptrs;
+
+  AllocMemEntryTy() : TotalSize(0) {}
+};
----------------
This is never used


================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:197-213
+    // Get environment variables regarding teams
+    char *envStr = getenv("OMP_TEAM_LIMIT");
+    if (envStr) {
+      // OMP_TEAM_LIMIT has been set
+      EnvTeamLimit = std::stoi(envStr);
+      DP("Parsed OMP_TEAM_LIMIT=%d\n", EnvTeamLimit);
+    } else {
----------------
These are not in the standard; should they really start with `OMP_`?
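If a different prefix is chosen, something like this would keep the behaviour but move it out of the `OMP_` namespace (`OMPTARGET_TEAM_LIMIT` is only a placeholder name):

```lang=C
// Placeholder name -- the point is only the prefix, not the exact spelling.
char *envStr = getenv("OMPTARGET_TEAM_LIMIT");
if (envStr) {
  EnvTeamLimit = std::stoi(envStr);
  DP("Parsed OMPTARGET_TEAM_LIMIT=%d\n", EnvTeamLimit);
}
```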


================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:496
+
+      err = cuMemcpyDtoH(&ExecModeVal, (CUdeviceptr)ExecModePtr, cusize);
+      if (err != CUDA_SUCCESS) {
----------------
Unnecessary cast to `CUdeviceptr`?


================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:684
+
+  int nshared = 0;
+
----------------
Is that going to be used in the future? If so, please add a comment.


================
Comment at: libomptarget/plugins/generic-elf-64bit/src/rtl.cpp:50-56
+/// Account the memory allocated per device.
+struct AllocMemEntryTy {
+  int64_t TotalSize;
+  std::vector<std::pair<void *, int64_t>> Ptrs;
+
+  AllocMemEntryTy() : TotalSize(0) {}
+};
----------------
This is never used


================
Comment at: libomptarget/plugins/generic-elf-64bit/src/rtl.cpp:129-170
+  // Is the library version incompatible with the header file?
+  if (elf_version(EV_CURRENT) == EV_NONE) {
+    DP("Incompatible ELF library!\n");
+    return 0;
+  }
+
+  char *img_begin = (char *)image->ImageStart;
----------------
This looks the same as in the CUDA plugin. Can all this be refactored into a function?
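For instance, something along these lines could be shared between the plugins (a rough sketch with invented names; it assumes libelf and the plugins' `DP` macro):

```lang=C
// Hypothetical shared helper (e.g. in a common header used by both plugins):
// returns nonzero iff the image is a 64-bit ELF object with the expected
// e_machine value.
static inline int elf_check_machine(__tgt_device_image *image,
                                    uint16_t target_id) {
  // Is the library version incompatible with the header file?
  if (elf_version(EV_CURRENT) == EV_NONE) {
    DP("Incompatible ELF library!\n");
    return 0;
  }

  char *img_begin = (char *)image->ImageStart;
  char *img_end = (char *)image->ImageEnd;
  size_t img_size = img_end - img_begin;

  // Open the in-memory image and inspect its ELF header.
  Elf *e = elf_memory(img_begin, img_size);
  if (!e) {
    DP("Unable to get ELF handle: %s!\n", elf_errmsg(-1));
    return 0;
  }

  Elf64_Ehdr *eh64 = elf64_getehdr(e);  // NULL if not a 64-bit ELF object.
  int ret = eh64 && eh64->e_machine == target_id;

  elf_end(e);
  return ret;
}
```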


https://reviews.llvm.org/D14253




