[Openmp-commits] [PATCH] D14253: [OpenMP] Initial implementation of OpenMP offloading library - libomptarget plugins.

George Rokos via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Thu Jan 26 19:36:39 PST 2017


grokos added inline comments.


================
Comment at: libomptarget/CMakeLists.txt:107
+  # Build offloading plugins and device RTLs if they are available.
+  add_subdirectory(plugins)
+  add_subdirectory(deviceRTLs)
----------------
Hahnfeld wrote:
> Only this line should be in the diff
Fixed.


================
Comment at: libomptarget/plugins/CMakeLists.txt:40
+      install(TARGETS "omptarget.rtl.${tmachine_libname}" 
+        LIBRARY DESTINATION "lib")
+        
----------------
Hahnfeld wrote:
> This should also take `LLVM_LIBDIR_SUFFIX` into account
I'm now using `LIBOMPTARGET_LIBDIR_SUFFIX` as defined in the root CMakeLists.txt.


================
Comment at: libomptarget/plugins/cuda/CMakeLists.txt:22
+    
+      if(CMAKE_BUILD_TYPE MATCHES Debug)
+          add_definitions(-DCUDA_ERROR_REPORT)
----------------
Hahnfeld wrote:
> `LIBOMPTARGET_CMAKE_BUILD_TYPE` to match against lowercase?
Done.


================
Comment at: libomptarget/plugins/cuda/CMakeLists.txt:31
+      # Install plugin under the lib destination folder.
+      install(TARGETS omptarget.rtl.cuda LIBRARY DESTINATION "lib")
+      
----------------
Hahnfeld wrote:
> `LLVM_LIBDIR_SUFFIX`
Same as above.


================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:593-600
+  err = cuLaunchKernel(KernelInfo->Func, blocksPerGrid, 1, 1,
+                       cudaThreadsPerBlock, 1, 1, nshared, 0, &args[0], 0);
+  if (err != CUDA_SUCCESS) {
+    DP("Device kernel launching failed!\n");
+    CUDA_ERR_STRING(err);
+    assert(err == CUDA_SUCCESS && "Unable to launch target execution!");
+    return OFFLOAD_FAIL;
----------------
Hahnfeld wrote:
> grokos wrote:
> > Hahnfeld wrote:
> > > So this launch is asynchronous and it gets synced when fetching data from the device.
> > > But what would happen if there is a `target` region immediately following another one that is not yet synced? And can you later on figure out which kernel to wait for?
> > All calls are inserted into the default CUDA stream. Kernels and memcopies in the default stream are executed in order, i.e. if we enqueue two kernels back to back, the latter will be executed after the former has completed. So in this version of the plugin (which does not support yet asynchronous executions and memcopies), there is an implicit sync between kernel launches.
> Okay, so I now understand that this works technically. However, this may not be very intuitive for the user:
> 
> ```lang=C
> #pragma omp target data map(...)
> {
>   double start = omp_get_wtime();
>   #pragma omp target
>   {
>     // ...
>   }
>   // no data transfer back
>   double runtime = omp_get_wtime() - start;
> }
> ```
> 
> `runtime` will then only measure the time needed to launch the kernel although the target region is synchronous. (I fully understand that there will be a sync before data is copied back, but that's not my point)
Right, and in fact we've already had complaints about misleading time measurements. I have added a `cudaDeviceSynchronize` after the kernel launch.


================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:25-31
+#ifndef TARGET_NAME
+#define TARGET_NAME Generic - 64bit
+#endif
+
+#define GETNAME2(name) #name
+#define GETNAME(name) GETNAME2(name)
+#define DP(...) DEBUGP("Target " GETNAME(TARGET_NAME) " RTL", __VA_ARGS__)
----------------
Hahnfeld wrote:
> Can `TARGET_NAME` be hardcoded to `CUDA`?
`TARGET_NAME` is set via CMake. We could hardcode it, but the current scheme gives us the flexibility to change the name at will (e.g. to add version info) without touching the source code.


================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:46-52
+/// Account the memory allocated per device.
+struct AllocMemEntryTy {
+  int64_t TotalSize;
+  std::vector<std::pair<void *, int64_t>> Ptrs;
+
+  AllocMemEntryTy() : TotalSize(0) {}
+};
----------------
Hahnfeld wrote:
> This is never used
It's a leftover from the initial code; I've removed it in the new diff.


================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:197-213
+    // Get environment variables regarding teams
+    char *envStr = getenv("OMP_TEAM_LIMIT");
+    if (envStr) {
+      // OMP_TEAM_LIMIT has been set
+      EnvTeamLimit = std::stoi(envStr);
+      DP("Parsed OMP_TEAM_LIMIT=%d\n", EnvTeamLimit);
+    } else {
----------------
Hahnfeld wrote:
> These are not in the standard, should this really start with `OMP_`?
The next revision of the standard will define env vars for setting the desired number of teams and the team limit. We will then update the plugin with the final names. For now, let's keep these temporary names, since they are already used by our buildbot for certain tests.


================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:496
+
+      err = cuMemcpyDtoH(&ExecModeVal, (CUdeviceptr)ExecModePtr, cusize);
+      if (err != CUDA_SUCCESS) {
----------------
Hahnfeld wrote:
> Unnecessary cast to `CUdeviceptr`?
Fixed.


================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:684
+
+  int nshared = 0;
+
----------------
Hahnfeld wrote:
> Is that going to be used in the future? If so, please add a comment
No, for the time being we have no plans to use dynamically allocated shared memory. To avoid confusion, I've removed this variable in the new diff and pass a hard-coded 0 to the kernel launch.


================
Comment at: libomptarget/plugins/generic-elf-64bit/src/rtl.cpp:50-56
+/// Account the memory allocated per device.
+struct AllocMemEntryTy {
+  int64_t TotalSize;
+  std::vector<std::pair<void *, int64_t>> Ptrs;
+
+  AllocMemEntryTy() : TotalSize(0) {}
+};
----------------
Hahnfeld wrote:
> This is never used
Same as above.


================
Comment at: libomptarget/plugins/generic-elf-64bit/src/rtl.cpp:129-170
+  // Is the library version incompatible with the header file?
+  if (elf_version(EV_CURRENT) == EV_NONE) {
+    DP("Incompatible ELF library!\n");
+    return 0;
+  }
+
+  char *img_begin = (char *)image->ImageStart;
----------------
Hahnfeld wrote:
> This looks the same as in the CUDA plugin. Can all this be refactored into a function?
True, it's the very same code, but that's coincidental. Another plugin for a different architecture may implement this interface function in a different way. Plugins are meant to be developed independently of each other, and their code should not rely upon the existence of some common functionality / refactored code.


https://reviews.llvm.org/D14253




