[Openmp-commits] [PATCH] D14253: [OpenMP] Initial implementation of OpenMP offloading library - libomptarget plugins.

Mon Feb 1 02:03:12 PST 2016

Hahnfeld added inline comments.

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:14-25
@@ +13,14 @@
+
+#include <algorithm>
+#include <assert.h>
+#include <cstdio>
+#include <cuda.h>
+#include <cuda_runtime_api.h>
+#include <libelf.h>
+#include <list>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string>
+#include <string.h>
+#include <vector>
+
----------------
Some of the C header files were not needed on my system (`stdio.h`, `stdlib.h`, `string.h`)

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:110-113
@@ +109,6 @@
+
+    for (unsigned i = 0; i < E.Entries.size(); ++i) {
+      if (E.Entries[i].addr == addr)
+        return true;
+    }
+
----------------
You could use a range-based loop

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:427-430
@@ +426,6 @@
+    // obtain cuda pointer to global tracking thread limit.
+    const char SuffixTL[] = "_thread_limit";
+    char *ThreadLimitName =
+        (char *)malloc((strlen(e->name) + strlen(SuffixTL)) * sizeof(char));
+    sprintf(ThreadLimitName, "%s%s", e->name, SuffixTL);
+
----------------
This memory is leaked - can this be done with some C++-constructs (`stringstream` for example)?

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:566-573
@@ +565,10 @@
+
+  // update thread limit content in gpu memory if un-initialized or changed.
+  if (KernelInfo->ThreadLimit == 0 || KernelInfo->ThreadLimit != thread_limit) {
+    // always capped by maximum number of threads in a block: even if 1 OMP
+    // thread is 1 independent CUDA thread, we may have up to max block size OMP
+    // threads if the user request thread_limit(tl) with tl > max block size, we
+    // only start max block size CUDA threads.
+    if (thread_limit > DeviceInfo.ThreadsPerBlock[device_id])
+      thread_limit = DeviceInfo.ThreadsPerBlock[device_id];
+
----------------
Doesn't this mean that a default `thread_limit = 0` or a too high one will be updated for every kernel? (The variable is signed, is a negative value possible and what does it mean?)

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:593-600
@@ +592,10 @@
+
+  err = cuLaunchKernel(KernelInfo->Func, blocksPerGrid, 1, 1,
+                       cudaThreadsPerBlock, 1, 1, nshared, 0, &args[0], 0);
+  if (err != CUDA_SUCCESS) {
+    DP("Device kernel launching failed!\n");
+    CUDA_ERR_STRING(err);
+    assert(err == CUDA_SUCCESS && "Unable to launch target execution!");
+    return OFFLOAD_FAIL;
+  }
+
----------------
So this launch is asynchronous and it gets synced when fetching data from the device.
But what would happen if there is a `target` region immediately following another one that is not yet synced? And can you later on figure out which kernel to wait for?

================
Comment at: libomptarget/plugins/generic-elf-64bit/src/rtl.cpp:14-26
@@ +13,15 @@
+
+#include <algorithm>
+#include <assert.h>
+#include <cstdio>
+#include <dlfcn.h>
+#include <elf.h>
+#include <ffi.h>
+#include <gelf.h>
+#include <list>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <vector>
+#include <link.h>
+
----------------
See other comment about header files

================
Comment at: libomptarget/plugins/generic-elf-64bit/src/rtl.cpp:110-112
@@ +109,5 @@
+    // Close dynamic libraries
+    for (std::list<DynLibTy>::iterator ii = DynLibs.begin(),
+                                       ie = DynLibs.begin();
+         ii != ie; ++ii)
+      if (ii->Handle) {
----------------
1. `DynLibs` isn't used anywhere else?
2. If it is needed: I think this doesn't work as expected (from `begin` to `begin`) - then maybe rewrite as range-based loop?

================
Comment at: libomptarget/plugins/generic-elf-64bit/src/rtl.cpp:243-245
@@ +242,5 @@
+
+  // load dynamic library and get the entry points. We use the dl library
+  // to do the loading of the library, but we could do it directly to avoid the
+  // dump to the temporary file.
+  //
----------------
So if it's possible, why not do it?

http://reviews.llvm.org/D14253