[Openmp-commits] [PATCH] D14253: [OpenMP] Initial implementation of OpenMP offloading library - libomptarget plugins.

Thu Dec 1 14:40:02 PST 2016

grokos commandeered this revision.
grokos added a reviewer: sfantao.
grokos added inline comments.
Herald added subscribers: mgorny, nemanjai.

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:14-25
+#include <algorithm>
+#include <assert.h>
+#include <cstdio>
+#include <cuda.h>
+#include <cuda_runtime_api.h>
+#include <libelf.h>
+#include <list>
----------------
Hahnfeld wrote:
> Some of the C header files were not needed on my system (`stdio.h`, `stdlib.h`, `string.h`)
True, I've removed the unnecessary headers.

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:110-113
+    for (unsigned i = 0; i < E.Entries.size(); ++i) {
+      if (E.Entries[i].addr == addr)
+        return true;
+    }
----------------
Hahnfeld wrote:
> You could use a range-based loop
Done!

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:427-430
+    const char SuffixTL[] = "_thread_limit";
+    char *ThreadLimitName =
+        (char *)malloc((strlen(e->name) + strlen(SuffixTL)) * sizeof(char));
+    sprintf(ThreadLimitName, "%s%s", e->name, SuffixTL);
----------------
Hahnfeld wrote:
> This memory is leaked - can this be done with some C++-constructs (`stringstream` for example)?
Fixed in new diff.

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:566-573
+  // update thread limit content in gpu memory if un-initialized or changed.
+  if (KernelInfo->ThreadLimit == 0 || KernelInfo->ThreadLimit != thread_limit) {
+    // always capped by maximum number of threads in a block: even if 1 OMP
+    // thread is 1 independent CUDA thread, we may have up to max block size OMP
+    // threads if the user request thread_limit(tl) with tl > max block size, we
+    // only start max block size CUDA threads.
+    if (thread_limit > DeviceInfo.ThreadsPerBlock[device_id])
----------------
Hahnfeld wrote:
> Doesn't this mean that a default `thread_limit = 0` or a too high one will be updated for every kernel? (The variable is signed, is a negative value possible and what does it mean?)
The GPU symbol `KernelInfo->ThreadLimit` is not used anymore by our new codegen scheme and I have removed it from the new diff.

To address the second question, the variable is indeed signed but in reality it is never assigned a negative value, that would have no meaning. We chose signed because that's what libomp uses as well, e.g. `omp_set_max_threads()` takes a signed int as argument. Generally the OpenMP standard uses signed ints pretty much everywhere even if negative values do not make any sense.

================
Comment at: libomptarget/plugins/cuda/src/rtl.cpp:593-600
+  err = cuLaunchKernel(KernelInfo->Func, blocksPerGrid, 1, 1,
+                       cudaThreadsPerBlock, 1, 1, nshared, 0, &args[0], 0);
+  if (err != CUDA_SUCCESS) {
+    DP("Device kernel launching failed!\n");
+    CUDA_ERR_STRING(err);
+    assert(err == CUDA_SUCCESS && "Unable to launch target execution!");
+    return OFFLOAD_FAIL;
----------------
Hahnfeld wrote:
> So this launch is asynchronous and it gets synced when fetching data from the device.
> But what would happen if there is a `target` region immediately following another one that is not yet synced? And can you later on figure out which kernel to wait for?
All calls are inserted into the default CUDA stream. Kernels and memcopies in the default stream are executed in order, i.e. if we enqueue two kernels back to back, the latter will be executed after the former has completed. So in this version of the plugin (which does not support yet asynchronous executions and memcopies), there is an implicit sync between kernel launches.

================
Comment at: libomptarget/plugins/generic-elf-64bit/src/rtl.cpp:14-26
+#include <algorithm>
+#include <assert.h>
+#include <cstdio>
+#include <dlfcn.h>
+#include <elf.h>
+#include <ffi.h>
+#include <gelf.h>
----------------
Hahnfeld wrote:
> See other comment about header files
Done.

================
Comment at: libomptarget/plugins/generic-elf-64bit/src/rtl.cpp:110-112
+    for (std::list<DynLibTy>::iterator ii = DynLibs.begin(),
+                                       ie = DynLibs.begin();
+         ii != ie; ++ii)
----------------
Hahnfeld wrote:
> 1. `DynLibs` isn't used anywhere else?
> 2. If it is needed: I think this doesn't work as expected (from `begin` to `begin`) - then maybe rewrite as range-based loop?
Right, this is a bug. DynLibs is meant to be used to keep track of what libraries have been loaded so that we can free the resources allocated for them at the end. I have fixed the loop range in the new diff.

================
Comment at: libomptarget/plugins/generic-elf-64bit/src/rtl.cpp:243-245
+  // load dynamic library and get the entry points. We use the dl library
+  // to do the loading of the library, but we could do it directly to avoid the
+  // dump to the temporary file.
----------------
Hahnfeld wrote:
> So if it's possible, why not do it?
It is possible to avoid the temp file, but whatever method we use to do it will not be portable. Let's stick with this implementation.

https://reviews.llvm.org/D14253