[Openmp-commits] [PATCH] D14254: [OpenMP] Initial implementation of OpenMP offloading library - libomptarget device RTLs.

Tue Nov 21 10:33:12 PST 2017

Hahnfeld added inline comments.

================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:63
+
+  # Get all the compute capabilities the user requested or use SM_35 by default.
+  if(LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY)
----------------
grokos wrote:
> Hahnfeld wrote:
> > Clang defaults to `sm_30`, should we be compatible here?
> OK, I changed the default to `sm_30` and added a comment that this is what clang uses by default as well.
> 
> However, the default may change to `sm_35` due to the implementation of `__kmpc_reduce_conditional_lastprivate` which involves a call to `atomicMax` on 64-bit integers, which in turn requires a minimum of `sm_35`. Let's keep this issue open until we decide how to proceed.
As I've said offline, we then have to change the default **now**. I suppose the won't compile otherwise?

================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:120
+      endif()
+      set(LIBOMPTARGET_NVPTX_SELECTED_CUDA_COMPILER_FROM_TREE clang)
+    else()
----------------
grokos wrote:
> Hahnfeld wrote:
> > This means that this runtime library is rebuilt whenever the compiler in the tree changes which takes quite some time. Can we maybe only compile the bitcode library if CMAKE_C_COMPILER is Clang?
> I'm not sure I fully understand the point here. Can you elaborate?
Currently, this will try to use the just-built compiler when building in-tree with LLVM and Clang. It does so by adding a dependence to the `clang` target which means that the files are recompiled whenever the compiler changes, ie the `clang` target has a newer modification timestamp.

IMO we shouldn't do this but always build with the `CMAKE_C_COMPILER` that all the other code uses. If that's a (recent) `Clang`, we can build the bclib. If not and it's for example a GCC, we should just deactivate bclib. Note that no other (runtime) library (neither libomp nor host libomptarget nor other libraries as libc++ or compiler-rt) use the just-built compiler for building.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/data_sharing.cu:132
+  unsigned WID = getWarpId();
+  unsigned CurActiveThreads = __BALLOT_SYNC(0xFFFFFFFF, true);
+
----------------
grokos wrote:
> Hahnfeld wrote:
> > Can we have this in a function with a corresponding name?
> It's a single line which calls a built-in. Too simple for a dedicated function. Anyway, what do you propose?
There are other single-line functions and it is not obvious to me how `__BALLOT_SYNC(0xFFFFFFFF, true)` returns `CurActiveThreads` - btw. is this a number or as mask?

================
Comment at: libomptarget/deviceRTLs/nvptx/src/libcall.cu:17
+
+#define TICK ((double) 1.0 / 745000000.0)
+
----------------
grokos wrote:
> Hahnfeld wrote:
> > grokos wrote:
> > > This is where the hard-coded definition of the GPU clock frequency has been moved. I'll try to find a viable solution to make the library find the clock frequency dynamically.
> > Yes, this doesn't sound like a good idea to have that hard-coded...
> Getting the clock frequency in device code cannot be done. We can only query it on the host.
> 
> I tried having a device global variable TICK and set it via a call to `cuModuleGetGlobal(..., "TICK")` from the CUDA plugin (the plugin can query the frequency via `cudaGetDeviceProperties`). This solution did not work because `libomptarget-nvptx.a` is a static library so the clock frequency should be set at compile time. We cannot use dynamic libraries (yet?) because the CUDA toolchain does not support dynamic linking.
> 
> Eventually I implemented `omp_get_wtime()' using the `%globaltimer` register. That's the only viable option. If the register gets removed in the future, there's nothing we can do.
> 
> `omp_get_wtick()` is probably not needed. No one will ever query the time between clock ticks from within device code... I left the function there so that the linker can find it but it prints a message that this functionality is not implemented.
> 
> I leave this issue open for further discussion anyway.
That's not a justification, I really doubt that anyone will call `omp_get_wtime()` either.

Why not just return 1 nanosecond (the resolution of `%globaltimer`) for `omp_get_wtick` as I proposed?

================
Comment at: libomptarget/deviceRTLs/nvptx/src/libcall.cu:401
+
+//for xlf Fotran
+//Fotran, the return is LOGICAL type
----------------
grokos wrote:
> Hahnfeld wrote:
> > Can you check that these are still used? Especially the `debug` functions?
> I checked with the Fortran people and those functions are still used, so we'll keep them.
I'm especially concerned about the  `debug` functions as we removed most of them

================
Comment at: libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h:100
+};
+extern __device__ __shared__ DataSharingStateTy DataSharingState;
+
----------------
arpith-jacob wrote:
> grokos wrote:
> > Hahnfeld wrote:
> > > This won't compile with current Clang 5.0.0:
> > > ```
> > > error: __shared__ variable 'DataSharingState' cannot be 'extern'
> > > ```
> > This looks like a clang problem. Shared variables can be extern. This code can be compiled with clang 4.0, maybe we should submit a bug report for clang 5.0?
> Clang doesn't support this attribute and it requires additional investigation.  For now, I would disable building the bclib version of libomptarget-nvptx so that the runtime library is built using nvcc.
This was explicitly introduced in https://reviews.llvm.org/D25125. The documentation suggests that the error is correct:
```
__device__, __shared__, and __constant__ variables cannot be defined as external using the extern keyword. The only exception is for dynamically allocated __shared__ variables as described in __shared__.
```
(http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#qualifiers)

Repository:
  rL LLVM

https://reviews.llvm.org/D14254