[Openmp-commits] [PATCH] D14254: [OpenMP] Initial implementation of OpenMP offloading library - libomptarget device RTLs.

George Rokos via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Tue Nov 21 09:46:11 PST 2017


grokos marked 26 inline comments as done.
grokos added a comment.

I've responded to the majority of comments. We are now waiting for some other people to reply to questions related to code they wrote. I'll update the diff once all questions have been answered.



================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:63
+
+  # Get all the compute capabilities the user requested or use SM_35 by default.
+  if(LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY)
----------------
Hahnfeld wrote:
> Clang defaults to `sm_30`, should we be compatible here?
OK, I changed the default to `sm_30` and added a comment that this is what clang uses by default as well.

However, the default may change to `sm_35` due to the implementation of `__kmpc_reduce_conditional_lastprivate` which involves a call to `atomicMax` on 64-bit integers, which in turn requires a minimum of `sm_35`. Let's keep this issue open until we decide how to proceed.


================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:73-76
+  # Activate RTL message dumps if requested by the user.
+  if(LIBOMPTARGET_NVPTX_DEBUG)
+    set(CUDA_DEBUG -DOMPTARGET_NVPTX_DEBUG=-1 -g --ptxas-options=-v)
+  endif()
----------------
Hahnfeld wrote:
> Not used elsewhere and not documented to the user, remove?
I think we should keep this one. I added the LIBOMPTARGET_NVPTX_DEBUG flag to the list of NVPTX Cmake options in `Build_With_Cmake.txt`.


================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:120
+      endif()
+      set(LIBOMPTARGET_NVPTX_SELECTED_CUDA_COMPILER_FROM_TREE clang)
+    else()
----------------
Hahnfeld wrote:
> This means that this runtime library is rebuilt whenever the compiler in the tree changes which takes quite some time. Can we maybe only compile the bitcode library if CMAKE_C_COMPILER is Clang?
I'm not sure I fully understand the point here. Can you elaborate?


================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:179
+      # This is currently broken
+      set(CUDA_INCLUDES -I/usr/include/powerpc64le-linux-gnu)
+
----------------
Hahnfeld wrote:
> No
I don't know why this directory was included, but it's not needed - I've removed it.


================
Comment at: libomptarget/deviceRTLs/nvptx/src/counter_group.h:17-20
+#include <stdlib.h>
+#include <stdio.h>
+
+#include <cuda.h>
----------------
Hahnfeld wrote:
> Needed?
> 
> This file should probably include `option.h` which defines `Counter`
Right, `cuda.h` is not needed. I've included `option.h`


================
Comment at: libomptarget/deviceRTLs/nvptx/src/critical.cu:14-15
+
+#include <stdio.h>
+#include <complex.h>
+
----------------
Hahnfeld wrote:
> needed?
Probably a leftover, removed.


================
Comment at: libomptarget/deviceRTLs/nvptx/src/data_sharing.cu:34-35
+__device__ static unsigned getMasterThreadId() {
+  unsigned Mask = DS_Max_Worker_Warp_Size - 1;
+  return (getNumThreads() - 1) & (~Mask);
+}
----------------
Hahnfeld wrote:
> Suppose we have 1024 threads, why is `1023 & (~31) = 992` the master thread?
The master warp is the last one and the master thread is the first thread of that warp.


================
Comment at: libomptarget/deviceRTLs/nvptx/src/data_sharing.cu:132
+  unsigned WID = getWarpId();
+  unsigned CurActiveThreads = __BALLOT_SYNC(0xFFFFFFFF, true);
+
----------------
Hahnfeld wrote:
> Can we have this in a function with a corresponding name?
It's a single line which calls a built-in. Too simple for a dedicated function. Anyway, what do you propose?


================
Comment at: libomptarget/deviceRTLs/nvptx/src/debug.cu:21-22
+
+NOINLINE void PrintTaskDescr(omptarget_nvptx_TaskDescr *taskDescr, char *title,
+                             int level) {
+  omp_sched_t sched = taskDescr->GetRuntimeSched();
----------------
Hahnfeld wrote:
> This is used only once in `libcall.cu`. Please inline and remove this file entirely
I removed the entire `debug.cu`, it was obsolete.


================
Comment at: libomptarget/deviceRTLs/nvptx/src/libcall.cu:17
+
+#define TICK ((double) 1.0 / 745000000.0)
+
----------------
Hahnfeld wrote:
> grokos wrote:
> > This is where the hard-coded definition of the GPU clock frequency has been moved. I'll try to find a viable solution to make the library find the clock frequency dynamically.
> Yes, this doesn't sound like a good idea to have that hard-coded...
Getting the clock frequency in device code cannot be done. We can only query it on the host.

I tried having a device global variable TICK and set it via a call to `cuModuleGetGlobal(..., "TICK")` from the CUDA plugin (the plugin can query the frequency via `cudaGetDeviceProperties`). This solution did not work because `libomptarget-nvptx.a` is a static library so the clock frequency should be set at compile time. We cannot use dynamic libraries (yet?) because the CUDA toolchain does not support dynamic linking.

Eventually I implemented `omp_get_wtime()' using the `%globaltimer` register. That's the only viable option. If the register gets removed in the future, there's nothing we can do.

`omp_get_wtick()` is probably not needed. No one will ever query the time between clock ticks from within device code... I left the function there so that the linker can find it but it prints a message that this functionality is not implemented.

I leave this issue open for further discussion anyway.


================
Comment at: libomptarget/deviceRTLs/nvptx/src/libcall.cu:393-394
+
+  // TODO: should check for the lock to be SET?
+  int ret = atomicCAS(lock, compare, val);
+
----------------
Hahnfeld wrote:
> I think the comment is right here: This looks like it is setting the lock if it is unset!
This is what `omp_test_lock` should do - I've removed the comment.


================
Comment at: libomptarget/deviceRTLs/nvptx/src/libcall.cu:401
+
+//for xlf Fotran
+//Fotran, the return is LOGICAL type
----------------
Hahnfeld wrote:
> Can you check that these are still used? Especially the `debug` functions?
I checked with the Fortran people and those functions are still used, so we'll keep them.


================
Comment at: libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h:74
+  DS_Max_Worker_Warp_Size_Log2 = 5,
+  DS_Max_Worker_Warp_Size_Log2_Mask = (~0u >> (32-DS_Max_Worker_Warp_Size_Log2)),
+  // The slot size that should be reserved for a working warp.
----------------
Hahnfeld wrote:
> First, this value needs a comment what it is used for and maybe we can drop the `Log2` is its a mast related to the value, not the logarithm.
> 
> From what I understand, this value has has the 5 LSBs set to 1 which should be equivalent to `DS_Max_Worker_Warp_Size - 1`? (this also avoid relying on 32 bit integers)
I've changed its definition to the proposed `DS_Max_Worker_Warp_Size - 1` which is much clearer. This mask is used to get (threadID mod Warp_Size). I've added a comment.


================
Comment at: libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h:100
+};
+extern __device__ __shared__ DataSharingStateTy DataSharingState;
+
----------------
Hahnfeld wrote:
> This won't compile with current Clang 5.0.0:
> ```
> error: __shared__ variable 'DataSharingState' cannot be 'extern'
> ```
This looks like a clang problem. Shared variables can be extern. This code can be compiled with clang 4.0, maybe we should submit a bug report for clang 5.0?


================
Comment at: libomptarget/deviceRTLs/nvptx/src/reduction.cu:951-993
+//
+// runtime support for array reduction
+//
+
+#define ARRAYATOMIC_GENOP(_name, _dtype, _op)                                  \
+  EXTERN void __array_atomic_##_name##_##_op(                                  \
+      kmp_Indent *id_ref, int32_t gtid, _dtype *lhs, _dtype *rhs, int64_t n) { \
----------------
Hahnfeld wrote:
> These macros are never used.
Removed.


================
Comment at: libomptarget/deviceRTLs/nvptx/src/stdio.cu:18
+
+EXTERN int __kmpc_printf(const char *str) { return printf("%s", str); }
----------------
Hahnfeld wrote:
> I think this function isn't used anymore?
Correct, I removed the entire file.


Repository:
  rL LLVM

https://reviews.llvm.org/D14254





More information about the Openmp-commits mailing list