[Openmp-commits] [PATCH] D14254: [OpenMP] Initial implementation of OpenMP offloading library - libomptarget device RTLs.

Thu Dec 7 14:25:22 PST 2017

Hahnfeld added a comment.

Some more comments, still mostly about the build system for the `bclib`.

================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:73-76
+  # Activate RTL message dumps if requested by the user.
+  if(LIBOMPTARGET_NVPTX_DEBUG)
+    set(CUDA_DEBUG -DOMPTARGET_NVPTX_DEBUG=-1 -g --ptxas-options=-v)
+  endif()
----------------
grokos wrote:
> Hahnfeld wrote:
> > Not used elsewhere and not documented to the user, remove?
> I think we should keep this one. I added the LIBOMPTARGET_NVPTX_DEBUG flag to the list of NVPTX CMake options in `Build_With_Cmake.txt`.
This also needs to be defined as a `CACHE` variable.
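
A minimal sketch of what I mean, placed before the existing check. The default of `FALSE` and the docstring wording are my assumptions, not something taken from the patch:

```cmake
# Sketch only: the FALSE default and the docstring text are assumptions.
set(LIBOMPTARGET_NVPTX_DEBUG FALSE CACHE BOOL
  "Enable debug message dumps in the NVPTX device RTL.")

# Existing logic from the patch, unchanged.
if(LIBOMPTARGET_NVPTX_DEBUG)
  set(CUDA_DEBUG -DOMPTARGET_NVPTX_DEBUG=-1 -g --ptxas-options=-v)
endif()
```

With the variable in the cache, it shows up in `ccmake` / `cmake-gui` and keeps its value across reconfigures.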

================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:63-67
+  if(LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY)
+    set(CUDA_ARCH ${CUDA_ARCH} -gencode arch=compute_${LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY},code=sm_${LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY})
+  else()
+    set(CUDA_ARCH -arch sm_35)
+  endif()
----------------
1. No need for different cases if we don't support building multiple architectures for now.
2. `LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY` needs to be defined as a `CACHE` variable, with a default of `35` in this case.
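
Roughly something like the following, as a sketch rather than a final version. The docstring is an assumption on my side; the default of `35` matches the current fallback:

```cmake
# Sketch only: the docstring text is an assumption; 35 is the existing fallback.
set(LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY 35 CACHE STRING
  "CUDA compute capability used to build the NVPTX device RTL.")

# The variable always has a value now, so a single assignment replaces the if/else.
set(CUDA_ARCH -arch sm_${LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY})
```

Users can still override the value with `-DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY=<value>` at configure time.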

================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:104-105
+    # Trace we require something from the tree.
+    set(LIBOMPTARGET_NVPTX_SELECTED_CUDA_COMPILER_FROM_TREE "")
+    set(LIBOMPTARGET_NVPTX_SELECTED_BC_LINKER_FROM_TREE "")
+
----------------
Remove. As discussed, we should not depend on in-tree components being built.

================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:109-111
+    elseif(NOT ${LIBOMPTARGET_STANDALONE_BUILD} AND
+        EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/../../../../../tools/clang/)
+      if(${CMAKE_C_COMPILER_ID} STREQUAL "Clang")
----------------
This does not depend on a Clang checkout anymore; remove.

================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:116-122
+    else()
+      find_program(LIBOMPTARGET_NVPTX_SELECTED_CUDA_COMPILER clang++)
+      if(NOT LIBOMPTARGET_NVPTX_SELECTED_CUDA_COMPILER)
+        libomptarget_say("Cannot find a CUDA compiler capable of emitting LLVM bitcode.")
+        libomptarget_say("Please configure with flag -DLIBOMPTARGET_NVPTX_CUDA_COMPILER")
+      endif()
+    endif()
----------------
(Probably this case will go then too...)

================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:124-139
+    if (NOT LIBOMPTARGET_NVPTX_BC_LINKER STREQUAL "")
+      set(LIBOMPTARGET_NVPTX_SELECTED_BC_LINKER ${LIBOMPTARGET_NVPTX_BC_LINKER})
+    elseif(NOT ${LIBOMPTARGET_STANDALONE_BUILD})
+        if(MSVC)
+          set(LIBOMPTARGET_NVPTX_SELECTED_BC_LINKER ${LLVM_TOOLS_BINARY_DIR}/llvm-link.exe)
+        else()
+          set(LIBOMPTARGET_NVPTX_SELECTED_BC_LINKER ${LLVM_TOOLS_BINARY_DIR}/llvm-link)
----------------
1. Do not depend on in-tree components.
2. I think you should be able to do the "link" step with Clang as well, can you please test this?

================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:171-175
+      if(LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY)
+        set(CUDA_ARCH ${CUDA_ARCH} --cuda-gpu-arch=sm_${LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY})
+      else()
+        set(CUDA_ARCH --cuda-gpu-arch=sm_35)
+      endif()
----------------
No need for different cases if we don't support multiple architectures.

================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:200
+            -o ${CMAKE_CURRENT_BINARY_DIR}/libomptarget-nvptx.bc ${bc_files}
+          DEPENDS ${bc_files} ${LIBOMPTARGET_NVPTX_SELECTED_BC_LINKER_FROM_TREE}
+          COMMENT "Linking LLVM bitcode libomptarget-nvptx.bc"
----------------
This should not depend on `LIBOMPTARGET_NVPTX_SELECTED_BC_LINKER_FROM_TREE`, as discussed.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/counter_group.h:17-20
+#include <stdlib.h>
+#include <stdio.h>
+
+#include <cuda.h>
----------------
grokos wrote:
> Hahnfeld wrote:
> > Needed?
> > 
> > This file should probably include `option.h` which defines `Counter`
> Right, `cuda.h` is not needed. I've included `option.h`
`stdlib.h` and `stdio.h` are still there...

================
Comment at: libomptarget/deviceRTLs/nvptx/src/libcall.cu:17
+
+#define TICK ((double) 1.0 / 745000000.0)
+
----------------
grokos wrote:
> Hahnfeld wrote:
> > grokos wrote:
> > > Hahnfeld wrote:
> > > > grokos wrote:
> > > > > This is where the hard-coded definition of the GPU clock frequency has been moved. I'll try to find a viable solution to make the library find the clock frequency dynamically.
> > > > Yes, this doesn't sound like a good idea to have that hard-coded...
> > > Getting the clock frequency in device code cannot be done. We can only query it on the host.
> > > 
> > > I tried having a device global variable TICK and set it via a call to `cuModuleGetGlobal(..., "TICK")` from the CUDA plugin (the plugin can query the frequency via `cudaGetDeviceProperties`). This solution did not work because `libomptarget-nvptx.a` is a static library so the clock frequency should be set at compile time. We cannot use dynamic libraries (yet?) because the CUDA toolchain does not support dynamic linking.
> > > 
> > > Eventually I implemented `omp_get_wtime()` using the `%globaltimer` register. That's the only viable option. If the register gets removed in the future, there's nothing we can do.
> > > 
> > > `omp_get_wtick()` is probably not needed. No one will ever query the time between clock ticks from within device code... I left the function there so that the linker can find it but it prints a message that this functionality is not implemented.
> > > 
> > > I leave this issue open for further discussion anyway.
> > That's not a justification, I really doubt that anyone will call `omp_get_wtime()` either.
> > 
> > Why not just return 1 nanosecond (the resolution of `%globaltimer`) for `omp_get_wtick` as I proposed?
> OK, I got your point. It's better than having nothing. I've changed the code to return 1ns.
`TICK` can go away then (hopefully).

================
Comment at: libomptarget/deviceRTLs/nvptx/src/libcall.cu:29
+  asm("mov.u64  %0, %%globaltimer;" : "=l"(nsecs));
+  double rc = (double) nsecs / 1E9;
+  PRINT(LD_IO, "call omp_get_wtime() returns %g\n", rc);
----------------
Please reuse `TIMER_PRECISION` here.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/libcall.cu:401-402
+EXTERN int omp_test_lock(omp_lock_t *lock) {
+  // int atomicCAS(int* address, int compare, int val);
+  // (old == compare ? val : old)
+  int compare = UNSET;
----------------
What's this comment about?

Repository:
  rL LLVM

https://reviews.llvm.org/D14254