[Openmp-commits] [PATCH] D14254: [OpenMP] Initial implementation of OpenMP offloading library - libomptarget device RTLs.

Tue Nov 7 14:57:53 PST 2017

grokos marked 19 inline comments as done.
grokos added inline comments.

================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:59-61
+  include_directories(
+      src/
+  )
----------------
Hahnfeld wrote:
> Should not be needed, at least by CUDA can correctly find relative paths in `#include "omptarget-nvptx.h"`
Removed in new diff.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/cancel.cu:14
+
+#include "../../../deviceRTLs/nvptx/src/omptarget-nvptx.h"
+
----------------
Hahnfeld wrote:
> I think this can be `#include "omptarget-nvptx.h"` (other files as well)
Done in new diff (for other files as well).

================
Comment at: libomptarget/deviceRTLs/nvptx/src/libcall.cu:16-17
+#include "../../../deviceRTLs/nvptx/src/omptarget-nvptx.h"
+NOINLINE void PrintTaskDescr(omptarget_nvptx_TaskDescr *taskDescr, char *title,
+                             int level);
+
----------------
Hahnfeld wrote:
> Move this declaration into `debug.h`?
Done in new diff.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/libcall.cu:142
+  int rc = 1;
+  PRINT(LD_IO, "call omp_get_nested() returns %d\n", rc);
+  return rc;
----------------
Hahnfeld wrote:
> Typo, should say `omp_get_max_active_levels`
Done!

================
Comment at: libomptarget/deviceRTLs/nvptx/src/libcall.cu:247-254
+  if (kind >= omp_sched_static && kind < omp_sched_auto) {
+    omptarget_nvptx_TaskDescr *currTaskDescr = getMyTopTaskDescriptor();
+    currTaskDescr->SetRuntimeSched(kind);
+    currTaskDescr->RuntimeChunkSize() = modifier;
+    PRINT(LD_IOD, "omp_set_schedule did set sched %d & modif %d\n",
+          (int)currTaskDescr->GetRuntimeSched(),
+          currTaskDescr->RuntimeChunkSize());
----------------
Hahnfeld wrote:
> This makes it impossible to later reset the schedule to this combinition - can this happen in theory?
I think it is the user's responsibility to cache the old scheduling method if they wish to restore it later. I couldn't find anything in the standard implying that resetting to a previous scheduling scheme must be supported by the runtime. Does the host runtime do that?

================
Comment at: libomptarget/deviceRTLs/nvptx/src/loop.cu:18-19
+
+EXTERN void CompleteCG(omptarget_nvptx_CounterGroup &cg, Counter *priv,
+                       Counter n);
+
----------------
Hahnfeld wrote:
> No other usage?
Obsolete function - removed in new diff.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.cu:55-56
+    // init global icv
+    omptarget_nvptx_threadPrivateContext->GlobalICV()->gpuCycleTime =
+        1.0 / 745000000.0; // host reports 745 mHz
+    omptarget_nvptx_threadPrivateContext->GlobalICV()->cancelPolicy =
----------------
Hahnfeld wrote:
> Hard-coded?
Kernel initialization has been vastly modified in the new diff and setting the GPU cycle time is no longer there altogether. However, you're right, the clock speed has been hardcoded because it was easiest to do so. I will change that behavior in the new diff so that the clock speed is set dynamically.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/option.h:34-40
+#ifdef OMPTHREAD_IS_WARP
+// assume here one OpenMP thread per CUDA warp
+#define MAX_NUM_OMP_THREADS MAX_NUM_WARPS
+#else
+// assume here one OpenMP thread per CUDA thread
+#define MAX_NUM_OMP_THREADS MAX_NUM_THREADS
+#endif
----------------
Hahnfeld wrote:
> A feature-switch?
What do you mean?

================
Comment at: libomptarget/deviceRTLs/nvptx/src/parallel.cu:155-156
+    newTaskDescr->ThreadId() /= numLanes;
+    // newTaskDescr->ThreadsInTeam(); // =
+    // newTaskDescr->ThreadsInTeam()/numLanes;
+  }
----------------
Hahnfeld wrote:
> dead code?
Code has been cleaned up in new diff.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/reduction.cu:20
+
+EXTERN void omp_reduction_op_gpu(char *, char *);
+
----------------
Hahnfeld wrote:
> Needed?
Has been removed.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/reduction.cu:97-116
+  int globalThreadId = GetGlobalThreadId();
+  omptarget_nvptx_TaskDescr *currTaskDescr =
+      getMyTopTaskDescriptor(globalThreadId);
+  int numthread;
+  if (currTaskDescr->IsParallelConstruct()) {
+    numthread = omp_get_num_threads();
+  } else {
----------------
Hahnfeld wrote:
> Same as `__kmpc_reduce_gpu` (and `__kmpc_reduce_nowait41`)?
`__kmpc_reduce_nowait41` has been removed from new diff.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/reduction.cu:307
+
+#define ATOMIC_GENOP_DC(_op)                                                   \
+  EXTERN void __kmpc_atomic_cmplx8_##_op(kmp_Indent *id_ref, int32_t gtid,     \
----------------
Hahnfeld wrote:
> `__kmpc_atomic_cmplx8_##_op##_cpt`?
Added in new diff.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/reduction.cu:330-350
+// implementation with shared
+#define ATOMIC_GENOP_DC_obsolete(_op)                                          \
+  EXTERN void __kmpc_atomic_cmplx16_##_op(kmp_Indent *id_ref, int32_t gtid,    \
+                                          double _Complex *lhs,                \
+                                          double _Complex rhs) {               \
+    __shared__ unsigned int stepinblock;                                       \
+    unsigned tnum = __ballot(1);                                               \
----------------
Hahnfeld wrote:
> Not needed anymore?
Has been removed in new diff.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/reduction.cu:352-355
+ATOMIC_GENOP_DC(add);
+ATOMIC_GENOP_DC(sub);
+ATOMIC_GENOP_DC(mul);
+ATOMIC_GENOP_DC(div);
----------------
Hahnfeld wrote:
> `rev` versions for `sub` and `div`?
Added in new diff.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/reduction.cu:516-533
+#define ATOMIC_GENOP_ALL_MIXED_FIXED8U(_name, _dirname, _tname, _optype)       \
+  _dirname(_tname, _optype, add, Add);                                         \
+  _name(_tname, _optype, sub);                                                 \
+  _name##_REV(_tname, _optype, sub);                                           \
+  _name(_tname, _optype, mul);                                                 \
+  _name(_tname, _optype, div);                                                 \
+  _name##_REV(_tname, _optype, div);                                           \
----------------
Hahnfeld wrote:
> Below `ATOMIC_GENOP_ALL_MIXED` is also used for `fixed8u` - where there any changes planned on purpose?
Macro `ATOMIC_GENOP_ALL_MIXED_FIXED8U` has been removed in new diff, we only use `ATOMIC_GENOP_ALL_MIXED` now.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/reduction.cu:663-666
+template <>
+INLINE __device__ float Compute<float, omptarget_nvptx_add>(float a, float b) {
+  return a + b;
+}
----------------
Hahnfeld wrote:
> Could you add a comment explaining why this has to be specialized?
A comment has been added in the new diff. The specialization is useful in order to avoid the switch at runtime.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/reduction.cu:916-919
+template <>
+INLINE __device__ int myshfldown<int>(int val, unsigned int delta, int size) {
+  return __shfl_down(val, delta, size);
+}
----------------
Hahnfeld wrote:
> Is there a need to specialize this template?
`myshfldown` has been removed altogether from new diff. It was a temporary implementation. The new diff has proper `__kmpc_shuffle_int32` / `__kmpc_shuffle_int64` functions.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/reduction.cu:961-971
+  case omptarget_nvptx_inc:
+  case omptarget_nvptx_dec:
+  case omptarget_nvptx_add:
+  case omptarget_nvptx_sub:
+  case omptarget_nvptx_sub_rev:
+    return (T)0;
+  case omptarget_nvptx_mul:
----------------
Hahnfeld wrote:
> Other types? Are they correctly defaulting?
In the new version this switch is slightly different. `mul`, `div`, `div_rev`, `andl` and `andb` return `(T)1`, all others default to `T(0)`.

https://reviews.llvm.org/D14254