[Openmp-commits] [PATCH] D14254: [OpenMP] Initial implementation of OpenMP offloading library - libomptarget device RTLs.

Sun Jan 7 10:09:07 PST 2018

Hahnfeld added a comment.

Two global remarks:

1. I think we agreed on having `<thread id> % <warp size>` instead of bit operations.
2. Somewhat related, there are currently at least four different "symbols" for the warp size: `warpSize`, `WARPSIZE`, `DS_Max_Worker_Warp_Size`, and `DS_Max_Warp_Number`. This needs to be exactly //one// that is used throughout the runtime.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/counter_groupi.h:14
+
+#include <cuda_runtime.h>
+#include "option.h"
----------------
Does this file need to include `cuda_runtime.h`?

================
Comment at: libomptarget/deviceRTLs/nvptx/src/interface.h:360-361
+// Support for reducing conditional lastprivate variables
+EXTERN void __kmpc_reduce_conditional_lastprivate(kmp_Indent *loc,
+  int32_t global_tid, int32_t varNum, void *array);
+
----------------
Will the host runtime use the same entry point? Because it's not yet present in `libomp`

================
Comment at: libomptarget/deviceRTLs/nvptx/src/interface.h:364-368
+EXTERN int32_t __kmpc_reduce41(kmp_Indent *loc, int32_t global_tid,
+                               int32_t varNum, size_t reduceSize,
+                               void *reduceData, void *reduceArraySize,
+                               kmp_ReductFctPtr *reductFct,
+                               kmp_CriticalName *lock);
----------------
I think this went away in previous update of this patch.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h:74
+  DS_Max_Worker_Warp_Size_Log2 = 5,
+  DS_Max_Worker_Warp_Size_Log2_Mask = (~0u >> (32-DS_Max_Worker_Warp_Size_Log2)),
+  // The slot size that should be reserved for a working warp.
----------------
grokos wrote:
> sfantao wrote:
> > Hahnfeld wrote:
> > > sfantao wrote:
> > > > Hahnfeld wrote:
> > > > > sfantao wrote:
> > > > > > grokos wrote:
> > > > > > > Hahnfeld wrote:
> > > > > > > > First, this value needs a comment what it is used for and maybe we can drop the `Log2` is its a mast related to the value, not the logarithm.
> > > > > > > > 
> > > > > > > > From what I understand, this value has has the 5 LSBs set to 1 which should be equivalent to `DS_Max_Worker_Warp_Size - 1`? (this also avoid relying on 32 bit integers)
> > > > > > > I've changed its definition to the proposed `DS_Max_Worker_Warp_Size - 1` which is much clearer. This mask is used to get (threadID mod Warp_Size). I've added a comment.
> > > > > > Using `DS_Max_Worker_Warp_Size - 1` its fine here. This is mostly used to calculate the ID of a thread in its warp, i.e. `threadID() % warpSize` which is equivalent to ` threadID() & DS_Max_Worker_Warp_Size`. The compiler should be smart enough to convert one to the other if `warpSize` is a constant. 
> > > > > > 
> > > > > > I'd rather prefer do make little optimisations explicit in the code, but if one thinks this extra macro is too much, one could revisit all uses and see if `warpSize` is constant and use `threadID() % warpSize` throughout the code. I don't see any reason for making `warpSize` non-constant anyway. 
> > > > > Do you mean `threadID() % warpSize` is equivalent to `threadID() & (DS_Max_Worker_Warp_Size - 1)`? Because otherwise I disagree.
> > > > > 
> > > > > I agree that the compiler will convert the usage. IMO `threadID() % warpSize` is //a lot more// obvious than the bit operations the compiler might use.
> > > > > Do you mean threadID() % warpSize is equivalent to threadID() & (DS_Max_Worker_Warp_Size - 1) ?
> > > > 
> > > > Yes, that is what I mean. Of course, this assumes warp size is a power of 2.
> > > > 
> > > > > I agree that the compiler will convert the usage. IMO threadID() % warpSize is a lot more obvious than the bit operations the compiler might use.
> > > > 
> > > > I don't have a strong opinion here. I agree that `threadID() % warpSize` is easier to read and I think its fine we go with that. It's just that if we intend the implementation to be `threadID() & (DS_Max_Worker_Warp_Size - 1) ` I don't see why we shouldn't write that in the code.  
> > > 1. IMO source code is meant to be understood by humans who are normally much more familiar with mathematical operations (modulo) than bit operations - even programmers.
> > > 2. Because it assumes a power of 2. If the compiler can prove that - great, do the optimization.
> > > (If we happen to find performance implications, we can still change it and refactor it to a function and add the documentation why this is necessary, what it assumes and how it works)
> > Fine by me.
> So we'll go with the modulo solution.
I don't see this implemented?

================
Comment at: libomptarget/deviceRTLs/nvptx/src/option.h:20-43
+////////////////////////////////////////////////////////////////////////////////
+// following two defs must match absolute limit hardwired in the host RTL
+#define TEAMS_ABSOLUTE_LIMIT                                                   \
+  512 /* omptx limit (must match teamsAbsoluteLimit) */
+#define THREAD_ABSOLUTE_LIMIT                                                  \
+  1024 /* omptx limit (must match threadAbsoluteLimit) */
+
----------------
Only `MAX_THREADS_PER_TEAM` and `WARPSIZE` are used, the rest can go away.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/option.h:69-76
+// TODO: KMP defines kmp_int to be 32 or 64 bits depending on the target.
+// think we don't need it here (meaning we can be always 64 bit compatible)
+//
+// #ifdef KMP_I8
+//   typedef kmp_int64		kmp_int;
+// #else
+//   typedef kmp_int32		kmp_int;
----------------
Not needed, remove.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/parallel.cu:458
+
+// Do not do nothing: the host guarantees we started the requested number of
+// teams and we only need inspection gridDim
----------------
Typo: `Do nothing`. Start the new sentence with a capital letter and and with a full stop.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/reduction.cu:138-155
+//
+// runtime support for array reduction
+//
+
+#define ARRAYATOMIC_GENOP(_name, _dtype, _op)                                  \
+  EXTERN void __array_atomic_##_name##_##_op(                                  \
+      kmp_Indent *id_ref, int32_t gtid, _dtype *lhs, _dtype *rhs, int64_t n) { \
----------------
This isn't used anymore, please remove.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/reduction.cu:172-174
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
+INLINE
+int32_t nvptx_parallel_reduce_nowait(int32_t global_tid, int32_t num_vars,
----------------
1. This has a lot of commented code.
2. Please don't duplicate the function header, this will avoid mismatches in the future.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/reduction.cu:313-315
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
+INLINE
+int32_t nvptx_teams_reduce_nowait(
----------------
I think this header shouldn't be duplicated either.

Repository:
  rL LLVM

https://reviews.llvm.org/D14254