[Openmp-commits] [PATCH] D14254: [OpenMP] Initial implementation of OpenMP offloading library - libomptarget device RTLs.

George Rokos via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Thu Dec 7 13:51:05 PST 2017


grokos added inline comments.


================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:63
+
+  # Get all the compute capabilities the user requested or use SM_35 by default.
+  if(LIBOMPTARGET_NVPTX_COMPUTE_CAPABILITY)
----------------
Hahnfeld wrote:
> grokos wrote:
> > Hahnfeld wrote:
> > > Clang defaults to `sm_30`, should we be compatible here?
> > OK, I changed the default to `sm_30` and added a comment that this is what clang uses by default as well.
> > 
> > However, the default may change to `sm_35` due to the implementation of `__kmpc_reduce_conditional_lastprivate` which involves a call to `atomicMax` on 64-bit integers, which in turn requires a minimum of `sm_35`. Let's keep this issue open until we decide how to proceed.
> As I've said offline, we then have to change the default **now**. I suppose it won't compile otherwise?
Correct, `sm_35` is the minimum required. I've set the default to 35 and prepared a clang patch to do the same in the compiler.
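
For context, a minimal illustration of the constraint (the helper name below is hypothetical): the 64-bit overload of `atomicMax` used by `__kmpc_reduce_conditional_lastprivate` only exists on `sm_35` and newer.

```
// Compiles only with sm_35 or higher: the 64-bit atomicMax overload is not
// available on earlier compute capabilities.
__device__ void updateLastIter(unsigned long long *LastIter,
                               unsigned long long Iter) {
  atomicMax(LastIter, Iter);  // 64-bit atomic maximum
}
```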


================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:151-158
+      # Decide which ptx version to use. We use 5.0 for CUDA 8 or 4.2 for older versions
+      if(CUDA_VERSION_MAJOR LESS 8)
+	set(CUDA_PTX_VERSION ptx42)
+      elseif(CUDA_VERSION_MAJOR LESS 9)
+	set(CUDA_PTX_VERSION ptx50)
+      else()
+        set(CUDA_PTX_VERSION ptx60)
----------------
Hahnfeld wrote:
> Why do we need a newer ptx version, what's wrong with the default? And anyway, we are building LLVM bitcode, that shouldn't involve ptx at all, right?
Arpith claims that the nvptx runtime uses intrinsics that are only available with certain PTX versions (e.g. Volta SMs require CUDA 9 and PTX 6.0). Also, the logic here has been changed to match clang's choice of PTX version.


================
Comment at: libomptarget/deviceRTLs/nvptx/CMakeLists.txt:184-186
+        foreach(sm ${nvptx_sm_list})
+          set(CUDA_ARCH ${CUDA_ARCH} --cuda-gpu-arch=sm_${sm})
+        endforeach()
----------------
Hahnfeld wrote:
> Do multiple GPU architectures work? I remember that Clang complained about a single output file for multiple architectures...
Indeed, multiple GPU architectures do not work with the current compiler. I've removed that support.


================
Comment at: libomptarget/deviceRTLs/nvptx/docs/ReductionDesign.txt:2
+
+**Design document for OpenMP reductions on the GPU** 
+
----------------
Hahnfeld wrote:
> Do we want to have this in here or should this be bundled with the Clang documentation as the compiler has to generate most of this? @arpith-jacob 
Let's keep it here since the design described is the specialization for the nvptx device.  The main algorithms are indeed implemented in the runtime.


================
Comment at: libomptarget/deviceRTLs/nvptx/src/data_sharing.cu:40
+  unsigned long long Mask = __BALLOT_SYNC(0xFFFFFFFF, true);
+  unsigned long long ShNum = 32 - (getThreadId() & DS_Max_Worker_Warp_Size_Log2_Mask);
+  unsigned long long Sh = Mask << ShNum;
----------------
sfantao wrote:
> Hahnfeld wrote:
> > sfantao wrote:
> > > Hahnfeld wrote:
> > > > sfantao wrote:
> > > > > Hahnfeld wrote:
> > > > > > `DS_Max_Worker_Warp_Size` instead of its hard-coded value? Or is 32 the length of an integer? (Another reason to avoid magic numbers!)
> > > > > Here 32 is both the GPU warp size and the width of the integer __popc() takes; I agree we should get an enumerator to indicate that. What is tricky here is that the function assumes the warp size equals the width of a 32-bit integer. This assumption is made in CUDA, so I think it is fine to make it here as well.
> > > > > 
> > > > > Actually, here I think the name of the function `getWarpMasterActiveThreadId` is misleading. This function returns the number of active threads in the current warp whose ID in the warp is lower than the current one. If there are no active threads with a lower ID, then the current thread is the warp master. In other words, this function returns zero if the current thread is the warp master.
> > > > > 
> > > > > This function seems to only be used in `IsWarpMasterActiveThread`, so I suggest removing the current implementation of `IsWarpMasterActiveThread`, renaming this function to `IsWarpMasterActiveThread`, and having it return:
> > > > > ```
> > > > > return __popc(Sh) == 0u;
> > > > > ```
> > > > I'm not sure I get your arguments: Let's start with the function `__popc()`, which returns the number of bits set to 1 in its argument. This just happens to be a 32-bit integer, which has nothing to do with the warp size as far as I can see.
> > > > 
> > > > If this function really does what you described, it deserves documentation, or the code needs to be rewritten so that others can understand it. This might be related to what `__BALLOT_SYNC(0xFFFFFFFF, true)` does, as I've already criticized below.
> > > > 
> > > > At the end, I don't understand what your proposal is: The function name you mention is still the same, and the return statement basically inverts the value returned by `__popc()`
> > > > I'm not sure I get your arguments: Let's start with the function __popc(), which returns the number of bits set to 1 in its argument. This just happens to be a 32-bit integer, which has nothing to do with the warp size as far as I can see.
> > > > 
> > > > If this function really does what you described, it deserves documentation, or the code needs to be rewritten so that others can understand it. This might be related to what __BALLOT_SYNC(0xFFFFFFFF, true) does, as I've already criticized below.
> > > 
> > > The function is documented in http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__INT.html#group__CUDA__MATH__INTRINSIC__INT_1g43c9c7d2b9ebf202ff1ef5769989be46. Using 32-bit integers to do mask operations has everything to do with the warp size in the sense that, if we had larger warp sizes, we would have to change the literals (including the ones used with `__ballot`) and use different variants or a combination of multiple `__popc()` calls.
> > > 
> > > 
> > > > At the end, I don't understand what your proposal is: The function name you mention is still the same, and the return statement basically inverts the value returned by __popc()
> > > 
> > > All I am suggesting is to remove the current implementation of `IsWarpMasterActiveThread` and rename `getWarpMasterActiveThreadId` to `IsWarpMasterActiveThread`. If `__popc` returns zero, the current thread is the master thread, so comparing the result with zero gives the correct answer. This is what `IsWarpMasterActiveThread` does right now.
> > > 
> > > I agree we can improve the comments to document all this.
> > Btw: If we just compare `__popc(x) == 0` then we know `x == 0` because x has zero bits set to 1. So no point in doing `__popc()` at all?
> > Btw: If we just compare __popc(x) == 0 then we know x == 0 because x has zero bits set to 1. So no point in doing __popc() at all?
> 
> True, given that we are using that only to compare against zero, evaluating `x == 0` is sufficient.
OK, so I've replaced the magic number 32 with the enumerator `DS_Max_Worker_Warp_Size` (which is equal to 32, the warp size), removed `IsWarpMasterActiveThread` and renamed `getWarpMasterActiveThreadId` to `IsWarpMasterActiveThread`. Also, I'm just returning `Sh == 0`, no need for `__popc()` at all.
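
For reference, a minimal sketch of the resulting helper (using the names from this thread; the actual patch may differ slightly). Note that `Sh` is truncated to its low 32 bits so that the current lane and higher lanes do not affect the result:

```
// The current thread is the warp master iff no active thread in the warp
// has a lower lane ID.
__device__ static bool IsWarpMasterActiveThread() {
  unsigned long long Mask = getActiveThreadsMask();  // ballot of active lanes
  unsigned long long ShNum =
      DS_Max_Worker_Warp_Size - (getThreadId() % DS_Max_Worker_Warp_Size);
  unsigned long long Sh = Mask << ShNum;
  // The low 32 bits of Sh now hold exactly the active lanes below this one.
  return (unsigned)Sh == 0;
}
```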



================
Comment at: libomptarget/deviceRTLs/nvptx/src/data_sharing.cu:132
+  unsigned WID = getWarpId();
+  unsigned CurActiveThreads = __BALLOT_SYNC(0xFFFFFFFF, true);
+
----------------
Hahnfeld wrote:
> grokos wrote:
> > Hahnfeld wrote:
> > > Can we have this in a function with a corresponding name?
> > It's a single line which calls a built-in. Too simple for a dedicated function. Anyway, what do you propose?
> There are other single-line functions, and it is not obvious to me how `__BALLOT_SYNC(0xFFFFFFFF, true)` returns `CurActiveThreads` - btw. is this a number or a mask?
I added a short function with a description of what it does:
```
// Find the active threads in the warp - return a mask whose n-th bit is set if
// the n-th thread in the warp is active.
__device__ static unsigned getActiveThreadsMask() {
  return __BALLOT_SYNC(0xFFFFFFFF, true);
}
```

What `__ballot_sync` does can be found [[ http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-vote-functions | here ]]. For ALL threads in the warp (`mask = 0xFFFFFFFF`) we evaluate the predicate (`true`). The result is a mask whose n-th bit is set if the n-th thread of the warp is active.



================
Comment at: libomptarget/deviceRTLs/nvptx/src/libcall.cu:17
+
+#define TICK ((double) 1.0 / 745000000.0)
+
----------------
Hahnfeld wrote:
> grokos wrote:
> > Hahnfeld wrote:
> > > grokos wrote:
> > > > This is where the hard-coded definition of the GPU clock frequency has been moved. I'll try to find a viable solution to make the library find the clock frequency dynamically.
> > > Yes, this doesn't sound like a good idea to have that hard-coded...
> > The clock frequency cannot be obtained from device code; we can only query it on the host.
> > 
> > I tried having a device global variable TICK and setting it via a call to `cuModuleGetGlobal(..., "TICK")` from the CUDA plugin (the plugin can query the frequency via `cudaGetDeviceProperties`). This solution did not work because `libomptarget-nvptx.a` is a static library, so the clock frequency would have to be set at compile time. We cannot use dynamic libraries (yet?) because the CUDA toolchain does not support dynamic linking.
> > 
> > Eventually I implemented `omp_get_wtime()` using the `%globaltimer` register. That's the only viable option. If the register gets removed in the future, there's nothing we can do.
> > 
> > `omp_get_wtick()` is probably not needed. No one will ever query the time between clock ticks from within device code... I left the function there so that the linker can find it, but it prints a message that this functionality is not implemented.
> > 
> > I leave this issue open for further discussion anyway.
> That's not a justification, I really doubt that anyone will call `omp_get_wtime()` either.
> 
> Why not just return 1 nanosecond (the resolution of `%globaltimer`) for `omp_get_wtick` as I proposed?
OK, I got your point. It's better than having nothing. I've changed the code to return 1ns.
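
For reference, a minimal sketch of the approach (the actual patch may differ): `%globaltimer` is a nanosecond-resolution PTX special register, so `omp_get_wtime()` reads it via inline assembly and `omp_get_wtick()` returns its 1 ns resolution.

```
// Read the nanosecond-resolution %globaltimer PTX register.
__device__ double omp_get_wtime(void) {
  unsigned long long nsecs;
  asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(nsecs));
  return (double)nsecs * 1E-9;  // nanoseconds -> seconds
}

// Resolution of %globaltimer is one nanosecond.
__device__ double omp_get_wtick(void) {
  return 1E-9;
}
```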


================
Comment at: libomptarget/deviceRTLs/nvptx/src/libcall.cu:401
+
+//for xlf Fotran
+//Fotran, the return is LOGICAL type
----------------
Hahnfeld wrote:
> grokos wrote:
> > Hahnfeld wrote:
> > > Can you check that these are still used? Especially the `debug` functions?
> > I checked with the Fortran people and those functions are still used, so we'll keep them.
> I'm especially concerned about the  `debug` functions as we removed most of them
Fortran people insist that all the `_xlf_*` functions below are still needed. If they ever become obsolete we can remove them then.


================
Comment at: libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h:74
+  DS_Max_Worker_Warp_Size_Log2 = 5,
+  DS_Max_Worker_Warp_Size_Log2_Mask = (~0u >> (32-DS_Max_Worker_Warp_Size_Log2)),
+  // The slot size that should be reserved for a working warp.
----------------
sfantao wrote:
> Hahnfeld wrote:
> > sfantao wrote:
> > > Hahnfeld wrote:
> > > > sfantao wrote:
> > > > > grokos wrote:
> > > > > > Hahnfeld wrote:
> > > > > > > First, this value needs a comment explaining what it is used for, and maybe we can drop the `Log2` as it's a mask related to the value, not the logarithm.
> > > > > > > 
> > > > > > > From what I understand, this value has the 5 LSBs set to 1, which should be equivalent to `DS_Max_Worker_Warp_Size - 1`? (This also avoids relying on 32-bit integers.)
> > > > > > I've changed its definition to the proposed `DS_Max_Worker_Warp_Size - 1` which is much clearer. This mask is used to get (threadID mod Warp_Size). I've added a comment.
> > > > > Using `DS_Max_Worker_Warp_Size - 1` is fine here. This is mostly used to calculate the ID of a thread in its warp, i.e. `threadID() % warpSize`, which is equivalent to `threadID() & DS_Max_Worker_Warp_Size`. The compiler should be smart enough to convert one to the other if `warpSize` is a constant.
> > > > > 
> > > > > I'd rather prefer to make little optimisations explicit in the code, but if one thinks this extra macro is too much, one could revisit all uses, check that `warpSize` is constant, and use `threadID() % warpSize` throughout the code. I don't see any reason for making `warpSize` non-constant anyway.
> > > > Do you mean `threadID() % warpSize` is equivalent to `threadID() & (DS_Max_Worker_Warp_Size - 1)`? Because otherwise I disagree.
> > > > 
> > > > I agree that the compiler will convert the usage. IMO `threadID() % warpSize` is //a lot more// obvious than the bit operations the compiler might use.
> > > > Do you mean threadID() % warpSize is equivalent to threadID() & (DS_Max_Worker_Warp_Size - 1) ?
> > > 
> > > Yes, that is what I mean. Of course, this assumes warp size is a power of 2.
> > > 
> > > > I agree that the compiler will convert the usage. IMO threadID() % warpSize is a lot more obvious than the bit operations the compiler might use.
> > > 
> > > I don't have a strong opinion here. I agree that `threadID() % warpSize` is easier to read and I think it's fine to go with that. It's just that if we intend the implementation to be `threadID() & (DS_Max_Worker_Warp_Size - 1)`, I don't see why we shouldn't write that in the code.
> > 1. IMO source code is meant to be understood by humans who are normally much more familiar with mathematical operations (modulo) than bit operations - even programmers.
> > 2. Because it assumes a power of 2. If the compiler can prove that - great, do the optimization.
> > (If we happen to find performance implications, we can still change it, refactor it into a function, and add documentation explaining why this is necessary, what it assumes and how it works.)
> Fine by me.
So we'll go with the modulo solution.
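
For illustration (using names from this thread and assuming the warp size is a power of two), the two equivalent forms the discussion settled between:

```
enum { DS_Max_Worker_Warp_Size = 32 };  // warp size, a power of 2

// Readable form chosen here; the compiler lowers the modulo to a bitwise
// AND because the divisor is a power-of-2 constant.
__device__ static unsigned getLaneId() {
  return threadIdx.x % DS_Max_Worker_Warp_Size;
}

// Equivalent explicit-mask form discussed above:
//   threadIdx.x & (DS_Max_Worker_Warp_Size - 1)
```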


================
Comment at: libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h:100
+};
+extern __device__ __shared__ DataSharingStateTy DataSharingState;
+
----------------
arpith-jacob wrote:
> Hahnfeld wrote:
> > arpith-jacob wrote:
> > > grokos wrote:
> > > > Hahnfeld wrote:
> > > > > This won't compile with current Clang 5.0.0:
> > > > > ```
> > > > > error: __shared__ variable 'DataSharingState' cannot be 'extern'
> > > > > ```
> > > > This looks like a clang problem. Shared variables can be extern. This code can be compiled with clang 4.0, maybe we should submit a bug report for clang 5.0?
> > > Clang doesn't support this attribute and it requires additional investigation.  For now, I would disable building the bclib version of libomptarget-nvptx so that the runtime library is built using nvcc.
> > This was explicitly introduced in https://reviews.llvm.org/D25125. The documentation suggests that the error is correct:
> > ```
> > __device__, __shared__, and __constant__ variables cannot be defined as external using the extern keyword. The only exception is for dynamically allocated __shared__ variables as described in __shared__.
> > ```
> > (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#qualifiers)
> That restriction only applies in whole program compilation mode, not separate compilation mode. We need to teach Clang/LLVM to support separate compilation mode.
> 
> ```
> When compiling in the whole program compilation mode (see the nvcc user manual for a description of this mode), __device__, __shared__, and __constant__ variables cannot be defined as external using the extern keyword. The only exception is for dynamically allocated __shared__ variables as described in __shared__.
> 
> When compiling in the separate compilation mode (see the nvcc user manual for a description of this mode), __device__, __shared__, and __constant__ variables can be defined as external using the extern keyword. nvlink will generate an error when it cannot find a definition for an external variable (unless it is a dynamically allocated __shared__ variable).
> ```
> 
> 
OK, for the time being the default is set to disabled so as not to create any problems while building the runtime.
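
For reference, a sketch of the pattern under discussion once separate compilation is supported (file names are illustrative; with nvcc this corresponds to compiling with `-rdc=true`):

```
// omptarget-nvptx.h -- declaration visible to every translation unit.
extern __device__ __shared__ DataSharingStateTy DataSharingState;

// omptarget-nvptx.cu -- exactly one translation unit defines the variable.
__device__ __shared__ DataSharingStateTy DataSharingState;
```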


Repository:
  rOMP OpenMP

https://reviews.llvm.org/D14254




