[Openmp-commits] [PATCH] D94745: [OpenMP][WIP] Build the deviceRTLs with OpenMP instead of target dependent language

Jon Chesterfield via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Tue Jan 19 14:45:20 PST 2021


JonChesterfield added inline comments.


================
Comment at: openmp/libomptarget/deviceRTLs/common/target_atomic.h:22
+// __clang_cuda_device_functions.h.
+template <typename T> T atomicAdd(T *, T);
+template <typename T> T atomicInc(T *, T);
----------------
tianshilei1992 wrote:
> JonChesterfield wrote:
> > Not keen. This is common/, shouldn't be calling functions from cuda.
> > 
> > How about we move target_atomic.h under nvptx, and implement __kmpc_atomic_add etc there?
> > 
> > The amdgpu target_atomic.h would be simpler if it moved under amdgpu, as presently it implements atomicAdd in terms of another function, and could elide that layer entirely if __kmpc_atomic_add called the intrinsic.
> > 
> > We could also implement these in terms of clang intrinsics (i.e., identically as done by amdgpu now), and the result would then work for both platforms.
> I'm okay with your proposal. I'd do it in another patch after this one works, because I want minimal changes in this patch to get everything working first, and then start to optimize. BTW, the template functions don't call CUDA here. They're just declarations, and will be lowered to type-specific variants like `fatomicAdd`, which are eventually defined in the different implementations.
For what it's worth, atomicInc is not a template for amdgcn (we only ever use it on a uint32_t, and a generic implementation in terms of CAS would be very slow). The implementation looks like:

```
// Generic CAS fallback used by the other atomic operations.
template <typename T> DEVICE T atomicCAS(T *address, T compare, T val) {
  (void)__atomic_compare_exchange(address, &compare, &val, false,
                                  __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
  return compare;
}

// atomicInc maps directly onto the amdgcn wrapping-increment intrinsic.
INLINE uint32_t atomicInc(uint32_t *address, uint32_t max) {
  return __builtin_amdgcn_atomic_inc32(address, max, __ATOMIC_SEQ_CST, "");
}
```

so declaring a template with the same name here will break amdgpu. That's not a disaster, as amdgpu doesn't build on trunk anyway, but it will cause me some minor headaches when this flows into ROCm.

If I find some time I'll write a patch moving the atomic functions around; it should be orthogonal to this change.
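
For reference, the "implement in terms of clang intrinsics" option mentioned above could look roughly like the following sketch. The `__kmpc_atomic_add` name is the one proposed earlier in this thread, not code that exists in the patch:

```
// Sketch only: lowers to the native atomic on both nvptx and amdgcn
// without needing any CUDA declarations. Covers the integer cases;
// floating-point (e.g. fatomicAdd) may need separate handling.
template <typename T> DEVICE T __kmpc_atomic_add(T *Address, T Val) {
  return __atomic_fetch_add(Address, Val, __ATOMIC_SEQ_CST);
}
```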


================
Comment at: openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt:100
+                 -fopenmp -Xclang -fopenmp-is-device
+                 -D__CUDACC__
                  -I${devicertl_base_directory}
----------------
tianshilei1992 wrote:
> JonChesterfield wrote:
> > This is suspect - why does openmp want to claim to be cuda?
> To keep the change minimal. There is an include wrapped in a macro in `interface.h`. For AMD GPUs, it includes a header from the AMD implementation, and for CUDA devices, it includes a header from the NVPTX implementation.
Ah, that's probably my fault. May as well leave it for now.

I think we should expose a macro for openmp that indicates whether we're offloading to nvptx, offloading to amdgpu, or just compiling for the host. Or, roughly equivalently, replace some of the `#if` blocks with `declare variant`.
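
A minimal sketch of the `declare variant` direction, assuming clang's `begin declare variant` extension; the `impl_smid` helper is hypothetical, just to show the shape:

```
#include <cstdint>

// Base version, picked when no device selector matches (e.g. on the host).
static inline uint32_t impl_smid() { return 0; }

// nvptx-specific version; __nvvm_read_ptx_sreg_smid is a clang builtin.
#pragma omp begin declare variant match(device = {arch(nvptx64)})
static inline uint32_t impl_smid() { return __nvvm_read_ptx_sreg_smid(); }
#pragma omp end declare variant
```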


================
Comment at: openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:21
+
+// FIXME: Forward declaration
+extern "C" {
----------------
tianshilei1992 wrote:
> JonChesterfield wrote:
> > shouldn't these be in the cuda header above, and also in the clang-injected cuda headers?
> All functions callable from CUDA are declared `__device__`, and we cannot call those from within a `declare target` region. We need declarations in a form OpenMP accepts, so the ones in `cuda.h` cannot be used. If it weren't for the CUDA version macros, we could drop the header.
I think the right answer to the cuda version macros is to compile this file in the deviceRTL twice, once for CUDA_VERSION < 9000 and once for CUDA_VERSION >= 9000. It seems reasonable to have a different implementation for the cuda API change. Clang knows which version it is compiling applications for, so it could pick the matching deviceRTL.bc.

That would let us totally decouple from cuda with some slightly ugly stuff like
`return __nvvm_shfl_down_i32(Var, Delta, ((WARPSIZE - Width) << 8) | 0x1f);`
as shown in https://reviews.llvm.org/D94731?vs=316809&id=316820#toc
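
Concretely, the per-version split could look something like this sketch (not the patch's code; `__kmpc_impl_shfl_down_sync` follows the deviceRTL naming, and both `__nvvm_shfl*` builtins are existing clang NVPTX intrinsics):

```
#if CUDA_VERSION >= 9000
INLINE int32_t __kmpc_impl_shfl_down_sync(uint32_t Mask, int32_t Var,
                                          uint32_t Delta, int32_t Width) {
  // CUDA 9+ requires the explicit-mask sync variant of the shuffle.
  return __nvvm_shfl_sync_down_i32(Mask, Var, Delta,
                                   ((WARPSIZE - Width) << 8) | 0x1f);
}
#else
INLINE int32_t __kmpc_impl_shfl_down_sync(uint32_t, int32_t Var,
                                          uint32_t Delta, int32_t Width) {
  // The pre-CUDA-9 shuffle has no mask operand.
  return __nvvm_shfl_down_i32(Var, Delta, ((WARPSIZE - Width) << 8) | 0x1f);
}
#endif
```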


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D94745/new/

https://reviews.llvm.org/D94745


