[Openmp-commits] [PATCH] D64217: [OpenMP][NFCI] Cleanup the target state queue implementation

Tue Aug 6 10:32:21 PDT 2019

JonChesterfield added inline comments.

================
Comment at: openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.h:19
+/// value of the pointee.
+template <typename T> T __kmpc_impl_atomic_add(T *Ptr, T Val) {
+  return atomicAdd(Ptr, Val);
----------------
grokos wrote:
> ABataev wrote:
> > JonChesterfield wrote:
> > > jdoerfert wrote:
> > > > ABataev wrote:
> > > > > Why do we need these functions? IMHO, they do not add anything useful, just increase code complexity.
> > > > They add a level of abstraction around the CUDA implementation, here the CUDA functions `atomicAdd` and `atomicExch`. In the AMD `target_impl.h` the `__kmpc_impl_atomic_exchange` would maybe be implemented with an explicit compare-exchange loop (I'm guessing here).
> > > As it happens, atomicExch works fine for amdgcn. I'm not sure all the atomics are supported.
> > > 
> > > Closely related though, atomicMax doesn't appear to be available for  __CUDA_ARCH__ < 350, so on nvptx there's a function in loop.cu (__kmpc_reduce_conditional_lastprivate) which open codes a max in terms of atomicCAS. 
> > > 
> > > Given some variation in capability between cuda revisions, a compatibility shim that implements at least atomicMax seems a good idea. I'm not sure writing a compatibility layer for all of the possible atomics is within scope - such a thing may already exist elsewhere.
> > It works for AMD but may not work for some other platforms. Better to make it target-dependent anyway.
> > Closely related though, atomicMax doesn't appear to be available for CUDA_ARCH < 350, so on nvptx there's a function in loop.cu (__kmpc_reduce_conditional_lastprivate) which open codes a max in terms of atomicCAS.
> 
> Just a correction, 64-bit `atomicMax` is not available for CUDA_ARCH < 350, the 32-bit version is. There has been an attempt to emulate the 64-bit `atomicMax` here: https://reviews.llvm.org/D46185 (although that patch was wrong and the author hasn't responded in well over a year).
Ah, nice. Thanks for the detail. Curiously there doesn't appear to be a signed 64 bit atomicMax in cuda (according to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)

Rocm's tree contains that (revised) patch to nvptx, presumably for whatever sm_30 device we have under test. I was under the impression that our nvptx tree was identical to upstream but actually there's some divergence for me to patch up.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D64217/new/

https://reviews.llvm.org/D64217