[PATCH] D101976: [OpenMP] Unified entry point for SPMD & generic kernels in the device RTL

Thu May 6 18:45:34 PDT 2021

JonChesterfield added a comment.

In D101976#2742919 <https://reviews.llvm.org/D101976#2742919>, @jdoerfert wrote:

> So what do you wnat me to change for this patch now?

Equivalent change to amdgpu target_impl to the nvptx target_impl, which looks like syncthreads should call a new barrier.

Iiuc this has run successfully even without that change so hopefully that's sufficient that we won't regress on amdgpu. I'd like to get miniqmc running locally to verify as well, but we may not be able to wait for that.

================
Comment at: openmp/libomptarget/deviceRTLs/common/src/omptarget.cu:195
+
+EXTERN __attribute__((weak))
+int32_t __kmpc_target_init(ident_t *Ident, bool IsSPMD,
----------------
why are these weak?

================
Comment at: openmp/libomptarget/deviceRTLs/nvptx/src/target_impl.cu:65-68
+  asm volatile("barrier.sync %0;"
+               :
+               : "r"(barrier)
+               : "memory");
----------------
jdoerfert wrote:
> ABataev wrote:
> > Why not `__syncthreads`? It is safer to use `__syncthreads` as it is `convergent`. Would be good to mark this code somehow as `convergent` too to avoid incorrect optimizations
> The problem is that syncthreads is basically a `bar.sync` which is a `barrier.sync.aligned`, if I understood everything properly. This worked so far because the "main thread" (lane 0, last warp) was alone in it's warp and all other threads have been terminated. Now, we simplify the control flow (and later get rid of the last warp) such that the threads of the last warp and the main thread will hit different barriers. The former hit the one in the state machine while the latter will be in `parallel_51`. The `.aligned` version doesn't allow that. Does that make sense?
> 
> I'm not concerned about convergent though, we solved that wholesale: We mark all functions that clang compiles for the GPU via openmp-target as convergent (IIRC). The entire device runtime is certainly convergent.
amdgcn presumably needs the same change. Add a barrier and call it from _kmpc_impl_syncthreads.

I think barrier.sync defaults to all threads when the second argument is omitted, so we can use the corresponding kmpc call to get the num_threads argument for it.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D101976/new/

https://reviews.llvm.org/D101976