[Openmp-commits] [PATCH] D154172: [OpenMP] Added memory scope to atomic::inc API and used the device scope in reduction.

Fri Jun 30 11:31:06 PDT 2023

dhruvachak marked 5 inline comments as done.
dhruvachak added inline comments.

================
Comment at: openmp/libomptarget/DeviceRTL/src/Reduction.cpp:226

+#ifndef __AMDGCN__
     fence::system(atomic::seq_cst);
----------------
dhruvachak wrote:
> jdoerfert wrote:
> > dhruvachak wrote:
> > > arsenm wrote:
> > > > I doubt this is target specific 
> > > Well, I don't know whether it is required for NVPTX for example, so I kept it around.
> > > 
> > > As an aside, I think the system fence is too conservative, but again I kept it unchanged for archs other than amdgpu.
> > > As an aside, I think the system fence is too conservative, but again I kept it unchanged for archs other than amdgpu.
> > 
> > 1) This is unrelated.
> > 2) Could you elaborate on why you think the change is sound (on AMDGPUs)?
> 1. Agreed this is unrelated. I can separate this out if you like.
> 2. With reference to https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-memory-model-gfx90a (gfx90a as an example), the guide points out the required instructions. And that's what we get with this patch for the atomic inc:
> 
> 	s_waitcnt vmcnt(0) lgkmcnt(0)
> 	global_atomic_inc v0, v1, v0, s[2:3] glc
> 	s_waitcnt vmcnt(0)
> 	buffer_wbinvl1_vol
> 
> If we bring back the system fence, here's the resulting sequence (fence followed by the atomic inc):
> 
> 	buffer_wbl2
> 	s_waitcnt vmcnt(0) lgkmcnt(0)
> 	buffer_invl2
> 	buffer_wbinvl1_vol
> 	s_waitcnt vmcnt(0) lgkmcnt(0)
> 	global_atomic_inc v0, v1, v0, s[2:3] glc
> 	s_waitcnt vmcnt(0)
> 	buffer_wbinvl1_vol
> 
> According to the guide, the L2 instructions are required for coherence between different agents. For the reduction use case, we need coherence within a GPU which is provided by buffer_wbinvl1_vol following the atomic inc. So we can drop the L2 instructions here. The duplicate waitcnt and the buffer_wbinvl1_vol can be dropped too.
> 
> In other words, as long as the ordering and scope are correct on the atomic operation, we should not need an additional fence. Perhaps it was required before D137524.
I removed this unrelated change, will post a separate patch for it.

================
Comment at: openmp/libomptarget/DeviceRTL/src/Synchronization.cpp:213-226
   switch (Ordering) {
   default:
     __builtin_unreachable();
   case atomic::relaxed:
-    return __builtin_amdgcn_atomic_inc32(A, V, atomic::relaxed, "");
+    return atomicIncRelaxed(A, V, MemScope);
   case atomic::aquire:
+    return atomicIncAquire(A, V, MemScope);
----------------
jdoerfert wrote:
> 
> 
> 
Used the suggested macro. Thanks @jdoerfert 

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154172/new/

https://reviews.llvm.org/D154172