[Openmp-commits] [PATCH] D154172: [OpenMP] Added memory scope to atomic::inc API and used the device scope in reduction.
Dhruva Chakrabarti via Phabricator via Openmp-commits
openmp-commits at lists.llvm.org
Fri Jun 30 11:31:06 PDT 2023
dhruvachak marked 5 inline comments as done.
dhruvachak added inline comments.
================
Comment at: openmp/libomptarget/DeviceRTL/src/Reduction.cpp:226
+#ifndef __AMDGCN__
fence::system(atomic::seq_cst);
----------------
dhruvachak wrote:
> jdoerfert wrote:
> > dhruvachak wrote:
> > > arsenm wrote:
> > > > I doubt this is target specific
> > > Well, I don't know whether it is required for NVPTX for example, so I kept it around.
> > >
> > > As an aside, I think the system fence is too conservative, but again I kept it unchanged for archs other than amdgpu.
> > > As an aside, I think the system fence is too conservative, but again I kept it unchanged for archs other than amdgpu.
> >
> > 1) This is unrelated.
> > 2) Could you elaborate on why you think the change is sound (on AMDGPUs)?
> 1. Agreed this is unrelated. I can separate this out if you like.
> 2. With reference to https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-memory-model-gfx90a (gfx90a as an example), the guide points out the required instructions. And that's what we get with this patch for the atomic inc:
>
> s_waitcnt vmcnt(0) lgkmcnt(0)
> global_atomic_inc v0, v1, v0, s[2:3] glc
> s_waitcnt vmcnt(0)
> buffer_wbinvl1_vol
>
> If we bring back the system fence, here's the resulting sequence (fence followed by the atomic inc):
>
> buffer_wbl2
> s_waitcnt vmcnt(0) lgkmcnt(0)
> buffer_invl2
> buffer_wbinvl1_vol
> s_waitcnt vmcnt(0) lgkmcnt(0)
> global_atomic_inc v0, v1, v0, s[2:3] glc
> s_waitcnt vmcnt(0)
> buffer_wbinvl1_vol
>
> According to the guide, the L2 instructions are required for coherence between different agents. For the reduction use case, we need coherence within a GPU which is provided by buffer_wbinvl1_vol following the atomic inc. So we can drop the L2 instructions here. The duplicate waitcnt and the buffer_wbinvl1_vol can be dropped too.
>
> In other words, as long as the ordering and scope are correct on the atomic operation, we should not need an additional fence. Perhaps it was required before D137524.
I removed this unrelated change, will post a separate patch for it.
================
Comment at: openmp/libomptarget/DeviceRTL/src/Synchronization.cpp:213-226
switch (Ordering) {
default:
__builtin_unreachable();
case atomic::relaxed:
- return __builtin_amdgcn_atomic_inc32(A, V, atomic::relaxed, "");
+ return atomicIncRelaxed(A, V, MemScope);
case atomic::aquire:
+ return atomicIncAquire(A, V, MemScope);
----------------
jdoerfert wrote:
>
>
>
Used the suggested macro. Thanks @jdoerfert
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D154172/new/
https://reviews.llvm.org/D154172
More information about the Openmp-commits
mailing list