[Libclc-dev] [PATCH] R600: Use new barrier intrinsic

Jan Vesely via Libclc-dev libclc-dev at lists.llvm.org
Fri Jul 15 09:03:15 PDT 2016


On Thu, 2016-07-14 at 23:48 -0700, Matt Arsenault via Libclc-dev wrote:
> > 
> > On Jul 14, 2016, at 20:03, Aaron Watry <awatry at gmail.com> wrote:
> > 
> > Re-adding libclc to CC list.... oops.
> > 
> > On Thu, Jul 14, 2016 at 10:03 PM, Aaron Watry <awatry at gmail.com> wrote:
> > From 02c998f017107153c6dc78b8eb58192bfc338d80 Mon Sep 17 00:00:00
> > 2001
> > From: Matt Arsenault <matthew.arsenault at Amd.com>
> > Date: Wed, 13 Jul 2016 21:07:53 -0700
> > Subject: [PATCH] R600: Use new barrier intrinsic
> > 
> > ---
> >  r600/lib/synchronization/barrier_impl.ll | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> > 
> > diff --git a/r600/lib/synchronization/barrier_impl.ll
> > b/r600/lib/synchronization/barrier_impl.ll
> > index 825b2eb..9b8fefb 100644
> > --- a/r600/lib/synchronization/barrier_impl.ll
> > +++ b/r600/lib/synchronization/barrier_impl.ll
> > @@ -1,7 +1,6 @@
> >  declare i32 @__clc_clk_local_mem_fence() #1
> >  declare i32 @__clc_clk_global_mem_fence() #1
> > -declare void @llvm.AMDGPU.barrier.local() #0
> > -declare void @llvm.AMDGPU.barrier.global() #0
> > +declare void @llvm.r600.group.barrier() #0
> >  
> >  define void @barrier(i32 %flags) #2 {
> >  barrier_local_test:
> > @@ -11,7 +10,7 @@ barrier_local_test:
> >    br i1 %1, label %barrier_local, label %barrier_global_test
> >  
> >  barrier_local:
> > -  call void @llvm.AMDGPU.barrier.local()
> > +  call void @llvm.r600.group.barrier()
> > 
> > Please pardon my ignorance, but does the new intrinsic handle
> > synchronizing writes/reads to local/global memory along with thread
> > synchronization?
> > 
> > If that is the case, could we modify the barrier_local portion to
> > jump straight to done, so that we don't end up inserting two
> > barriers if someone calls
> > barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
> > 
> > I don't have my NI card installed at the moment, but I could install
> > it if necessary to test things out.
> > 
> > --Aaron
> > 
> 
> No, but the old intrinsics didn’t either. There is only one
> synchronization instruction; the memory fence is a different problem.
> I don’t think it actually matters with the constraint that global
> memory is only consistent within a workgroup, although I’m less sure
> how the old hardware caches worked.

EG/NI both reload/flush the L1 (VC and TC) caches at the beginning and
end of every fetch clause. We can make the barrier instruction wait for
previous CF instructions (bit 31 of CF_WORD0), but AFAIK it only waits
for the instructions to complete (i.e. I don't think it waits for the
flush to complete).

There are _ACK versions of the VC/TC fetch clauses, and a WAIT_ACK
instruction that acts as a fence, although I haven't found decisive
information on whether these wait for flushes to complete (or how they
differ from using the barrier bit).

I agree with the point about issuing two barrier instructions. IMO the
correct implementation of this function for r600 would be:

void barrier(cl_mem_fence_flags flags) {
  mem_fence(flags);
  __builtin_r600_barrier();
}
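
As an IR-level illustration of that shape, here is a rough, untested
sketch for barrier_impl.ll. It keeps the flag checks but funnels every
path into a single group barrier; the generic LLVM fence instruction is
only a stand-in for whatever mem_fence should actually lower to on r600:

; (uses the same declarations and attribute groups already present in
;  barrier_impl.ll)
define void @barrier(i32 %flags) #2 {
entry:
  %CLK_LOCAL_MEM_FENCE = call i32 @__clc_clk_local_mem_fence()
  %local.bits = and i32 %flags, %CLK_LOCAL_MEM_FENCE
  %local.set = icmp ne i32 %local.bits, 0
  br i1 %local.set, label %local_fence, label %global_test

local_fence:
  ; stand-in for the local mem_fence; the real lowering may differ
  fence acq_rel
  br label %global_test

global_test:
  %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
  %global.bits = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
  %global.set = icmp ne i32 %global.bits, 0
  br i1 %global.set, label %global_fence, label %do_barrier

global_fence:
  ; stand-in for the global mem_fence
  fence acq_rel
  br label %do_barrier

do_barrier:
  ; exactly one barrier instruction, regardless of which flags were set
  call void @llvm.r600.group.barrier()
  ret void
}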

The current implementation also fails to issue any barrier when
flags == 0, so reducing it to:

define void @barrier(i32 %flags) #2 {
  call void @llvm.r600.group.barrier()
  ret void
}

would be an improvement.
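
To make both issues concrete, the control flow the patch leaves in
place looks roughly like this (reconstructed from the diff context;
exact value names may differ), with the two problem spots marked:

; (declarations and attribute groups as in the existing file)
define void @barrier(i32 %flags) #2 {
barrier_local_test:
  %local_flag = call i32 @__clc_clk_local_mem_fence()
  %local_bits = and i32 %flags, %local_flag
  %local_set = icmp ne i32 %local_bits, 0
  br i1 %local_set, label %barrier_local, label %barrier_global_test

barrier_local:
  call void @llvm.r600.group.barrier()
  br label %barrier_global_test

barrier_global_test:
  %global_flag = call i32 @__clc_clk_global_mem_fence()
  %global_bits = and i32 %flags, %global_flag
  %global_set = icmp ne i32 %global_bits, 0
  br i1 %global_set, label %barrier_global, label %done

barrier_global:
  ; with both flags set, this is the second barrier in the same call
  call void @llvm.r600.group.barrier()
  br label %done

done:
  ; with flags == 0, execution arrives here without any barrier at all
  ret void
}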


jan
> 
> -Matt
> 
-- 
Jan Vesely <jan.vesely at rutgers.edu>