<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jul 15, 2016 at 11:03 AM, Jan Vesely <span dir="ltr"><<a href="mailto:jan.vesely@rutgers.edu" target="_blank">jan.vesely@rutgers.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Thu, 2016-07-14 at 23:48 -0700, Matt Arsenault via Libclc-dev wrote:<br>

> ><br>

> > On Jul 14, 2016, at 20:03, Aaron Watry <<a href="mailto:awatry@gmail.com">awatry@gmail.com</a>> wrote:<br>

> ><br>

> > Re-adding libclc to CC list.... oops.<br>

> ><br>

> > On Thu, Jul 14, 2016 at 10:03 PM, Aaron Watry <<a href="mailto:awatry@gmail.com">awatry@gmail.com</a><br>

</span><div><div class="h5">> > <mailto:<a href="mailto:awatry@gmail.com">awatry@gmail.com</a>>> wrote:<br>

> > From 02c998f017107153c6dc78b8eb58192bfc338d80 Mon Sep 17 00:00:00<br>

> > 2001<br>

> > From: Matt Arsenault <matthew.arsenault@Amd.com><br>

> > Date: Wed, 13 Jul 2016 21:07:53 -0700<br>

> > Subject: [PATCH] R600: Use new barrier intrinsic<br>

> ><br>

> > ---<br>

> >  r600/lib/synchronization/barrier_impl.ll | 7 +++----<br>

> >  1 file changed, 3 insertions(+), 4 deletions(-)<br>

> ><br>

> > diff --git a/r600/lib/synchronization/barrier_impl.ll<br>

> > b/r600/lib/synchronization/barrier_impl.ll<br>

> > index 825b2eb..9b8fefb 100644<br>

> > --- a/r600/lib/synchronization/barrier_impl.ll<br>

> > +++ b/r600/lib/synchronization/barrier_impl.ll<br>

> > @@ -1,7 +1,6 @@<br>

> >  declare i32 @__clc_clk_local_mem_fence() #1<br>

> >  declare i32 @__clc_clk_global_mem_fence() #1<br>

> > -declare void @llvm.AMDGPU.barrier.local() #0<br>

> > -declare void @llvm.AMDGPU.barrier.global() #0<br>

> > +declare void @llvm.r600.group.barrier() #0<br>

> >  <br>

> >  define void @barrier(i32 %flags) #2 {<br>

> >  barrier_local_test:<br>

> > @@ -11,7 +10,7 @@ barrier_local_test:<br>

> >    br i1 %1, label %barrier_local, label %barrier_global_test<br>

> >  <br>

> >  barrier_local:<br>

> > -  call void @llvm.AMDGPU.barrier.local()<br>

> > +  call void @llvm.r600.group.barrier()<br>

> ><br>

> > Please pardon my ignorance, but does the new intrinsic handle<br>

> > synchronizing writes/reads to local/global memory along with thread<br>

> > synchronization?<br>

> ><br>

> > If that is the case, could we modify the barrier_local portion to<br>

> > jump straight to done, so that we don't end up inserting two<br>

> > barriers if someone calls<br>

> > barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);<br>

> ><br>

> > I don't have my NI card installed at the moment, but could if<br>

> > necessary to test things out.<br>

> ><br>

> > --Aaron<br>

> ><br>

><br>

> No, but the old intrinsics didn’t either. There is only one<br>

> synchronization instruction, the memory fence is a different problem.<br>

> I don’t think it actually matters with the constraint that global<br>

> memory is only consistent within a workgroup, although I”m less sure<br>

> how the old hardware caches worked<br>

<br>

</div></div>EG/NI both reload/flush L1 (VC and TC) caches at the beginning and end<br>

of every fetch clause. We can make the barrier instruction wait for<br>

previous CF instructions (bit 31 CF_WORD0), but afaik it only waits for<br>

instruction to complete (i.e I don't think it waits for the flush<br>

complete).<br>

<br>

There are _ACK versions of VC/TC fetch clauses, and WAIT_ACK<br>

instruction as fence, although I haven't found decisive info on whether<br>

these do wait for flushes to complete (or how they differ from using<br>

barrier bit).<br>

<br>

I agree with the point about issuing two barrier instructions. IMO the<br>

correct implementation of this functions for r600 would be:<br>

<br>

void barrier(cl_mem_fence_flags flags) {<br>

  mem_fence(flags);<br>

  __builtin_r600_barrier();<br>

}<br>

<br>

the current implementation also fails to issue any barrier if flags ==<br>

0, so reducing it to:<br>

<span class=""><br>

define void @barrier(i32 %flags) #2 {<br>

</span>  call void @llvm.r600.group.barrier()<br>

  ret void<br>

}<br>

<br>

would be an improvement<br></blockquote><div><br></div><div>Yeah, agreed.  I've written code in the past that attempted to do barrier(0) to synchronize threads in a group without requiring synchronization of local/global memory...  I'm not saying it was good code, but the barrier was required in that case.  <br><br></div><div>I'd be fine with changing to the implementation you suggested.  I know it's not the first time we've discussed fixing our barrier implementation.  Hopefully we actually finish it this time :)<br><br></div><div>--Aaron<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

<br>

jan<br>

><br>

> -Matt<br>

<div class="HOEnZb"><div class="h5">><br>

> _______________________________________________<br>

> Libclc-dev mailing list<br>

> <a href="mailto:Libclc-dev@lists.llvm.org">Libclc-dev@lists.llvm.org</a><br>

> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/libclc-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/libclc-dev</a><br>

</div></div><span class="HOEnZb"><font color="#888888">--<br>

Jan Vesely <<a href="mailto:jan.vesely@rutgers.edu">jan.vesely@rutgers.edu</a>></font></span></blockquote></div><br></div></div>