[Libclc-dev] [PATCH 1/1] r600: Add fence implementation, rework barrier

Wed Apr 30 11:54:12 PDT 2014

On 04/30/2014 11:52 AM, Matt Arsenault wrote:
> On 04/30/2014 09:00 AM, Jan Vesely wrote:
>> On Tue, 2014-04-29 at 20:50 -0500, Aaron Watry wrote:
>>> On Sun, Apr 27, 2014 at 1:26 PM, Jan Vesely <jan.vesely at rutgers.edu> 
>>> wrote:
>>>> Hw flushes caches after each mem clause (Ch 8.2 EG manual),
>>>> so we need to pass the information that indicates requested
>>>> clause boundaries.
>>>> Use existing llvm fence instruction for that.
>>>>
>>>> Signed-off-by: Jan Vesely <jan.vesely at rutgers.edu>
>>>> ---
>>>>
>>>> I'm not sure whether using 'acquire' ordering for reads and 'release'
>>>> for writes is the correct thing with respect to OpenCL specs.
>>>> It seems to get read-write-relation correct, but I', not sure about
>>>> read ordering, or write ordering. Making everything use the 
>>>> strongest mem_fence
>>>> might be safer.
>>>>
>>>> This patch changes global-memory.cl piglit crash from:
>>>> i32 = GlobalAddress<void ()* @llvm.AMDGPU.barrier.global> 0 [ORD=11]
>>>> Undefined function
>>>> to:
>>>> LLVM ERROR: Cannot select: 0x24b9f58: i32,ch = load 0x2090108, 
>>>> 0x24b9718, 
>>>> 0x24b9c40<LD1[@barrier_test_local_int.test(addrspace=3)], zext from 
>>>> i1> [ORD=4] [ID=13]
>>>>    0x24b9718: i32 = Constant<0> [ID=7]
>>>>    0x24b9c40: i32 = undef [ID=2]
>>>> In function: barrier_test_local_int
>>>>
>>>> Without the fences the piglit just fails.
>>>>
>>>>
>>>>   r600/lib/SOURCES                         |  1 +
>>>>   r600/lib/synchronization/barrier_impl.ll | 25 ++----------------
>>>>   r600/lib/synchronization/fence_impl.ll   | 44 
>>>> ++++++++++++++++++++++++++++++++
>>>>   3 files changed, 47 insertions(+), 23 deletions(-)
>>>>   create mode 100644 r600/lib/synchronization/fence_impl.ll
>>>>
>>>> diff --git a/r600/lib/SOURCES b/r600/lib/SOURCES
>>>> index d9fc897..d93ec0f 100644
>>>> --- a/r600/lib/SOURCES
>>>> +++ b/r600/lib/SOURCES
>>>> @@ -7,4 +7,5 @@ workitem/get_local_id.ll
>>>>   workitem/get_global_size.ll
>>>>   synchronization/barrier.cl
>>>>   synchronization/barrier_impl.ll
>>>> +synchronization/fence_impl.ll
>>>>   shared/vload.cl
>>>> diff --git a/r600/lib/synchronization/barrier_impl.ll 
>>>> b/r600/lib/synchronization/barrier_impl.ll
>>>> index 3d8ee66..2f2d865 100644
>>>> --- a/r600/lib/synchronization/barrier_impl.ll
>>>> +++ b/r600/lib/synchronization/barrier_impl.ll
>>>> @@ -1,29 +1,8 @@
>>>> -declare i32 @__clc_clk_local_mem_fence() nounwind alwaysinline
>>>> -declare i32 @__clc_clk_global_mem_fence() nounwind alwaysinline
>>>> +declare void @mem_fence(i32 %flags) nounwind alwaysinline
>>>>   declare void @llvm.AMDGPU.barrier.local() nounwind noduplicate
>>>> -declare void @llvm.AMDGPU.barrier.global() nounwind noduplicate
>>>>
>>>>   define void @barrier(i32 %flags) nounwind noduplicate alwaysinline {
>>>> -barrier_local_test:
>>>> -  %CLK_LOCAL_MEM_FENCE = call i32 @__clc_clk_local_mem_fence()
>>>> -  %0 = and i32 %flags, %CLK_LOCAL_MEM_FENCE
>>>> -  %1 = icmp ne i32 %0, 0
>>>> -  br i1 %1, label %barrier_local, label %barrier_global_test
>>>> -
>>>> -barrier_local:
>>>> +  call void @mem_fence(i32 %flags)
>>>>     call void @llvm.AMDGPU.barrier.local() noduplicate
>>>> -  br label %barrier_global_test
>>>> -
>>>> -barrier_global_test:
>>>> -  %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
>>>> -  %2 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
>>>> -  %3 = icmp ne i32 %2, 0
>>>> -  br i1 %3, label %barrier_global, label %done
>>>> -
>>>> -barrier_global:
>>>> -  call void @llvm.AMDGPU.barrier.global() noduplicate
>>>> -  br label %done
>>>> -
>>>> -done:
>>>>     ret void
>>>>   }
>>>> diff --git a/r600/lib/synchronization/fence_impl.ll 
>>>> b/r600/lib/synchronization/fence_impl.ll
>>>> new file mode 100644
>>>> index 0000000..a72e70e
>>>> --- /dev/null
>>>> +++ b/r600/lib/synchronization/fence_impl.ll
>>>> @@ -0,0 +1,44 @@
>>>> +declare i32 @__clc_clk_local_mem_fence() nounwind alwaysinline
>>>> +declare i32 @__clc_clk_global_mem_fence() nounwind alwaysinline
>>>> +
>>>> +define void @mem_fence(i32 %flags) nounwind noduplicate 
>>>> alwaysinline {
>>>> +  %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
>>>> +  %1 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
>>>> +  %2 = icmp ne i32 %1, 0
>>>> +  br i1 %2, label %global_fence, label %exit_fence
>>>> +
>>>> +global_fence:
>>>> +  fence seq_cst
>>>> +  br label %exit_fence
>>>> +
>>>> +exit_fence:
>>>> +  ret void
>>>> +}
>>>> +
>>>> +define void @read_mem_fence(i32 %flags) nounwind noduplicate 
>>>> alwaysinline {
>>>> +  %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
>>>> +  %1 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
>>>> +  %2 = icmp ne i32 %1, 0
>>>> +  br i1 %2, label %global_fence, label %exit_fence
>>>> +
>>>> +global_fence:
>>> The label global_fence is defined three times in this file. That may
>>> be ok (compiler isn't yelling about it, probably because they're in
>>> different functions), but it does rub me the wrong way...
>>>
>>> I'll echo Tom's comments.  For now, we might end up needing to use
>>> seq_cst, but if we can get clarification, that'd be good.  I also get
>>> the same AtomicFence selection error on my machines as Tom's
>>> getting... but that could be that we're just running a different LLVM
>>> version than you are (maybe?).
>> That's the intended behavior for now (see comments section of the
>> patch).
>> I don't have the llvm part yet, it will take me some time to understand
>> how memory instructions are put into clauses, and how to use fences in
>> the process.
>>
>> The reason I posted the patch was to get an some feedback on whether it
>> makes sense to use existing llvm fence intrinsics to pass the
>> information to llvm, and whether it makes sense to use separate
>> intrinsics for read/write fences, since we'll need to treat them the
>> same anyway (at least that's my understanding of the memory part of isa
>> manual).
>> I should have marked the patch as RFC/WIP.
>>
>> My original approach implemented read/write fences by just calling
>> mem_fence(). That should take care of both using only seq_cst and
>> duplicate labels. I can post is as a v2 if you are ok with a patch that
>> is not really useful on its own.
>>
>> regards,
>> Jan
>>
> You can't use the LLVM atomic fence instruction for this. For reasons 
> I do not understand, the atomic fence LLVM instruction only impacts 
> the ordering of other atomic instructions, but OpenCL mem_fence is for 
> all memory accesses

Plus the LLVM fence instruction doesn't specify the address space(s)