[Libclc-dev] [PATCH 1/1] r600: Add fence implementation, rework barrier
Matt Arsenault
Matthew.Arsenault at amd.com
Wed Apr 30 11:52:51 PDT 2014
On 04/30/2014 09:00 AM, Jan Vesely wrote:
> On Tue, 2014-04-29 at 20:50 -0500, Aaron Watry wrote:
>> On Sun, Apr 27, 2014 at 1:26 PM, Jan Vesely <jan.vesely at rutgers.edu> wrote:
>>> Hw flushes caches after each mem clause (Ch 8.2 EG manual),
>>> so we need to pass the information that indicates requested
>>> clause boundaries.
>>> Use existing llvm fence instruction for that.
>>>
>>> Signed-off-by: Jan Vesely <jan.vesely at rutgers.edu>
>>> ---
>>>
>>> I'm not sure whether using 'acquire' ordering for reads and 'release'
>>> for writes is the correct thing with respect to OpenCL specs.
>>> It seems to get read-write-relation correct, but I', not sure about
>>> read ordering, or write ordering. Making everything use the strongest mem_fence
>>> might be safer.
>>>
>>> This patch changes global-memory.cl piglit crash from:
>>> i32 = GlobalAddress<void ()* @llvm.AMDGPU.barrier.global> 0 [ORD=11]
>>> Undefined function
>>> to:
>>> LLVM ERROR: Cannot select: 0x24b9f58: i32,ch = load 0x2090108, 0x24b9718, 0x24b9c40<LD1[@barrier_test_local_int.test(addrspace=3)], zext from i1> [ORD=4] [ID=13]
>>> 0x24b9718: i32 = Constant<0> [ID=7]
>>> 0x24b9c40: i32 = undef [ID=2]
>>> In function: barrier_test_local_int
>>>
>>> Without the fences the piglit just fails.
>>>
>>>
>>> r600/lib/SOURCES | 1 +
>>> r600/lib/synchronization/barrier_impl.ll | 25 ++----------------
>>> r600/lib/synchronization/fence_impl.ll | 44 ++++++++++++++++++++++++++++++++
>>> 3 files changed, 47 insertions(+), 23 deletions(-)
>>> create mode 100644 r600/lib/synchronization/fence_impl.ll
>>>
>>> diff --git a/r600/lib/SOURCES b/r600/lib/SOURCES
>>> index d9fc897..d93ec0f 100644
>>> --- a/r600/lib/SOURCES
>>> +++ b/r600/lib/SOURCES
>>> @@ -7,4 +7,5 @@ workitem/get_local_id.ll
>>> workitem/get_global_size.ll
>>> synchronization/barrier.cl
>>> synchronization/barrier_impl.ll
>>> +synchronization/fence_impl.ll
>>> shared/vload.cl
>>> diff --git a/r600/lib/synchronization/barrier_impl.ll b/r600/lib/synchronization/barrier_impl.ll
>>> index 3d8ee66..2f2d865 100644
>>> --- a/r600/lib/synchronization/barrier_impl.ll
>>> +++ b/r600/lib/synchronization/barrier_impl.ll
>>> @@ -1,29 +1,8 @@
>>> -declare i32 @__clc_clk_local_mem_fence() nounwind alwaysinline
>>> -declare i32 @__clc_clk_global_mem_fence() nounwind alwaysinline
>>> +declare void @mem_fence(i32 %flags) nounwind alwaysinline
>>> declare void @llvm.AMDGPU.barrier.local() nounwind noduplicate
>>> -declare void @llvm.AMDGPU.barrier.global() nounwind noduplicate
>>>
>>> define void @barrier(i32 %flags) nounwind noduplicate alwaysinline {
>>> -barrier_local_test:
>>> - %CLK_LOCAL_MEM_FENCE = call i32 @__clc_clk_local_mem_fence()
>>> - %0 = and i32 %flags, %CLK_LOCAL_MEM_FENCE
>>> - %1 = icmp ne i32 %0, 0
>>> - br i1 %1, label %barrier_local, label %barrier_global_test
>>> -
>>> -barrier_local:
>>> + call void @mem_fence(i32 %flags)
>>> call void @llvm.AMDGPU.barrier.local() noduplicate
>>> - br label %barrier_global_test
>>> -
>>> -barrier_global_test:
>>> - %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
>>> - %2 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
>>> - %3 = icmp ne i32 %2, 0
>>> - br i1 %3, label %barrier_global, label %done
>>> -
>>> -barrier_global:
>>> - call void @llvm.AMDGPU.barrier.global() noduplicate
>>> - br label %done
>>> -
>>> -done:
>>> ret void
>>> }
>>> diff --git a/r600/lib/synchronization/fence_impl.ll b/r600/lib/synchronization/fence_impl.ll
>>> new file mode 100644
>>> index 0000000..a72e70e
>>> --- /dev/null
>>> +++ b/r600/lib/synchronization/fence_impl.ll
>>> @@ -0,0 +1,44 @@
>>> +declare i32 @__clc_clk_local_mem_fence() nounwind alwaysinline
>>> +declare i32 @__clc_clk_global_mem_fence() nounwind alwaysinline
>>> +
>>> +define void @mem_fence(i32 %flags) nounwind noduplicate alwaysinline {
>>> + %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
>>> + %1 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
>>> + %2 = icmp ne i32 %1, 0
>>> + br i1 %2, label %global_fence, label %exit_fence
>>> +
>>> +global_fence:
>>> + fence seq_cst
>>> + br label %exit_fence
>>> +
>>> +exit_fence:
>>> + ret void
>>> +}
>>> +
>>> +define void @read_mem_fence(i32 %flags) nounwind noduplicate alwaysinline {
>>> + %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
>>> + %1 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
>>> + %2 = icmp ne i32 %1, 0
>>> + br i1 %2, label %global_fence, label %exit_fence
>>> +
>>> +global_fence:
>> The label global_fence is defined three times in this file. That may
>> be ok (compiler isn't yelling about it, probably because they're in
>> different functions), but it does rub me the wrong way...
>>
>> I'll echo Tom's comments. For now, we might end up needing to use
>> seq_cst, but if we can get clarification, that'd be good. I also get
>> the same AtomicFence selection error on my machines as Tom's
>> getting... but that could be that we're just running a different LLVM
>> version than you are (maybe?).
> That's the intended behavior for now (see comments section of the
> patch).
> I don't have the llvm part yet, it will take me some time to understand
> how memory instructions are put into clauses, and how to use fences in
> the process.
>
> The reason I posted the patch was to get an some feedback on whether it
> makes sense to use existing llvm fence intrinsics to pass the
> information to llvm, and whether it makes sense to use separate
> intrinsics for read/write fences, since we'll need to treat them the
> same anyway (at least that's my understanding of the memory part of isa
> manual).
> I should have marked the patch as RFC/WIP.
>
> My original approach implemented read/write fences by just calling
> mem_fence(). That should take care of both using only seq_cst and
> duplicate labels. I can post is as a v2 if you are ok with a patch that
> is not really useful on its own.
>
> regards,
> Jan
>
You can't use the LLVM atomic fence instruction for this. For reasons I
do not understand, the atomic fence LLVM instruction only impacts the
ordering of other atomic instructions, but OpenCL mem_fence is for all
memory accesses
More information about the Libclc-dev
mailing list