No subject

Sun Apr 27 11:26:59 PDT 2014

BARTS (6850 it looks like any barrier(CLK_GLOBAL_MEM_FENCE) we do has
to request acknowledgement at the time we issue the read
(MEM_RAT_CACHELESS_STORE_RAW_ACK and then use WAIT_ACK to wait until
the number of pending acknowledgements falls to <= 0, after which we
then also issue a GROUP_BARRIER instruction. So basically the
mem_fence gets issued/processed, followed by the barrier instruction.

I'm not sure if this implies that we need to look back in the command
stream to change the store instruction to request acknowledgement...
or if there's a way around that.

If anyone wants the source kernel and binary dumps of the IL/ISA, I'm
perfectly willing to share. Pretty simple stuff, though.  Just declare
a private int, set it, store it globally, and then
barrier(CLK_GLOBAL_MEM_FENCE).

--Aaron

On Wed, Apr 30, 2014 at 1:52 PM, Matt Arsenault
<Matthew.Arsenault at amd.com> wrote:
> On 04/30/2014 09:00 AM, Jan Vesely wrote:
>>
>> On Tue, 2014-04-29 at 20:50 -0500, Aaron Watry wrote:
>>>
>>> On Sun, Apr 27, 2014 at 1:26 PM, Jan Vesely <jan.vesely at rutgers.edu>
>>> wrote:
>>>>
>>>> Hw flushes caches after each mem clause (Ch 8.2 EG manual),
>>>> so we need to pass the information that indicates requested
>>>> clause boundaries.
>>>> Use existing llvm fence instruction for that.
>>>>
>>>> Signed-off-by: Jan Vesely <jan.vesely at rutgers.edu>
>>>> ---
>>>>
>>>> I'm not sure whether using 'acquire' ordering for reads and 'release'
>>>> for writes is the correct thing with respect to OpenCL specs.
>>>> It seems to get read-write-relation correct, but I', not sure about
>>>> read ordering, or write ordering. Making everything use the strongest
>>>> mem_fence
>>>> might be safer.
>>>>
>>>> This patch changes global-memory.cl piglit crash from:
>>>> i32 = GlobalAddress<void ()* @llvm.AMDGPU.barrier.global> 0 [ORD=11]
>>>> Undefined function
>>>> to:
>>>> LLVM ERROR: Cannot select: 0x24b9f58: i32,ch = load 0x2090108,
>>>> 0x24b9718, 0x24b9c40<LD1[@barrier_test_local_int.test(addrspace=3)], zext
>>>> from i1> [ORD=4] [ID=13]
>>>>    0x24b9718: i32 = Constant<0> [ID=7]
>>>>    0x24b9c40: i32 = undef [ID=2]
>>>> In function: barrier_test_local_int
>>>>
>>>> Without the fences the piglit just fails.
>>>>
>>>>
>>>>   r600/lib/SOURCES                         |  1 +
>>>>   r600/lib/synchronization/barrier_impl.ll | 25 ++----------------
>>>>   r600/lib/synchronization/fence_impl.ll   | 44
>>>> ++++++++++++++++++++++++++++++++
>>>>   3 files changed, 47 insertions(+), 23 deletions(-)
>>>>   create mode 100644 r600/lib/synchronization/fence_impl.ll
>>>>
>>>> diff --git a/r600/lib/SOURCES b/r600/lib/SOURCES
>>>> index d9fc897..d93ec0f 100644
>>>> --- a/r600/lib/SOURCES
>>>> +++ b/r600/lib/SOURCES
>>>> @@ -7,4 +7,5 @@ workitem/get_local_id.ll
>>>>   workitem/get_global_size.ll
>>>>   synchronization/barrier.cl
>>>>   synchronization/barrier_impl.ll
>>>> +synchronization/fence_impl.ll
>>>>   shared/vload.cl
>>>> diff --git a/r600/lib/synchronization/barrier_impl.ll
>>>> b/r600/lib/synchronization/barrier_impl.ll
>>>> index 3d8ee66..2f2d865 100644
>>>> --- a/r600/lib/synchronization/barrier_impl.ll
>>>> +++ b/r600/lib/synchronization/barrier_impl.ll
>>>> @@ -1,29 +1,8 @@
>>>> -declare i32 @__clc_clk_local_mem_fence() nounwind alwaysinline
>>>> -declare i32 @__clc_clk_global_mem_fence() nounwind alwaysinline
>>>> +declare void @mem_fence(i32 %flags) nounwind alwaysinline
>>>>   declare void @llvm.AMDGPU.barrier.local() nounwind noduplicate
>>>> -declare void @llvm.AMDGPU.barrier.global() nounwind noduplicate
>>>>
>>>>   define void @barrier(i32 %flags) nounwind noduplicate alwaysinline {
>>>> -barrier_local_test:
>>>> -  %CLK_LOCAL_MEM_FENCE = call i32 @__clc_clk_local_mem_fence()
>>>> -  %0 = and i32 %flags, %CLK_LOCAL_MEM_FENCE
>>>> -  %1 = icmp ne i32 %0, 0
>>>> -  br i1 %1, label %barrier_local, label %barrier_global_test
>>>> -
>>>> -barrier_local:
>>>> +  call void @mem_fence(i32 %flags)
>>>>     call void @llvm.AMDGPU.barrier.local() noduplicate
>>>> -  br label %barrier_global_test
>>>> -
>>>> -barrier_global_test:
>>>> -  %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
>>>> -  %2 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
>>>> -  %3 = icmp ne i32 %2, 0
>>>> -  br i1 %3, label %barrier_global, label %done
>>>> -
>>>> -barrier_global:
>>>> -  call void @llvm.AMDGPU.barrier.global() noduplicate
>>>> -  br label %done
>>>> -
>>>> -done:
>>>>     ret void
>>>>   }
>>>> diff --git a/r600/lib/synchronization/fence_impl.ll
>>>> b/r600/lib/synchronization/fence_impl.ll
>>>> new file mode 100644
>>>> index 0000000..a72e70e
>>>> --- /dev/null
>>>> +++ b/r600/lib/synchronization/fence_impl.ll
>>>> @@ -0,0 +1,44 @@
>>>> +declare i32 @__clc_clk_local_mem_fence() nounwind alwaysinline
>>>> +declare i32 @__clc_clk_global_mem_fence() nounwind alwaysinline
>>>> +
>>>> +define void @mem_fence(i32 %flags) nounwind noduplicate alwaysinline {
>>>> +  %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
>>>> +  %1 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
>>>> +  %2 = icmp ne i32 %1, 0
>>>> +  br i1 %2, label %global_fence, label %exit_fence
>>>> +
>>>> +global_fence:
>>>> +  fence seq_cst
>>>> +  br label %exit_fence
>>>> +
>>>> +exit_fence:
>>>> +  ret void
>>>> +}
>>>> +
>>>> +define void @read_mem_fence(i32 %flags) nounwind noduplicate
>>>> alwaysinline {
>>>> +  %CLK_GLOBAL_MEM_FENCE = call i32 @__clc_clk_global_mem_fence()
>>>> +  %1 = and i32 %flags, %CLK_GLOBAL_MEM_FENCE
>>>> +  %2 = icmp ne i32 %1, 0
>>>> +  br i1 %2, label %global_fence, label %exit_fence
>>>> +
>>>> +global_fence:
>>>
>>> The label global_fence is defined three times in this file.  That may
>>> be ok (compiler isn't yelling about it, probably because they're in
>>> different functions), but it does rub me the wrong way...
>>>
>>> I'll echo Tom's comments.  For now, we might end up needing to use
>>> seq_cst, but if we can get clarification, that'd be good.  I also get
>>> the same AtomicFence selection error on my machines as Tom's
>>> getting... but that could be that we're just running a different LLVM
>>> version than you are (maybe?).
>>
>> That's the intended behavior for now (see comments section of the
>> patch).
>> I don't have the llvm part yet, it will take me some time to understand
>> how memory instructions are put into clauses, and how to use fences in
>> the process.
>>
>> The reason I posted the patch was to get an some feedback on whether it
>> makes sense to use existing llvm fence intrinsics to pass the
>> information to llvm, and whether it makes sense to use separate
>> intrinsics for read/write fences, since we'll need to treat them the
>> same anyway (at least that's my understanding of the memory part of isa
>> manual).
>> I should have marked the patch as RFC/WIP.
>>
>> My original approach implemented read/write fences by just calling
>> mem_fence(). That should take care of both using only seq_cst and
>> duplicate labels. I can post is as a v2 if you are ok with a patch that
>> is not really useful on its own.
>>
>> regards,
>> Jan
>>
> You can't use the LLVM atomic fence instruction for this. For reasons I do
> not understand, the atomic fence LLVM instruction only impacts the ordering
> of other atomic instructions, but OpenCL mem_fence is for all memory
> accesses