[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Tue Jun 13 16:33:19 PDT 2017

> On Jun 12, 2017, at 17:23, Tom Stellard <tstellar at redhat.com> wrote:
> 
> On 06/12/2017 08:03 PM, Connor Abbott wrote:
>> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com> wrote:
>>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>>> cc some people who have worked on this.
>>>> 
>>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>>> Hi all,
>>>>> 
>>>>> I've been looking into how to implement the more advanced Shader Model
>>>>> 6 reduction operations in radv (and obviously most of the work would
>>>>> be useful for radeonsi too). They're explained in the spec for
>>>>> GL_AMD_shader_ballot at
>>>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>>>> but I'll summarize them here. There are two types of operations:
>>>>> reductions that always return a uniform value, and prefix scan
>>>>> operations. The reductions can be implemented in terms of the prefix
>>>>> scan (although in practice I don't think we want to implement them in
>>>>> exactly the same way), and the concerns are mostly the same, so I'll
>>>>> focus on the prefix scan operations for now. Given an operation `op'
>>>>> and an input value `a' (that's really a SIMD array with one value per
>>>>> invocation, even though it's a scalar value in LLVM), the prefix scan
>>>>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
>>>>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
>>>>> work for non-uniform control flow: it simply skips inactive
>>>>> invocations.
>>>>> 
>>>>> On the LLVM side, I think that we have most of the AMD-specific
>>>>> low-level shuffle intrinsics implemented that you need to do this, but
>>>>> I can think of a few concerns/questions. First of all, to implement
>>>>> the prefix scan, we'll need to do a code sequence that looks like
>>>>> this, modified from
>>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>>>>> v_foo_f32 with the appropriate operation):
>>>>> 
>>>>> ; v0 is the input register
>>>>> v_mov_b32 v1, v0
>>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>>> v_foo_f32 v1, v0, v1 row_shr:3/ / Instruction 3
>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>> v_nop
>>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>> v_nop
>>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>> v_nop
>>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>> v_nop
>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>>> 
>>>>> The problem is that the way these instructions use the DPP word isn't
>>>>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>>>>> intrinsic, but it isn't enough. For example, take the first
>>>>> instruction:
>>>>> 
>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>> 
>>>>> What it's doing is shifting v0 right by one within each row and adding
>>>>> it to v1. v1 stays the same in the first lane of each row, however.
>>>>> With llvm.amdgcn.mov_dpp, we could try to express it as something like
>>>>> this, in LLVM-like pseduocode:
>>>>> 
>>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>>>> %result = foo %tmp, %input
>>>>> 
>>>>> but this is incorrect. If I'm reading the source correctly, this will
>>>>> make %tmp garbage in lane 0 (since it just turns into a normal move
>>>>> with the dpp modifier, and no restrictions on the destination). We
>>>>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>>>>> 0 in lane 0, but that won't work with operations whose identity is
>>>>> non-0 like min and max. What we need is something like:
>>>>> 
>>> 
>>> Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl =0)
>>> or %input (bound_ctrl = 1)?
>> 
>> Oh, maybe it is... for that to happen the underlying move would need
>> to have the source and destination constrained to be the same. I
>> couldn't see that constraint anywhere I looked, but I'm not an expert,
>> so I may have overlooked it. In any case, that behavior still isn't
>> what we want if we want to implement the prefix scan operations
>> efficiently.
>> 
> 
> Ok, I see what you are saying now.  I think the best option here is to
> document that the behavior of the llvm.amdgcn.mov.dpp intrinsic is to
> copy its src operand to dst when bound_ctrl = 1 and it reads from an invalid
> thread, and then when bound_ctrl=1, lower the intrinsic to a special tied version
> of V_MOV_B32_dpp where the src and dst are the same register.
> 
> -Tom

This came up before that there’s no way to represent the unmodified input register for the inactive lanes. I think the conclusion was that a new intrinsic is needed to represent this case but I don’t think there was a consensus on what it should look like.

-Matt

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170613/caeac9d4/attachment.html>