[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Connor Abbott via llvm-dev llvm-dev at lists.llvm.org
Wed Jun 14 14:05:43 PDT 2017


On Tue, Jun 13, 2017 at 6:13 PM, Tom Stellard <tstellar at redhat.com> wrote:
> On 06/13/2017 07:33 PM, Matt Arsenault wrote:
>>
>>> On Jun 12, 2017, at 17:23, Tom Stellard <tstellar at redhat.com> wrote:
>>>
>>> On 06/12/2017 08:03 PM, Connor Abbott wrote:
>>>> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com> wrote:
>>>>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>>>>> cc some people who have worked on this.
>>>>>>
>>>>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I've been looking into how to implement the more advanced Shader Model
>>>>>>> 6 reduction operations in radv (and obviously most of the work would
>>>>>>> be useful for radeonsi too). They're explained in the spec for
>>>>>>> GL_AMD_shader_ballot at
>>>>>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>>>>>> but I'll summarize them here. There are two types of operations:
>>>>>>> reductions that always return a uniform value, and prefix scan
>>>>>>> operations. The reductions can be implemented in terms of the prefix
>>>>>>> scan (although in practice I don't think we want to implement them in
>>>>>>> exactly the same way), and the concerns are mostly the same, so I'll
>>>>>>> focus on the prefix scan operations for now. Given an operation `op'
>>>>>>> and an input value `a' (that's really a SIMD array with one value per
>>>>>>> invocation, even though it's a scalar value in LLVM), the prefix scan
>>>>>>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
>>>>>>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
>>>>>>> work for non-uniform control flow: it simply skips inactive
>>>>>>> invocations.
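>>>>>>>
>>>>>>> To make that concrete: with `op' = add and four active invocations
>>>>>>> holding a = (3, 1, 4, 1), the prefix scan returns (3, 4, 8, 9), and
>>>>>>> the corresponding reduction is just the last active invocation's
>>>>>>> scan result (9 here) made uniform across every invocation.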
>>>>>>>
>>>>>>> On the LLVM side, I think that we have most of the AMD-specific
>>>>>>> low-level shuffle intrinsics implemented that you need to do this, but
>>>>>>> I can think of a few concerns/questions. First of all, to implement
>>>>>>> the prefix scan, we'll need to do a code sequence that looks like
>>>>>>> this, modified from
>>>>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>>>>>>> v_foo_f32 with the appropriate operation):
>>>>>>>
>>>>>>> ; v0 is the input register
>>>>>>> v_mov_b32 v1, v0
>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>>>>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>> v_nop
>>>>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>> v_nop
>>>>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>> v_nop
>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>> v_nop
>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>>>>>
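>>>>>>> (To unpack what this sequence is doing: each "row" is 16 lanes;
>>>>>>> instructions 1-5 build up an inclusive scan within each row, and the
>>>>>>> two row_bcast steps then fold each row's total into the rows after
>>>>>>> it, completing the scan across the full 64-lane wavefront.)
>>>>>>>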
>>>>>>> The problem is that the way these instructions use the DPP word isn't
>>>>>>> currently expressible in LLVM. We have the llvm.amdgcn.mov.dpp
>>>>>>> intrinsic, but it isn't enough. For example, take the first
>>>>>>> instruction:
>>>>>>>
>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>>
>>>>>>> What it's doing is shifting v0 right by one within each row and
>>>>>>> combining it with v1 via `foo'; v1 stays the same in the first lane
>>>>>>> of each row, however.
>>>>>>> With llvm.amdgcn.mov.dpp, we could try to express it as something like
>>>>>>> this, in LLVM-like pseudocode:
>>>>>>>
>>>>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>>>>>> %result = foo %tmp, %input
>>>>>>>
>>>>>>> but this is incorrect. If I'm reading the source correctly, this will
>>>>>>> make %tmp garbage in lane 0 (since it just turns into a normal move
>>>>>>> with the dpp modifier, and no restrictions on the destination). We
>>>>>>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>>>>>>> 0 in lane 0, but that won't work with operations whose identity is
>>>>>>> non-0 like min and max. What we need is something like:
>>>>>>>
>>>>>
>>>>> Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl = 0)
>>>>> or %input (bound_ctrl = 1)?
>>>>
>>>> Oh, maybe it is... for that to happen the underlying move would need
>>>> to have the source and destination constrained to be the same. I
>>>> couldn't see that constraint anywhere I looked, but I'm not an expert,
>>>> so I may have overlooked it. In any case, that behavior still isn't
>>>> what we want if we want to implement the prefix scan operations
>>>> efficiently.
>>>>
>>>
>>> Ok, I see what you are saying now.  I think the best option here is to
>>> document that, when bound_ctrl = 1 and the read comes from an invalid
>>> thread, the llvm.amdgcn.mov.dpp intrinsic copies its src operand to dst,
>>> and then lower the intrinsic in that case to a special tied version of
>>> V_MOV_B32_dpp where the src and dst are the same register.
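>>>
>>> (In TableGen terms that presumably means giving the tied version a
>>> constraint along the lines of let Constraints = "$old = $vdst", so
>>> the register allocator is forced to put the source and destination
>>> in the same register.)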
>>>
>>> -Tom
>>
>> This came up before: there’s no way to represent the unmodified input register for the inactive lanes. I think the conclusion was that a new intrinsic is needed to represent this case, but I don’t think there was a consensus on what it should look like.
>>
>
> I think re-defining the existing intrinsic will be easier than adding a new
> one.  Also, if we add a new one, llvm.amdgcn.mov.dpp will still be broken
> for the bound_ctrl=1 case, which isn't really ideal.

You can't re-define the existing intrinsic. The problem is that a move
with DPP doesn't just write to its destination register - it modifies
it, i.e. it reads and then writes to it. You can't express that with
SSA, since you can only ever write to an SSA value once. I think what
you want is an intrinsic, say llvm.amdgcn.update.dpp (dunno what you'd
call it?) that takes the value to use if the read is invalid. That is,
something like:

%new = i32 llvm.amdgcn.update.dpp %old, %update, (dpp control)

would get turned into:

v_mov_b32 %new, %old
v_mov_b32 %new, %update (dpp control)

And hopefully coalescing will remove the first copy. Think of it like
lowering LLVM's three-address form for e.g. ADD to the x86
two-address form. If you wanted to get fancy, you could also recognize
the case where the old value is zero and turn that into a DPP move
with bound_ctrl = 1.
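
To spell out that analogy: a three-address IR add like

%c = add i32 %a, %b

is lowered for x86 into a copy plus a tied read-modify-write, roughly:

mov %c, %a
add %c, %b // two-address: %c is both a source and the destination

and the coalescer then tries to give %c and %a the same register so
the mov disappears. The update.dpp lowering above is the same trick,
with the DPP swizzle on the second move.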

Also, remember that the whole point of this was to be able to express
stuff like:

v_min_f32 v1, v0, v1 (dpp control)

where you take the minimum of v1 and the swizzled v0, except where you
would've read an invalid lane for v0, you read the old value for v1
instead. For operations like add and iadd where the identity is 0, you
can set bound_ctrl = 1, and then the optimizer can safely fold the
v_mov_b32 into the operation itself. That is, you'd do:

%swizzled = i32 llvm.amdgcn.update.dpp i32 0, %update, (dpp control)
%new = i32 add %swizzled, %old

and after coalescing, register allocation, etc. the backend would turn
that into:

v_add_i32 v1, v0, v1 (dpp control) bound_ctrl:1

which is functionally equivalent to the version without the bound_ctrl.

Otherwise, for idempotent operations `op' where a `op' a == a (e.g.
min and max), you can do something like:

%swizzled = f32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
%new = f32 fmin %swizzled, %old

and then if the optimizer is clever enough, it can also fold that into
the single v_min_f32 shown above, since the invalid lanes get their
old values back either way. I think that covers all the cases we care
about. The question is
whether it's better to take that route, or whether it's better to just
add intrinsics like llvm.amdgcn.add_dpp, llvm.amdgcn.fmin_dpp, etc. so
that:

%new = f32 llvm.amdgcn.fmin_dpp %old, %src0, %src1 (dpp_control)

turns into:

v_mov_b32 %new, %old
v_min_f32 %new, %src0, %src1 (dpp_control)

and then we can get what we want directly, without having to do much
optimization beyond the coalescing that already exists. The downside
is that we'd have to add a lot more intrinsics (one each for min, max,
fmin, fmax, add, and iadd).
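
For reference, those per-op intrinsics might look something like this
(the names and the exact way the DPP control is passed are just a
sketch, mirroring the update.dpp pseudocode above):

declare float @llvm.amdgcn.fmin.dpp(float %old, float %src0, float %src1, i32 %dpp_ctrl, i32 %row_mask, i32 %bank_mask, i1 %bound_ctrl)
declare i32 @llvm.amdgcn.add.dpp(i32 %old, i32 %src0, i32 %src1, i32 %dpp_ctrl, i32 %row_mask, i32 %bank_mask, i1 %bound_ctrl)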

>
> -Tom
>
>> -Matt
>>
>

