[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Connor Abbott via llvm-dev llvm-dev at lists.llvm.org
Wed Jun 14 18:12:40 PDT 2017


On Wed, Jun 14, 2017 at 5:23 PM, Tom Stellard <tstellar at redhat.com> wrote:
> On 06/14/2017 05:05 PM, Connor Abbott wrote:
>> On Tue, Jun 13, 2017 at 6:13 PM, Tom Stellard <tstellar at redhat.com> wrote:
>>> On 06/13/2017 07:33 PM, Matt Arsenault wrote:
>>>>
>>>>> On Jun 12, 2017, at 17:23, Tom Stellard <tstellar at redhat.com <mailto:tstellar at redhat.com>> wrote:
>>>>>
>>>>> On 06/12/2017 08:03 PM, Connor Abbott wrote:
>>>>>> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com <mailto:tstellar at redhat.com>> wrote:
>>>>>>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>>>>>>> cc some people who have worked on this.
>>>>>>>>
>>>>>>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I've been looking into how to implement the more advanced Shader Model
>>>>>>>>> 6 reduction operations in radv (and obviously most of the work would
>>>>>>>>> be useful for radeonsi too). They're explained in the spec for
>>>>>>>>> GL_AMD_shader_ballot at
>>>>>>>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>>>>>>>> but I'll summarize them here. There are two types of operations:
>>>>>>>>> reductions that always return a uniform value, and prefix scan
>>>>>>>>> operations. The reductions can be implemented in terms of the prefix
>>>>>>>>> scan (although in practice I don't think we want to implement them in
>>>>>>>>> exactly the same way), and the concerns are mostly the same, so I'll
>>>>>>>>> focus on the prefix scan operations for now. Given an operation `op'
>>>>>>>>> and an input value `a' (that's really a SIMD array with one value per
>>>>>>>>> invocation, even though it's a scalar value in LLVM), the prefix scan
>>>>>>>>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
>>>>>>>>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
>>>>>>>>> work for non-uniform control flow: it simply skips inactive
>>>>>>>>> invocations.
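The semantics described above can be sketched as a scalar reference model in Python (a hypothetical illustration only, not LLVM or GPU code; `op`, `values`, and `exec_mask` are names I made up for the reduction operation, the per-lane inputs, and the set of active invocations):

```python
import operator

# Hypothetical scalar model of an inclusive prefix scan over a wave,
# skipping inactive invocations as the extension describes.
def prefix_scan(op, values, exec_mask):
    results = list(values)      # inactive lanes keep their input value
    running = None              # accumulation over active lanes seen so far
    for lane, active in enumerate(exec_mask):
        if not active:
            continue            # inactive invocations are simply skipped
        running = values[lane] if running is None else op(running, values[lane])
        results[lane] = running
    return results

# lanes 0, 1, 3 active; lane 2 inactive, so the scan skips it
print(prefix_scan(operator.add, [1, 2, 3, 4], [True, True, False, True]))
# -> [1, 3, 3, 7]
```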
>>>>>>>>>
>>>>>>>>> On the LLVM side, I think that we have most of the AMD-specific
>>>>>>>>> low-level shuffle intrinsics implemented that you need to do this, but
>>>>>>>>> I can think of a few concerns/questions. First of all, to implement
>>>>>>>>> the prefix scan, we'll need to do a code sequence that looks like
>>>>>>>>> this, modified from
>>>>>>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>>>>>>>>> v_foo_f32 with the appropriate operation):
>>>>>>>>>
>>>>>>>>> ; v0 is the input register
>>>>>>>>> v_mov_b32 v1, v0
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>>> v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>>> v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>>> v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>>> v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
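The idea behind that sequence is a log-step (Kogge-Stone style) scan. A simplified scalar sketch, ignoring the row/bank structure, exec masking, and hazards that the real sequence has to deal with (`log_step_scan` is a hypothetical name):

```python
import itertools
import operator

# Hypothetical scalar model of a log-step inclusive scan: each step
# combines a lane with the lane `shift` below it, doubling the shift,
# so six steps cover a 64-lane wave.
def log_step_scan(op, values):
    x = list(values)
    n = len(x)
    shift = 1
    while shift < n:
        # lanes that would read below lane 0 keep their value -- the
        # invalid-read / identity case the thread discusses
        x = [op(x[i - shift], x[i]) if i >= shift else x[i] for i in range(n)]
        shift *= 2
    return x

wave = list(range(1, 65))  # 64 lanes
assert log_step_scan(operator.add, wave) == list(itertools.accumulate(wave))
```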
>>>>>>>>>
>>>>>>>>> The problem is that the way these instructions use the DPP word isn't
>>>>>>>>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>>>>>>>>> intrinsic, but it isn't enough. For example, take the first
>>>>>>>>> instruction:
>>>>>>>>>
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>>>>
>>>>>>>>> What it's doing is shifting v0 right by one within each row and adding
>>>>>>>>> it to v1. v1 stays the same in the first lane of each row, however.
>>>>>>>>> With llvm.amdgcn.mov_dpp, we could try to express it as something like
>>>>>>>>> this, in LLVM-like pseudocode:
>>>>>>>>>
>>>>>>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>>>>>>>> %result = foo %tmp, %input
>>>>>>>>>
>>>>>>>>> but this is incorrect. If I'm reading the source correctly, this will
>>>>>>>>> make %tmp garbage in lane 0 (since it just turns into a normal move
>>>>>>>>> with the dpp modifier, and no restrictions on the destination). We
>>>>>>>>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>>>>>>>>> 0 in lane 0, but that won't work with operations whose identity is
>>>>>>>>> non-0 like min and max. What we need is something like:
>>>>>>>>>
>>>>>>>
>>>>>>> Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl = 0)
>>>>>>> or %input (bound_ctrl = 1)?
>>>>>>
>>>>>> Oh, maybe it is... for that to happen the underlying move would need
>>>>>> to have the source and destination constrained to be the same. I
>>>>>> couldn't see that constraint anywhere I looked, but I'm not an expert,
>>>>>> so I may have overlooked it. In any case, that behavior still isn't
>>>>>> what we want if we want to implement the prefix scan operations
>>>>>> efficiently.
>>>>>>
>>>>>
>>>>> Ok, I see what you are saying now.  I think the best option here is to
>>>>> document that the llvm.amdgcn.mov.dpp intrinsic copies its src operand
>>>>> to dst when bound_ctrl = 1 and it reads from an invalid thread, and
>>>>> then, for that case, lower the intrinsic to a special tied version of
>>>>> V_MOV_B32_dpp where the src and dst are the same register.
>>>>>
>>>>> -Tom
>>>>
>>>> This came up before: there’s no way to represent the unmodified input register for the inactive lanes. I think the conclusion was that a new intrinsic is needed to represent this case, but I don’t think there was a consensus on what it should look like.
>>>>
>>>
>>> I think re-defining the existing intrinsic will be easier than adding a new
>>> one.  Also, if we add a new one, llvm.amdgcn.mov.dpp will still be broken
>>> for the bound_ctrl=1 case, which isn't really ideal.
>>
>> You can't re-define the existing intrinsic. The problem is that a move
>> with DPP doesn't just write to its destination register - it modifies
>> it, i.e. it reads and then writes to it. You can't express that with
>> SSA, since you can only ever write to an SSA value once. I think what
>> you want is an intrinsic, say llvm.amdgcn.update.dpp (dunno what you'd
>> call it?) that takes the value to use if the read is invalid. That is,
>> something like:
>>
>> %new = i32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
>>
>
> This is a more general form of what I'm suggesting.  If you re-define
> (when I say re-define, I just mean document this behavior)
> llvm.amdgcn.mov.dpp, such that when bound_ctrl = 0 the value
> of %src will be written to %dst for invalid lanes, then this will
> have the same effect as llvm.amdgcn.update.dpp when %old = %update.
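To make that equivalence concrete, here is a hypothetical scalar model of the proposed semantics (`update_dpp`, `mov_dpp_fixed`, and `src_lane` are illustrative names, not real intrinsics; `src_lane[i]` is the lane the DPP control selects for lane i, or None when that read is invalid):

```python
# Hypothetical model of llvm.amdgcn.update.dpp: each lane reads `update`
# from the lane the DPP control selects, or keeps its `old` value when
# the source lane is invalid.
def update_dpp(old, update, src_lane):
    return [update[s] if s is not None else old[i]
            for i, s in enumerate(src_lane)]

# A mov.dpp fixed as described (invalid lanes copy %src to %dst) is then
# the special case where %old and %update are the same value.
def mov_dpp_fixed(src, src_lane):
    return update_dpp(src, src, src_lane)

# row_shr:1 on a 4-lane toy wave: lane 0 has no valid source lane
shr1 = [None, 0, 1, 2]
print(update_dpp([9, 9, 9, 9], [5, 6, 7, 8], shr1))  # -> [9, 5, 6, 7]
print(mov_dpp_fixed([5, 6, 7, 8], shr1))             # -> [5, 5, 6, 7]
```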
>
> I can see the usefulness now of having this extra intrinsic,
> but I still think it would be good to fix llvm.amdgcn.mov.dpp.
> At the MachineIR level, I think you could handle the new intrinsic
> and llvm.amdgcn.mov.dpp with bound_ctrl = 0 with the same
> pseudo instructions, so fixing llvm.amdgcn.mov.dpp may not
> be so hard if you are adding the new intrinsic.

Ah, ok. Still, we're going to need the full llvm.amdgcn.update.dpp to
generate the best possible code sequence, and it's a strict superset
of llvm.amdgcn.mov.dpp (both the old and fixed versions), so I'm
inclined to not bother fixing it and just deprecate it instead.

>
>
>> would get turned into:
>>
>> v_mov_b32 %new, %old
>> v_mov_b32 %new, %update (dpp control)
>>
>> And hopefully coalescing will remove the first copy. Think of it as
>> like lowering LLVM's three-address form for e.g. ADD to the x86
>> two-address form. If you wanted to get fancy, you could also recognize
>> the case where the old value is zero and turn that into a DPP move
>> with bound_ctrl = 1.
>>
>> Also, remember that the whole point of this was to be able to express
>> stuff like:
>>
>> v_min_f32 v1, v0, v1 (dpp control)
>>
>> where you take the minimum of v1 and the swizzled v0, except where you
>> would've read an invalid lane for v0, you read the old value for v1
>> instead. For operations like add and iadd where the identity is 0, you
>> can set bound_ctrl = 1, and then the optimizer can safely fold the
>> v_mov_b32 into the operation itself. That is, you'd do:
>>
>> %swizzled = i32 llvm.amdgcn.update.dpp i32 0, %update, (dpp control)
>> %new = i32 add %swizzled, %old
>>
>> and after coalescing, register allocation, etc. the backend would turn
>> that into:
>>
>> v_add_i32 v1, v0, v1 (dpp control) bound_ctrl:1
>>
>> which is functionally equivalent to the version without the bound_ctrl.
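That folding argument can be checked with a small scalar model (hypothetical names again; `update_dpp` models the proposed intrinsic, and `src_lane[i]` is the DPP source lane for lane i, None when invalid):

```python
# Hypothetical check that, for an operation with identity 0 (add here),
# feeding %old = 0 into update.dpp and then adding is the same as the
# fused DPP add where an invalid read contributes nothing.
def update_dpp(old, update, src_lane):
    return [update[s] if s is not None else old[i]
            for i, s in enumerate(src_lane)]

def scan_step_via_update(old, update, src_lane):
    swizzled = update_dpp([0] * len(old), update, src_lane)
    return [s + o for s, o in zip(swizzled, old)]

def fused_add_dpp_bound_ctrl(old, update, src_lane):
    # v_add with DPP and bound_ctrl set: invalid source reads behave as 0
    return [old[i] + (update[s] if s is not None else 0)
            for i, s in enumerate(src_lane)]

shr1 = [None, 0, 1, 2]
a, b = [1, 2, 3, 4], [10, 20, 30, 40]
assert scan_step_via_update(a, b, shr1) == fused_add_dpp_bound_ctrl(a, b, shr1)
```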
>>
>> Otherwise, for operations `op' where a `op' a == a, you can do something like:
>>
>> %swizzled = f32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
>> %new = f32 fmin %swizzled, %old
>>
>> and then if the optimizer is clever enough, it can also fold it into
>> one instruction since the invalid lanes will get their old values
>> back. I think that covers all the cases we care about. The question is
>> whether it's better to take that route, or whether it's better to just
>> add intrinsics like llvm.amdgcn.add_dpp, llvm.amdgcn.fmin_dpp, etc. so
>> that:
>>
>
> I would only go the route of adding a dpp intrinsic for every operation
> if it gives you functionality that you can't get with the llvm.amdgcn.update.dpp.
> The main reason for this is that you will lose the generic combines that
> LLVM has for add, min, etc.

Ok, good point. I think this is only going to be used in a few
specific scenarios, but I can see stuff like fusing mul + add into mad
being useful.

>
>
> -Tom
>> %new = f32 llvm.amdgcn.fmin_dpp %old, %src0, %src1 (dpp_control)
>>
>> turns into:
>>
>> v_mov_b32 %new, %old
>> v_min_f32 %new, %src0, %src1 (dpp_control)
>>
>> and then we can get what we want directly, without having to do much
>> optimization except for coalescing that already exists. The downside
>> is that we'd have to add a lot more intrinsics (one for min, max,
>> fmin, fmax, add, and iadd).
>>
>>>
>>> -Tom
>>>
>>>> -Matt
>>>>
>>>
>

