[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Tom Stellard via llvm-dev llvm-dev at lists.llvm.org
Wed Jun 14 17:23:29 PDT 2017


On 06/14/2017 05:05 PM, Connor Abbott wrote:
> On Tue, Jun 13, 2017 at 6:13 PM, Tom Stellard <tstellar at redhat.com> wrote:
>> On 06/13/2017 07:33 PM, Matt Arsenault wrote:
>>>
>>>> On Jun 12, 2017, at 17:23, Tom Stellard <tstellar at redhat.com <mailto:tstellar at redhat.com>> wrote:
>>>>
>>>> On 06/12/2017 08:03 PM, Connor Abbott wrote:
>>>>> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com <mailto:tstellar at redhat.com>> wrote:
>>>>>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>>>>>> cc some people who have worked on this.
>>>>>>>
>>>>>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I've been looking into how to implement the more advanced Shader Model
>>>>>>>> 6 reduction operations in radv (and obviously most of the work would
>>>>>>>> be useful for radeonsi too). They're explained in the spec for
>>>>>>>> GL_AMD_shader_ballot at
>>>>>>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>>>>>>> but I'll summarize them here. There are two types of operations:
>>>>>>>> reductions that always return a uniform value, and prefix scan
>>>>>>>> operations. The reductions can be implemented in terms of the prefix
>>>>>>>> scan (although in practice I don't think we want to implement them in
>>>>>>>> exactly the same way), and the concerns are mostly the same, so I'll
>>>>>>>> focus on the prefix scan operations for now. Given an operation `op'
>>>>>>>> and an input value `a' (that's really a SIMD array with one value per
>>>>>>>> invocation, even though it's a scalar value in LLVM), the prefix scan
>>>>>>>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
>>>>>>>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
>>>>>>>> work for non-uniform control flow: it simply skips inactive
>>>>>>>> invocations.
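>>>>>>>>
>>>>>>>> To pin down the semantics, here is a scalar reference model (just a
>>>>>>>> sketch: `exec', `op', and `identity' are illustrative names, not
>>>>>>>> any real API):
>>>>>>>>
>>>>>>>> #include <stdint.h>
>>>>>>>>
>>>>>>>> /* Inclusive prefix scan over one 64-lane wave. Inactive lanes
>>>>>>>>  * (exec bit clear) are skipped and their outputs left alone.
>>>>>>>>  * `identity' must satisfy op(identity, x) == x. */
>>>>>>>> static void prefix_scan(const float in[64], float out[64],
>>>>>>>>                         uint64_t exec,
>>>>>>>>                         float (*op)(float, float), float identity)
>>>>>>>> {
>>>>>>>>     float acc = identity;
>>>>>>>>     for (int lane = 0; lane < 64; lane++) {
>>>>>>>>         if (!((exec >> lane) & 1))
>>>>>>>>             continue;           /* inactive: skipped entirely */
>>>>>>>>         acc = op(acc, in[lane]);
>>>>>>>>         out[lane] = acc;
>>>>>>>>     }
>>>>>>>> }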
>>>>>>>>
>>>>>>>> On the LLVM side, I think that we have most of the AMD-specific
>>>>>>>> low-level shuffle intrinsics implemented that you need to do this, but
>>>>>>>> I can think of a few concerns/questions. First of all, to implement
>>>>>>>> the prefix scan, we'll need to do a code sequence that looks like
>>>>>>>> this, modified from
>>>>>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>>>>>>>> v_foo_f32 with the appropriate operation):
>>>>>>>>
>>>>>>>> ; v0 is the input register
>>>>>>>> v_mov_b32 v1, v0
>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>> v_nop
>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>> v_nop
>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>> v_nop
>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>> v_nop
>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
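>>>>>>>>
>>>>>>>> As a hedged sketch of what, say, Instruction 4 does per lane,
>>>>>>>> ignoring the bank_mask write-mask: each lane combines with the lane
>>>>>>>> 4 below it within its row of 16, and lanes whose source falls
>>>>>>>> outside the row keep their old value, i.e. the "unmodified
>>>>>>>> destination" behavior discussed below:
>>>>>>>>
>>>>>>>> /* per-lane model of `v_foo_f32 v1, v1, v1 row_shr:4': read 4 lanes
>>>>>>>>  * below within the same row of 16; if that source lane doesn't
>>>>>>>>  * exist, keep the old value of v1 */
>>>>>>>> static void row_shr4_step(float v1[64], float (*op)(float, float))
>>>>>>>> {
>>>>>>>>     float out[64];
>>>>>>>>     for (int lane = 0; lane < 64; lane++) {
>>>>>>>>         int src = lane - 4;
>>>>>>>>         int same_row = src >= 0 && src / 16 == lane / 16;
>>>>>>>>         out[lane] = same_row ? op(v1[src], v1[lane]) : v1[lane];
>>>>>>>>     }
>>>>>>>>     for (int lane = 0; lane < 64; lane++)
>>>>>>>>         v1[lane] = out[lane];
>>>>>>>> }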
>>>>>>>>
>>>>>>>> The problem is that the way these instructions use the DPP word isn't
>>>>>>>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>>>>>>>> intrinsic, but it isn't enough. For example, take the first
>>>>>>>> instruction:
>>>>>>>>
>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>>>
>>>>>>>> What it's doing is shifting v0 right by one within each row and adding
>>>>>>>> it to v1. v1 stays the same in the first lane of each row, however.
>>>>>>>> With llvm.amdgcn.mov_dpp, we could try to express it as something like
>>>>>>>> this, in LLVM-like pseudocode:
>>>>>>>>
>>>>>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>>>>>>> %result = foo %tmp, %input
>>>>>>>>
>>>>>>>> but this is incorrect. If I'm reading the source correctly, this will
>>>>>>>> make %tmp garbage in lane 0 (since it just turns into a normal move
>>>>>>>> with the dpp modifier, and no restrictions on the destination). We
>>>>>>>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>>>>>>>> 0 in lane 0, but that won't work with operations whose identity is
>>>>>>>> non-0 like min and max. What we need is something like:
>>>>>>>>
>>>>>>
>>>>>> Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl = 0)
>>>>>> or %input (bound_ctrl = 1)?
>>>>>
>>>>> Oh, maybe it is... for that to happen the underlying move would need
>>>>> to have the source and destination constrained to be the same. I
>>>>> couldn't see that constraint anywhere I looked, but I'm not an expert,
>>>>> so I may have overlooked it. In any case, that behavior still isn't
>>>>> what we want if we want to implement the prefix scan operations
>>>>> efficiently.
>>>>>
>>>>
>>>> Ok, I see what you are saying now.  I think the best option here is to
>>>> document that the llvm.amdgcn.mov.dpp intrinsic copies its src operand to
>>>> dst when bound_ctrl = 1 and it reads from an invalid thread, and then, for
>>>> the bound_ctrl = 1 case, lower the intrinsic to a special tied version
>>>> of V_MOV_B32_dpp where src and dst are the same register.
>>>>
>>>> -Tom
>>>
>>> This came up before: there’s no way to represent the unmodified input register for the inactive lanes. I think the conclusion was that a new intrinsic is needed to represent this case, but I don’t think there was a consensus on what it should look like.
>>>
>>
>> I think re-defining the existing intrinsic will be easier than adding a new
>> one.  Also, if we add a new one, llvm.amdgcn.mov.dpp will still be broken
>> for the bound_ctrl = 1 case, which isn't really ideal.
> 
> You can't re-define the existing intrinsic. The problem is that a move
> with DPP doesn't just write to its destination register - it modifies
> it, i.e. it reads and then writes to it. You can't express that with
> SSA, since you can only ever write to an SSA value once. I think what
> you want is an intrinsic, say llvm.amdgcn.update.dpp (dunno what you'd
> call it?) that takes the value to use if the read is invalid. That is,
> something like:
> 
> %new = i32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
> 

This is a more general form of what I'm suggesting.  If you re-define
llvm.amdgcn.mov.dpp (when I say re-define, I just mean document this
behavior) such that when bound_ctrl = 0 the value of %src is written
to %dst for invalid lanes, then it has the same effect as
llvm.amdgcn.update.dpp when %old = %update.

I can see the usefulness of having this extra intrinsic now,
but I still think it would be good to fix llvm.amdgcn.mov.dpp.
At the MachineIR level, I think you could handle both the new intrinsic
and llvm.amdgcn.mov.dpp with bound_ctrl = 0 using the same
pseudo instructions, so fixing llvm.amdgcn.mov.dpp may not
be so hard if you are adding the new intrinsic.
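
Spelled out as a per-lane model (a sketch only; dpp_src_lane() is a
stand-in for whichever source lane the DPP control word selects,
returning -1 for an invalid read):

#include <stdint.h>

/* hypothetical model of llvm.amdgcn.update.dpp: take the swizzled
 * value when the DPP control selects a valid source lane, otherwise
 * fall back to %old */
static void update_dpp(const uint32_t old[64], const uint32_t upd[64],
                       uint32_t out[64], int (*dpp_src_lane)(int))
{
    for (int lane = 0; lane < 64; lane++) {
        int src = dpp_src_lane(lane);
        out[lane] = src >= 0 ? upd[src] : old[lane];
    }
}

/* the fixed llvm.amdgcn.mov.dpp is then the %old = %update special
 * case: invalid lanes just keep the unswizzled input */
static void mov_dpp_fixed(const uint32_t src[64], uint32_t out[64],
                          int (*dpp_src_lane)(int))
{
    update_dpp(src, src, out, dpp_src_lane);
}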


> would get turned into:
> 
> v_mov_b32 %new, %old
> v_mov_b32 %new, %update (dpp control)
> 
> And hopefully coalescing will remove the first copy. Think of it like
> lowering LLVM's three-address form for e.g. ADD to the x86
> two-address form. If you wanted to get fancy, you could also recognize
> the case where the old value is zero and turn that into a DPP move
> with bound_ctrl = 1.
> 
> Also, remember that the whole point of this was to be able to express
> stuff like:
> 
> v_min_f32 v1, v0, v1 (dpp control)
> 
> where you take the minimum of v1 and the swizzled v0, except that
> where you would've read an invalid lane for v0, you keep the old value
> of v1 instead. For operations like add and iadd where the identity is 0, you
> can set bound_ctrl = 1, and then the optimizer can safely fold the
> v_mov_b32 into the operation itself. That is, you'd do:
> 
> %swizzled = i32 llvm.amdgcn.update.dpp i32 0, %update, (dpp control)
> %new = i32 add %swizzled, %old
> 
> and after coalescing, register allocation, etc. the backend would turn
> that into:
> 
> v_add_i32 v1, v0, v1 (dpp control) bound_ctrl:1
> 
> which is functionally equivalent to the version without the bound_ctrl.
> 
> Otherwise, for operations `op' where a `op' a == a, you can do something like:
> 
> %swizzled = f32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
> %new = f32 fmin %swizzled, %old
> 
> and then if the optimizer is clever enough, it can also fold it into
> one instruction since the invalid lanes will get their old values
> back. I think that covers all the cases we care about. The question is
> whether it's better to take that route, or whether it's better to just
> add intrinsics like llvm.amdgcn.add_dpp, llvm.amdgcn.fmin_dpp, etc. so
> that:
> 

I would only go the route of adding a dpp intrinsic for every operation
if it gives you functionality that you can't get with llvm.amdgcn.update.dpp.
The main reason is that with per-operation intrinsics you lose the generic
combines LLVM has for add, min, etc. (constant folding, identities like
add x, 0 -> x, and so on).


-Tom
> %new = f32 llvm.amdgcn.fmin_dpp %old, %src0, %src1 (dpp_control)
> 
> turns into:
> 
> v_mov_b32 %new, %old
> v_min_f32 %new, %src0, %src1 (dpp_control)
> 
> and then we can get what we want directly, without having to do much
> optimization beyond the coalescing that already exists. The downside
> is that we'd have to add a lot more intrinsics (one for min, max,
> fmin, fmax, add, and iadd).
> 
>>
>> -Tom
>>
>>> -Matt
>>>
>>


