[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Thu Jun 15 14:56:36 PDT 2017

On Thu, Jun 15, 2017 at 6:56 AM, Sumner, Brian <Brian.Sumner at amd.com> wrote:
> I'm wondering about the focus on bound_cntl.  Any cleared bit in the row_mask or bank_mask will also disable updating the result.
> Brian

True. Although, the same issue happens if you clear a bit in row_mask
or bank_mask as well.

>
> -----Original Message-----
> From: Connor Abbott [mailto:cwabbott0 at gmail.com]
> Sent: Wednesday, June 14, 2017 6:13 PM
> To: tstellar at redhat.com
> Cc: Matt Arsenault; llvm-dev at lists.llvm.org; Kolton, Sam; Sumner, Brian; Pykhtin, Valery
> Subject: Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
>
> On Wed, Jun 14, 2017 at 5:23 PM, Tom Stellard <tstellar at redhat.com> wrote:
>> On 06/14/2017 05:05 PM, Connor Abbott wrote:
>>> On Tue, Jun 13, 2017 at 6:13 PM, Tom Stellard <tstellar at redhat.com> wrote:
>>>> On 06/13/2017 07:33 PM, Matt Arsenault wrote:
>>>>>
>>>>>> On Jun 12, 2017, at 17:23, Tom Stellard <tstellar at redhat.com <mailto:tstellar at redhat.com>> wrote:
>>>>>>
>>>>>> On 06/12/2017 08:03 PM, Connor Abbott wrote:
>>>>>>> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com <mailto:tstellar at redhat.com>> wrote:
>>>>>>>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>>>>>>>> cc some people who have worked on this.
>>>>>>>>>
>>>>>>>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> I've been looking into how to implement the more advanced
>>>>>>>>>> Shader Model
>>>>>>>>>> 6 reduction operations in radv (and obviously most of the work
>>>>>>>>>> would be useful for radeonsi too). They're explained in the
>>>>>>>>>> spec for GL_AMD_shader_ballot at
>>>>>>>>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_sha
>>>>>>>>>> der_ballot.txt, but I'll summarize them here. There are two
>>>>>>>>>> types of operations:
>>>>>>>>>> reductions that always return a uniform value, and prefix scan
>>>>>>>>>> operations. The reductions can be implemented in terms of the
>>>>>>>>>> prefix scan (although in practice I don't think we want to
>>>>>>>>>> implement them in exactly the same way), and the concerns are
>>>>>>>>>> mostly the same, so I'll focus on the prefix scan operations for now. Given an operation `op'
>>>>>>>>>> and an input value `a' (that's really a SIMD array with one
>>>>>>>>>> value per invocation, even though it's a scalar value in
>>>>>>>>>> LLVM), the prefix scan returns a[0] in invocation 0, a[0] `op'
>>>>>>>>>> a[1] in invocation 1, a[0] `op' a[1] `op' a[2] in invocation
>>>>>>>>>> 2, etc. The prefix scan will also work for non-uniform control
>>>>>>>>>> flow: it simply skips inactive invocations.
>>>>>>>>>>
>>>>>>>>>> On the LLVM side, I think that we have most of the
>>>>>>>>>> AMD-specific low-level shuffle intrinsics implemented that you
>>>>>>>>>> need to do this, but I can think of a few concerns/questions.
>>>>>>>>>> First of all, to implement the prefix scan, we'll need to do a
>>>>>>>>>> code sequence that looks like this, modified from
>>>>>>>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
>>>>>>>>>> (replace
>>>>>>>>>> v_foo_f32 with the appropriate operation):
>>>>>>>>>>
>>>>>>>>>> ; v0 is the input register
>>>>>>>>>> v_mov_b32 v1, v0
>>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:3/ / Instruction 3 v_nop // Add
>>>>>>>>>> two independent instructions to avoid a data hazard v_nop
>>>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>>>>>>>>> v_nop // Add two independent instructions to avoid a data
>>>>>>>>>> hazard v_nop
>>>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>>>>>>>>> v_nop // Add two independent instructions to avoid a data
>>>>>>>>>> hazard v_nop
>>>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction
>>>>>>>>>> 6 v_nop // Add two independent instructions to avoid a data
>>>>>>>>>> hazard v_nop
>>>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction
>>>>>>>>>> 7
>>>>>>>>>>
>>>>>>>>>> The problem is that the way these instructions use the DPP
>>>>>>>>>> word isn't currently expressible in LLVM. We have the
>>>>>>>>>> llvm.amdgcn.mov_dpp intrinsic, but it isn't enough. For
>>>>>>>>>> example, take the first
>>>>>>>>>> instruction:
>>>>>>>>>>
>>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>>>>>
>>>>>>>>>> What it's doing is shifting v0 right by one within each row
>>>>>>>>>> and adding it to v1. v1 stays the same in the first lane of each row, however.
>>>>>>>>>> With llvm.amdgcn.mov_dpp, we could try to express it as
>>>>>>>>>> something like this, in LLVM-like pseduocode:
>>>>>>>>>>
>>>>>>>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1 %result = foo
>>>>>>>>>> %tmp, %input
>>>>>>>>>>
>>>>>>>>>> but this is incorrect. If I'm reading the source correctly,
>>>>>>>>>> this will make %tmp garbage in lane 0 (since it just turns
>>>>>>>>>> into a normal move with the dpp modifier, and no restrictions
>>>>>>>>>> on the destination). We could set bound_ctrl to 0 to work
>>>>>>>>>> around this, since it will make %tmp
>>>>>>>>>> 0 in lane 0, but that won't work with operations whose
>>>>>>>>>> identity is
>>>>>>>>>> non-0 like min and max. What we need is something like:
>>>>>>>>>>
>>>>>>>>
>>>>>>>> Why is %tmp garbage?  I thought the two options were 0
>>>>>>>> (bound_ctrl =0) or %input (bound_ctrl = 1)?
>>>>>>>
>>>>>>> Oh, maybe it is... for that to happen the underlying move would
>>>>>>> need to have the source and destination constrained to be the
>>>>>>> same. I couldn't see that constraint anywhere I looked, but I'm
>>>>>>> not an expert, so I may have overlooked it. In any case, that
>>>>>>> behavior still isn't what we want if we want to implement the
>>>>>>> prefix scan operations efficiently.
>>>>>>>
>>>>>>
>>>>>> Ok, I see what you are saying now.  I think the best option here
>>>>>> is to document that the behavior of the llvm.amdgcn.mov.dpp
>>>>>> intrinsic is to copy its src operand to dst when bound_ctrl = 1
>>>>>> and it reads from an invalid thread, and then when bound_ctrl=1,
>>>>>> lower the intrinsic to a special tied version of V_MOV_B32_dpp where the src and dst are the same register.
>>>>>>
>>>>>> -Tom
>>>>>
>>>>> This came up before that there’s no way to represent the unmodified input register for the inactive lanes. I think the conclusion was that a new intrinsic is needed to represent this case but I don’t think there was a consensus on what it should look like.
>>>>>
>>>>
>>>> I think re-defining the existing intrinsic will be easier than
>>>> adding a new one.  Also, if we add a new one, llvm.amdgcn.mov.dpp
>>>> will still be broken for the bound_ctrl=1 case, which isn't really ideal.
>>>
>>> You can't re-define the existing intrinsic. The problem is that a
>>> move with DPP doesn't just write to its destination register - it
>>> modifies it, i.e. it reads and then writes to it. You can't express
>>> that with SSA, since you can only ever write to an SSA value once. I
>>> think what you want is an intrinsic, say llvm.amdgcn.update.dpp
>>> (dunno what you'd call it?) that takes the value to use if the read
>>> is invalid. That is, something like:
>>>
>>> %new = i32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
>>>
>>
>> This is a more general form of what I'm suggesting.  If you re-define
>> (when I say re-define, I just mean document this behavior)
>> llvm.amdgcn.mov.dpp, such that when bound_ctrl = 0 the value of %src
>> will be written to %dst for invalid lanes, then this will have the
>> same effect as llvm.amdgcn.update.dpp when %old = %update.
>>
>> I can see the usefulness now of having this extra intrinsic, but I
>> still think it would be good to fix llvm.amdgcn.mov.dpp.
>> At the MachineIR level, I think you could handle the new intrinsic and
>> llvm.amdgcn.mov.dpp with bound_ctrl = 0 with the same pseudo
>> instructions, so fixing llvm.amdgcn.mov.dpp may not be so hard if you
>> are adding the new intrinsic.
>
> Ah, ok. Still, we're going to need the full llvm.amdgcn.update.dpp to generate the best possible code sequence, and it's a strict superset of llvm.amdgcn.mov.dpp (both the old and fixed versions), so I'm inclined to not bother fixing it and just deprecate it instead.
>
>>
>>
>>> would get turned into:
>>>
>>> v_mov_b32 %new, %old
>>> v_mov_b32 %new, %update (dpp control)
>>>
>>> And hopefully coalescing will remove the first copy. Think of it as
>>> like lowering LLVM's three-address form for e.g. ADD to the x86
>>> two-address form. If you wanted to get fancy, you could also
>>> recognize the case where the old value is zero and turn that into a
>>> DPP move with bound_ctrl = 1.
>>>
>>> Also, remember that the whole point of this was to be able to express
>>> stuff like:
>>>
>>> v_min_f32 v1, v0, v1 (dpp control)
>>>
>>> where you take the minimum of v1 and the swizzled v0, except where
>>> you would've read an invalid lane for v0, you read the old value for
>>> v1 instead. For operations like add and iadd where the identity is 0,
>>> you can set bound_ctrl = 1, and then the optimizer can safely fold
>>> the
>>> v_mov_b32 into the operation itself. That is, you'd do:
>>>
>>> %swizzled = i32 llvm.amdgcn.update.dpp i32 0, %update, (dpp control)
>>> %new = i32 add %swizzled, %old
>>>
>>> and after coalescing, register allocation, etc. the backend would
>>> turn that into:
>>>
>>> v_add_i32 v1, v0, v1 (dpp control) bound_ctrl:1
>>>
>>> which is functionally equivalent to the version without the bound_ctrl.
>>>
>>> Otherwise, for operations `op' where a `op' a == a, you can do something like:
>>>
>>> %swizzled = f32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
>>> %new = f32 fmin %swizzled, %old
>>>
>>> and then if the optimizer is clever enough, it can also fold it into
>>> one instruction since the invalid lanes will get their old values
>>> back. I think that covers all the cases we care about. The question
>>> is whether it's better to take that route, or whether it's better to
>>> just add intrinsics like llvm.amdgcn.add_dpp, llvm.amdgcn.fmin_dpp,
>>> etc. so
>>> that:
>>>
>>
>> I would only go the route of adding a dpp intrinsic for every
>> operation if it gives you functionality that you can't get with the llvm.amdgcn.update.dpp.
>> The main reason for this is that you will lose the generic combines
>> that LLVM has for add, min, etc.
>
> Ok, good point. I think this is only going to be used in a few specific scenarios, but I can see stuff like fusing add + into mad being useful.
>
>>
>>
>> -Tom
>>> %new = f32 llvm.amdgcn.fmin_dpp %old, %src0, %src1 (dpp_control)
>>>
>>> turns into:
>>>
>>> v_mov_b32 %new, %old
>>> v_min_f32 %new, %src0, %src1 (dpp_control)
>>>
>>> and then we can get what we want directly, without have to do much
>>> optimization except for coalescing that already exists. The downside
>>> is that we'd have to add a lot more intrinsics (one for min, max,
>>> fmin, fmax, add, and iadd).
>>>
>>>>
>>>> -Tom
>>>>
>>>>> -Matt
>>>>>
>>>>
>>