[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Thu Jun 15 06:56:29 PDT 2017

I'm wondering about the focus on bound_cntl.  Any cleared bit in the row_mask or bank_mask will also disable updating the result.
Brian

-----Original Message-----
From: Connor Abbott [mailto:cwabbott0 at gmail.com] 
Sent: Wednesday, June 14, 2017 6:13 PM
To: tstellar at redhat.com
Cc: Matt Arsenault; llvm-dev at lists.llvm.org; Kolton, Sam; Sumner, Brian; Pykhtin, Valery
Subject: Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

On Wed, Jun 14, 2017 at 5:23 PM, Tom Stellard <tstellar at redhat.com> wrote:
> On 06/14/2017 05:05 PM, Connor Abbott wrote:
>> On Tue, Jun 13, 2017 at 6:13 PM, Tom Stellard <tstellar at redhat.com> wrote:
>>> On 06/13/2017 07:33 PM, Matt Arsenault wrote:
>>>>
>>>>> On Jun 12, 2017, at 17:23, Tom Stellard <tstellar at redhat.com <mailto:tstellar at redhat.com>> wrote:
>>>>>
>>>>> On 06/12/2017 08:03 PM, Connor Abbott wrote:
>>>>>> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com <mailto:tstellar at redhat.com>> wrote:
>>>>>>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>>>>>>> cc some people who have worked on this.
>>>>>>>>
>>>>>>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I've been looking into how to implement the more advanced 
>>>>>>>>> Shader Model
>>>>>>>>> 6 reduction operations in radv (and obviously most of the work 
>>>>>>>>> would be useful for radeonsi too). They're explained in the 
>>>>>>>>> spec for GL_AMD_shader_ballot at 
>>>>>>>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_sha
>>>>>>>>> der_ballot.txt, but I'll summarize them here. There are two 
>>>>>>>>> types of operations:
>>>>>>>>> reductions that always return a uniform value, and prefix scan 
>>>>>>>>> operations. The reductions can be implemented in terms of the 
>>>>>>>>> prefix scan (although in practice I don't think we want to 
>>>>>>>>> implement them in exactly the same way), and the concerns are 
>>>>>>>>> mostly the same, so I'll focus on the prefix scan operations for now. Given an operation `op'
>>>>>>>>> and an input value `a' (that's really a SIMD array with one 
>>>>>>>>> value per invocation, even though it's a scalar value in 
>>>>>>>>> LLVM), the prefix scan returns a[0] in invocation 0, a[0] `op' 
>>>>>>>>> a[1] in invocation 1, a[0] `op' a[1] `op' a[2] in invocation 
>>>>>>>>> 2, etc. The prefix scan will also work for non-uniform control 
>>>>>>>>> flow: it simply skips inactive invocations.
>>>>>>>>>
>>>>>>>>> On the LLVM side, I think that we have most of the 
>>>>>>>>> AMD-specific low-level shuffle intrinsics implemented that you 
>>>>>>>>> need to do this, but I can think of a few concerns/questions. 
>>>>>>>>> First of all, to implement the prefix scan, we'll need to do a 
>>>>>>>>> code sequence that looks like this, modified from 
>>>>>>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ 
>>>>>>>>> (replace
>>>>>>>>> v_foo_f32 with the appropriate operation):
>>>>>>>>>
>>>>>>>>> ; v0 is the input register
>>>>>>>>> v_mov_b32 v1, v0
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:3/ / Instruction 3 v_nop // Add 
>>>>>>>>> two independent instructions to avoid a data hazard v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4 
>>>>>>>>> v_nop // Add two independent instructions to avoid a data 
>>>>>>>>> hazard v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5 
>>>>>>>>> v_nop // Add two independent instructions to avoid a data 
>>>>>>>>> hazard v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 
>>>>>>>>> 6 v_nop // Add two independent instructions to avoid a data 
>>>>>>>>> hazard v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 
>>>>>>>>> 7
>>>>>>>>>
>>>>>>>>> The problem is that the way these instructions use the DPP 
>>>>>>>>> word isn't currently expressible in LLVM. We have the 
>>>>>>>>> llvm.amdgcn.mov_dpp intrinsic, but it isn't enough. For 
>>>>>>>>> example, take the first
>>>>>>>>> instruction:
>>>>>>>>>
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>>>>
>>>>>>>>> What it's doing is shifting v0 right by one within each row 
>>>>>>>>> and adding it to v1. v1 stays the same in the first lane of each row, however.
>>>>>>>>> With llvm.amdgcn.mov_dpp, we could try to express it as 
>>>>>>>>> something like this, in LLVM-like pseduocode:
>>>>>>>>>
>>>>>>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1 %result = foo 
>>>>>>>>> %tmp, %input
>>>>>>>>>
>>>>>>>>> but this is incorrect. If I'm reading the source correctly, 
>>>>>>>>> this will make %tmp garbage in lane 0 (since it just turns 
>>>>>>>>> into a normal move with the dpp modifier, and no restrictions 
>>>>>>>>> on the destination). We could set bound_ctrl to 0 to work 
>>>>>>>>> around this, since it will make %tmp
>>>>>>>>> 0 in lane 0, but that won't work with operations whose 
>>>>>>>>> identity is
>>>>>>>>> non-0 like min and max. What we need is something like:
>>>>>>>>>
>>>>>>>
>>>>>>> Why is %tmp garbage?  I thought the two options were 0 
>>>>>>> (bound_ctrl =0) or %input (bound_ctrl = 1)?
>>>>>>
>>>>>> Oh, maybe it is... for that to happen the underlying move would 
>>>>>> need to have the source and destination constrained to be the 
>>>>>> same. I couldn't see that constraint anywhere I looked, but I'm 
>>>>>> not an expert, so I may have overlooked it. In any case, that 
>>>>>> behavior still isn't what we want if we want to implement the 
>>>>>> prefix scan operations efficiently.
>>>>>>
>>>>>
>>>>> Ok, I see what you are saying now.  I think the best option here 
>>>>> is to document that the behavior of the llvm.amdgcn.mov.dpp 
>>>>> intrinsic is to copy its src operand to dst when bound_ctrl = 1 
>>>>> and it reads from an invalid thread, and then when bound_ctrl=1, 
>>>>> lower the intrinsic to a special tied version of V_MOV_B32_dpp where the src and dst are the same register.
>>>>>
>>>>> -Tom
>>>>
>>>> This came up before that there’s no way to represent the unmodified input register for the inactive lanes. I think the conclusion was that a new intrinsic is needed to represent this case but I don’t think there was a consensus on what it should look like.
>>>>
>>>
>>> I think re-defining the existing intrinsic will be easier than 
>>> adding a new one.  Also, if we add a new one, llvm.amdgcn.mov.dpp 
>>> will still be broken for the bound_ctrl=1 case, which isn't really ideal.
>>
>> You can't re-define the existing intrinsic. The problem is that a 
>> move with DPP doesn't just write to its destination register - it 
>> modifies it, i.e. it reads and then writes to it. You can't express 
>> that with SSA, since you can only ever write to an SSA value once. I 
>> think what you want is an intrinsic, say llvm.amdgcn.update.dpp 
>> (dunno what you'd call it?) that takes the value to use if the read 
>> is invalid. That is, something like:
>>
>> %new = i32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
>>
>
> This is a more general form of what I'm suggesting.  If you re-define 
> (when I say re-define, I just mean document this behavior) 
> llvm.amdgcn.mov.dpp, such that when bound_ctrl = 0 the value of %src 
> will be written to %dst for invalid lanes, then this will have the 
> same effect as llvm.amdgcn.update.dpp when %old = %update.
>
> I can see the usefulness now of having this extra intrinsic, but I 
> still think it would be good to fix llvm.amdgcn.mov.dpp.
> At the MachineIR level, I think you could handle the new intrinsic and 
> llvm.amdgcn.mov.dpp with bound_ctrl = 0 with the same pseudo 
> instructions, so fixing llvm.amdgcn.mov.dpp may not be so hard if you 
> are adding the new intrinsic.

Ah, ok. Still, we're going to need the full llvm.amdgcn.update.dpp to generate the best possible code sequence, and it's a strict superset of llvm.amdgcn.mov.dpp (both the old and fixed versions), so I'm inclined to not bother fixing it and just deprecate it instead.

>
>
>> would get turned into:
>>
>> v_mov_b32 %new, %old
>> v_mov_b32 %new, %update (dpp control)
>>
>> And hopefully coalescing will remove the first copy. Think of it as 
>> like lowering LLVM's three-address form for e.g. ADD to the x86 
>> two-address form. If you wanted to get fancy, you could also 
>> recognize the case where the old value is zero and turn that into a 
>> DPP move with bound_ctrl = 1.
>>
>> Also, remember that the whole point of this was to be able to express 
>> stuff like:
>>
>> v_min_f32 v1, v0, v1 (dpp control)
>>
>> where you take the minimum of v1 and the swizzled v0, except where 
>> you would've read an invalid lane for v0, you read the old value for 
>> v1 instead. For operations like add and iadd where the identity is 0, 
>> you can set bound_ctrl = 1, and then the optimizer can safely fold 
>> the
>> v_mov_b32 into the operation itself. That is, you'd do:
>>
>> %swizzled = i32 llvm.amdgcn.update.dpp i32 0, %update, (dpp control) 
>> %new = i32 add %swizzled, %old
>>
>> and after coalescing, register allocation, etc. the backend would 
>> turn that into:
>>
>> v_add_i32 v1, v0, v1 (dpp control) bound_ctrl:1
>>
>> which is functionally equivalent to the version without the bound_ctrl.
>>
>> Otherwise, for operations `op' where a `op' a == a, you can do something like:
>>
>> %swizzled = f32 llvm.amdgcn.update.dpp %old, %update, (dpp control) 
>> %new = f32 fmin %swizzled, %old
>>
>> and then if the optimizer is clever enough, it can also fold it into 
>> one instruction since the invalid lanes will get their old values 
>> back. I think that covers all the cases we care about. The question 
>> is whether it's better to take that route, or whether it's better to 
>> just add intrinsics like llvm.amdgcn.add_dpp, llvm.amdgcn.fmin_dpp, 
>> etc. so
>> that:
>>
>
> I would only go the route of adding a dpp intrinsic for every 
> operation if it gives you functionality that you can't get with the llvm.amdgcn.update.dpp.
> The main reason for this is that you will lose the generic combines 
> that LLVM has for add, min, etc.

Ok, good point. I think this is only going to be used in a few specific scenarios, but I can see stuff like fusing add + into mad being useful.

>
>
> -Tom
>> %new = f32 llvm.amdgcn.fmin_dpp %old, %src0, %src1 (dpp_control)
>>
>> turns into:
>>
>> v_mov_b32 %new, %old
>> v_min_f32 %new, %src0, %src1 (dpp_control)
>>
>> and then we can get what we want directly, without have to do much 
>> optimization except for coalescing that already exists. The downside 
>> is that we'd have to add a lot more intrinsics (one for min, max, 
>> fmin, fmax, add, and iadd).
>>
>>>
>>> -Tom
>>>
>>>> -Matt
>>>>
>>>
>