[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Wed Jun 14 07:35:44 PDT 2017

On 06/14/2017 10:33 AM, Sumner, Brian wrote:
> Sorry about the formatting...
> 
> Anyway, I think there may be a misinterpretation of bound_cntl.  My understanding is that:
> 
> 0 => if the source is invalid or disabled, do not write a result
> 1 => if the source is invalid or disabled, use a 0 instead
> 
> So the problematic case is where bound_cntl is 0, not when it is 1.
> 

Yes, you are right, I had that mixed up.

-Tom

> -----Original Message-----
> From: Tom Stellard [mailto:tstellar at redhat.com] 
> Sent: Tuesday, June 13, 2017 6:13 PM
> To: Matt Arsenault
> Cc: Connor Abbott; llvm-dev at lists.llvm.org; Kolton, Sam; Sumner, Brian; Pykhtin, Valery
> Subject: Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
> 
> On 06/13/2017 07:33 PM, Matt Arsenault wrote:
>>
>>> On Jun 12, 2017, at 17:23, Tom Stellard <tstellar at redhat.com <mailto:tstellar at redhat.com>> wrote:
>>>
>>> On 06/12/2017 08:03 PM, Connor Abbott wrote:
>>>> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com <mailto:tstellar at redhat.com>> wrote:
>>>>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>>>>> cc some people who have worked on this.
>>>>>>
>>>>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I've been looking into how to implement the more advanced Shader 
>>>>>>> Model
>>>>>>> 6 reduction operations in radv (and obviously most of the work 
>>>>>>> would be useful for radeonsi too). They're explained in the spec 
>>>>>>> for GL_AMD_shader_ballot at 
>>>>>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader
>>>>>>> _ballot.txt, but I'll summarize them here. There are two types of 
>>>>>>> operations:
>>>>>>> reductions that always return a uniform value, and prefix scan 
>>>>>>> operations. The reductions can be implemented in terms of the 
>>>>>>> prefix scan (although in practice I don't think we want to 
>>>>>>> implement them in exactly the same way), and the concerns are 
>>>>>>> mostly the same, so I'll focus on the prefix scan operations for now. Given an operation `op'
>>>>>>> and an input value `a' (that's really a SIMD array with one value 
>>>>>>> per invocation, even though it's a scalar value in LLVM), the 
>>>>>>> prefix scan returns a[0] in invocation 0, a[0] `op' a[1] in 
>>>>>>> invocation 1, a[0] `op' a[1] `op' a[2] in invocation 2, etc. The 
>>>>>>> prefix scan will also work for non-uniform control flow: it 
>>>>>>> simply skips inactive invocations.
>>>>>>>
>>>>>>> On the LLVM side, I think that we have most of the AMD-specific 
>>>>>>> low-level shuffle intrinsics implemented that you need to do 
>>>>>>> this, but I can think of a few concerns/questions. First of all, 
>>>>>>> to implement the prefix scan, we'll need to do a code sequence 
>>>>>>> that looks like this, modified from 
>>>>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ 
>>>>>>> (replace
>>>>>>> v_foo_f32 with the appropriate operation):
>>>>>>>
>>>>>>> ; v0 is the input register
>>>>>>> v_mov_b32 v1, v0
>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>>>>> v_foo_f32 v1, v0, v1 row_shr:3/ / Instruction 3 v_nop // Add two 
>>>>>>> independent instructions to avoid a data hazard v_nop
>>>>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4 
>>>>>>> v_nop // Add two independent instructions to avoid a data hazard 
>>>>>>> v_nop
>>>>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5 
>>>>>>> v_nop // Add two independent instructions to avoid a data hazard 
>>>>>>> v_nop
>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6 
>>>>>>> v_nop // Add two independent instructions to avoid a data hazard 
>>>>>>> v_nop
>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>>>>>
>>>>>>> The problem is that the way these instructions use the DPP word 
>>>>>>> isn't currently expressible in LLVM. We have the 
>>>>>>> llvm.amdgcn.mov_dpp intrinsic, but it isn't enough. For example, 
>>>>>>> take the first
>>>>>>> instruction:
>>>>>>>
>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>>
>>>>>>> What it's doing is shifting v0 right by one within each row and 
>>>>>>> adding it to v1. v1 stays the same in the first lane of each row, however.
>>>>>>> With llvm.amdgcn.mov_dpp, we could try to express it as something 
>>>>>>> like this, in LLVM-like pseduocode:
>>>>>>>
>>>>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1 %result = foo 
>>>>>>> %tmp, %input
>>>>>>>
>>>>>>> but this is incorrect. If I'm reading the source correctly, this 
>>>>>>> will make %tmp garbage in lane 0 (since it just turns into a 
>>>>>>> normal move with the dpp modifier, and no restrictions on the 
>>>>>>> destination). We could set bound_ctrl to 0 to work around this, 
>>>>>>> since it will make %tmp
>>>>>>> 0 in lane 0, but that won't work with operations whose identity 
>>>>>>> is
>>>>>>> non-0 like min and max. What we need is something like:
>>>>>>>
>>>>>
>>>>> Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl 
>>>>> =0) or %input (bound_ctrl = 1)?
>>>>
>>>> Oh, maybe it is... for that to happen the underlying move would need 
>>>> to have the source and destination constrained to be the same. I 
>>>> couldn't see that constraint anywhere I looked, but I'm not an 
>>>> expert, so I may have overlooked it. In any case, that behavior 
>>>> still isn't what we want if we want to implement the prefix scan 
>>>> operations efficiently.
>>>>
>>>
>>> Ok, I see what you are saying now.  I think the best option here is 
>>> to document that the behavior of the llvm.amdgcn.mov.dpp intrinsic is 
>>> to copy its src operand to dst when bound_ctrl = 1 and it reads from 
>>> an invalid thread, and then when bound_ctrl=1, lower the intrinsic to 
>>> a special tied version of V_MOV_B32_dpp where the src and dst are the same register.
>>>
>>> -Tom
>>
>> This came up before that there’s no way to represent the unmodified input register for the inactive lanes. I think the conclusion was that a new intrinsic is needed to represent this case but I don’t think there was a consensus on what it should look like.
>>
> 
> I think re-defining the existing intrinsic will be easier than adding a new one.  Also, if we add a new one, llvm.amdgcn.mov.dpp will still be broken for the bound_ctrl=1 case, which isn't really ideal.
> 
> -Tom
> 
>> -Matt
>>
>