[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Tom Stellard via llvm-dev llvm-dev at lists.llvm.org
Mon Jun 12 17:23:20 PDT 2017


On 06/12/2017 08:03 PM, Connor Abbott wrote:
> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com> wrote:
>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>> cc some people who have worked on this.
>>>
>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>> Hi all,
>>>>
>>>> I've been looking into how to implement the more advanced Shader Model
>>>> 6 reduction operations in radv (and obviously most of the work would
>>>> be useful for radeonsi too). They're explained in the spec for
>>>> GL_AMD_shader_ballot at
>>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>>> but I'll summarize them here. There are two types of operations:
>>>> reductions that always return a uniform value, and prefix scan
>>>> operations. The reductions can be implemented in terms of the prefix
>>>> scan (although in practice I don't think we want to implement them in
>>>> exactly the same way), and the concerns are mostly the same, so I'll
>>>> focus on the prefix scan operations for now. Given an operation `op'
>>>> and an input value `a' (that's really a SIMD array with one value per
>>>> invocation, even though it's a scalar value in LLVM), the prefix scan
>>>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
>>>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
>>>> work for non-uniform control flow: it simply skips inactive
>>>> invocations.
>>>>
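>>>> (A quick worked example with numbers I've made up: for op = min and
>>>> inputs a = [4, 7, 2, 5] in invocations 0-3, the inclusive scan
>>>> returns 4, 4, 2, 2. If invocation 2 is inactive, the active
>>>> invocations see 4, 4, 4 and invocation 2's value is left untouched.
>>>> The uniform reduction is then just the last active invocation's scan
>>>> result, broadcast to everyone.)
>>>>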
>>>> On the LLVM side, I think that we have most of the AMD-specific
>>>> low-level shuffle intrinsics implemented that you need to do this, but
>>>> I can think of a few concerns/questions. First of all, to implement
>>>> the prefix scan, we'll need to do a code sequence that looks like
>>>> this, modified from
>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>>>> v_foo_f32 with the appropriate operation):
>>>>
>>>> ; v0 is the input register
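>>>> ; row_shr:N makes each lane read the lane N below it within its row of
>>>> ; 16 lanes, row_bcast:15/31 broadcasts the last lane of a row/row-pair
>>>> ; to the lanes after it, and row_mask/bank_mask leave the masked-off
>>>> ; rows and 4-lane banks unwritten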
>>>> v_mov_b32 v1, v0
>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
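>>>> ; at this point lane N of v1 holds a[0] `op' a[1] `op' ... `op' a[N]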
>>>>
>>>> The problem is that the way these instructions use the DPP word isn't
>>>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>>>> intrinsic, but it isn't enough. For example, take the first
>>>> instruction:
>>>>
>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>
>>>> What it's doing is shifting v0 right by one within each row and adding
>>>> it to v1. v1 stays the same in the first lane of each row, however.
>>>> With llvm.amdgcn.mov_dpp, we could try to express it as something like
>>>> this, in LLVM-like pseudocode:
>>>>
>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>>> %result = foo %tmp, %input
>>>>
>>>> but this is incorrect. If I'm reading the source correctly, this will
>>>> make %tmp garbage in lane 0 (since it just turns into a normal move
>>>> with the dpp modifier, with no restrictions on the destination). We
>>>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>>>> 0 in lane 0, but that won't work with operations whose identity is
>>>> non-0 like min and max. What we need is something like:
>>>>
>>
>> Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl = 0)
>> or %input (bound_ctrl = 1)?
> 
> Oh, maybe it is... for that to happen the underlying move would need
> to have the source and destination constrained to be the same. I
> couldn't see that constraint anywhere I looked, but I'm not an expert,
> so I may have overlooked it. In any case, that behavior still isn't
> what we want if we want to implement the prefix scan operations
> efficiently.
> 

Ok, I see what you are saying now.  I think the best option here is to
document that, when bound_ctrl = 1 and a lane reads from an invalid
thread, the llvm.amdgcn.mov.dpp intrinsic copies its src operand to dst,
and then, in that bound_ctrl = 1 case, lower the intrinsic to a special
tied version of V_MOV_B32_dpp where src and dst are constrained to be the
same register.
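
In (hypothetical) MIR terms, with made-up operand names, that would be
something like:

  v0 = V_MOV_B32_dpp v0, dpp_ctrl, row_mask, bank_mask  ; src tied to dst

so a lane whose DPP read is invalid simply keeps the old contents of its
register, which gives exactly the copy-src-to-dst behavior.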

-Tom

> Connor
> 
>>
>> -Tom
>>
>>
>>>> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
>>>>
>>>> where llvm.amdgcn.foo_dpp copies the first argument to the result,
>>>> then applies the DPP swizzling to the second argument and does 'foo'
>>>> to the second and third arguments. It would mean that we'd have a
>>>> separate intrinsic for every operation we care about, but I can't
>>>> think of a better way to express it. Is there a better way that
>>>> doesn't involve creating an intrinsic for each operation?
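>>>>
>>>> As a hypothetical declaration (the name and exact parameter list are
>>>> made up), I'm imagining roughly:
>>>>
>>>> declare float @llvm.amdgcn.foo.dpp.f32(
>>>>     float %old,   ; copied to the result in lanes whose DPP read is invalid
>>>>     float %src,   ; swizzled according to %dpp_ctrl
>>>>     float %op2,   ; combined with the swizzled %src as foo(%src, %op2)
>>>>     i32 %dpp_ctrl, i32 %row_mask, i32 %bank_mask)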
>>>>
>>>> Next, there's the fact that this code sequence only works when the
>>>> active lanes are densely-packed, but we have to make this work even
>>>> when control flow is non-uniform. Essentially, we need to "skip over"
>>>> the inactive lanes by setting them to the identity, and then we need
>>>> to enable them in the exec mask when doing the reduction to make sure
>>>> they pass along the correct result. That is, to handle non-uniform
>>>> control flow, we need something like:
>>>>
>>>> invert EXEC
>>>> result = identity
>>>> set EXEC to ~0
>>>> <original code sequence>
>>>> restore original EXEC
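>>>>
>>>> Concretely, as a hand-waved assembly sketch (register choices made up,
>>>> and ignoring that s_not_b64 also clobbers SCC):
>>>>
>>>> s_mov_b64 s[2:3], exec    ; save the original exec mask
>>>> s_not_b64 exec, exec      ; flip to the currently-inactive lanes
>>>> v_mov_b32 v0, <identity>  ; seed their input with the identity of `op'
>>>> s_mov_b64 exec, -1        ; enable all lanes
>>>> <original code sequence>
>>>> s_mov_b64 exec, s[2:3]    ; restore the original exec mask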
>>>>
>>>> I imagine we'd need to add some special llvm.amdgcn.set_inactive_lanes
>>>> intrinsic that returns the first argument with inactive lanes set to
>>>> the second argument. We'd also need something like WQM to make all the
>>>> lanes active during the sequence. But that raises some hairy
>>>> requirements for register allocation. For example, in something like:
>>>>
>>>> foo = ...
>>>> if (...) {
>>>>     bar = minInvocationsInclusiveScanAMD(...)
>>>> } else {
>>>>     ... = foo;
>>>> }
>>>>
>>>> we have to make sure that foo isn't allocated to the same register as
>>>> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
>>>> though they don't interfere. That's because the implementation of
>>>> minInvocationsInclusiveScanAMD() will do funny things with the exec
>>>> mask, possibly overwriting foo, if the condition is non-uniform. Or
>>>> consider the following:
>>>>
>>>> do {
>>>>    bar = minInvocationsInclusiveScanAMD(...);
>>>>    // ...
>>>>    ... = bar; // last use of bar
>>>>    foo = ...;
>>>> } while (...);
>>>>
>>>> ... = foo;
>>>>
>>>> again, foo and the temporaries used to compute bar can't be assigned
>>>> to the same register, even though their live ranges don't intersect,
>>>> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
>>>> in a previous iteration if the loop exit condition isn't uniform. How
>>>> can we express this in the backend? I don't know much about the LLVM
>>>> infrastructure, so I'm not sure if it's relatively easy or really hard.
>>>>
>>>> Thanks,
>>>>
>>>> Connor
>>


