[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Mon Jun 12 17:03:32 PDT 2017

On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com> wrote:
> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>> cc some people who have worked on this.
>>
>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>> Hi all,
>>>
>>> I've been looking into how to implement the more advanced Shader Model
>>> 6 reduction operations in radv (and obviously most of the work would
>>> be useful for radeonsi too). They're explained in the spec for
>>> GL_AMD_shader_ballot at
>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>> but I'll summarize them here. There are two types of operations:
>>> reductions that always return a uniform value, and prefix scan
>>> operations. The reductions can be implemented in terms of the prefix
>>> scan (although in practice I don't think we want to implement them in
>>> exactly the same way), and the concerns are mostly the same, so I'll
>>> focus on the prefix scan operations for now. Given an operation `op'
>>> and an input value `a' (that's really a SIMD array with one value per
>>> invocation, even though it's a scalar value in LLVM), the prefix scan
>>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
>>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
>>> work for non-uniform control flow: it simply skips inactive
>>> invocations.
>>>
>>> On the LLVM side, I think that we have most of the AMD-specific
>>> low-level shuffle intrinsics implemented that you need to do this, but
>>> I can think of a few concerns/questions. First of all, to implement
>>> the prefix scan, we'll need to do a code sequence that looks like
>>> this, modified from
>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>>> v_foo_f32 with the appropriate operation):
>>>
>>> ; v0 is the input register
>>> v_mov_b32 v1, v0
>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>> v_foo_f32 v1, v0, v1 row_shr:3/ / Instruction 3
>>> v_nop // Add two independent instructions to avoid a data hazard
>>> v_nop
>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>> v_nop // Add two independent instructions to avoid a data hazard
>>> v_nop
>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>> v_nop // Add two independent instructions to avoid a data hazard
>>> v_nop
>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>>> v_nop // Add two independent instructions to avoid a data hazard
>>> v_nop
>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>
>>> The problem is that the way these instructions use the DPP word isn't
>>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>>> intrinsic, but it isn't enough. For example, take the first
>>> instruction:
>>>
>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>
>>> What it's doing is shifting v0 right by one within each row and adding
>>> it to v1. v1 stays the same in the first lane of each row, however.
>>> With llvm.amdgcn.mov_dpp, we could try to express it as something like
>>> this, in LLVM-like pseduocode:
>>>
>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>> %result = foo %tmp, %input
>>>
>>> but this is incorrect. If I'm reading the source correctly, this will
>>> make %tmp garbage in lane 0 (since it just turns into a normal move
>>> with the dpp modifier, and no restrictions on the destination). We
>>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>>> 0 in lane 0, but that won't work with operations whose identity is
>>> non-0 like min and max. What we need is something like:
>>>
>
> Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl =0)
> or %input (bound_ctrl = 1)?

Oh, maybe it is... for that to happen the underlying move would need
to have the source and destination constrained to be the same. I
couldn't see that constraint anywhere I looked, but I'm not an expert,
so I may have overlooked it. In any case, that behavior still isn't
what we want if we want to implement the prefix scan operations
efficiently.

Connor

>
> -Tom
>
>
>>> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
>>>
>>> where llvm.amdgcn.foo_dpp copies the first argument to the result,
>>> then applies the DPP swizzling to the second argument and does 'foo'
>>> to the second and third arguments. It would mean that we'd have a
>>> separate intrinsic for every operation we care about, but I can't
>>> think of a better way to express it. Is there a better way that
>>> doesn't involve creating an intrinsic for each operation?
>>>
>>> Next, there's the fact that this code sequence only works when the
>>> active lanes are densely-packed, but we have to make this work even
>>> when control flow is non-uniform. Essentially, we need to "skip over"
>>> the inactive lanes by setting them to the identity, and then we need
>>> to enable them in the exec mask when doing the reduction to make sure
>>> they pass along the correct result. That is, to handle non-uniform
>>> control flow, we need something like:
>>>
>>> invert EXEC
>>> result = identity
>>> set EXEC to ~0
>>> <original code sequence>
>>> restore original EXEC
>>>
>>> I imagine we'd need to add some special llvm.amdcgn.set_inactive_lanes
>>> intrinsic that returns the first argument with inactive lanes set to
>>> the second argument. We'd also need something like WQM to make all the
>>> lanes active during the sequence. But that raises some hairy
>>> requirements for register allocation. For example, in something like:
>>>
>>> foo = ...
>>> if (...) {
>>>     bar = minInvocationsInclusiveScanAMD(...)
>>> } else {
>>>     ... = foo;
>>> }
>>>
>>> we have to make sure that foo isn't allocated to the same register as
>>> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
>>> though they don't interfere. That's because the implementation of
>>> minInvocationsInclusiveScanAMD() will do funny things with the exec
>>> mask, possibly overwriting foo, if the condition is non-uniform. Or
>>> consider the following:
>>>
>>> do {
>>>    bar = minInvocationsInclusiveScanAMD(...);
>>>    // ...
>>>    ... = bar; // last use of bar
>>>   foo = ...;
>>> } while (...);
>>>
>>> ... = foo;
>>>
>>> again, foo and the temporaries used to compute bar can't be assigned
>>> to the same register, even though their live ranges don't intersect,
>>> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
>>> in a previous iteration if the loop exit condition isn't uniform. How
>>> can we express this in the backend? I don't know much about the LLVM
>>> infrastucture, so I'm not sure if it's relatively easy or really hard.
>>>
>>> Thanks,
>>>
>>> Connor
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>