[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Tom Stellard via llvm-dev llvm-dev at lists.llvm.org
Mon Jun 12 16:56:43 PDT 2017


On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
> cc some people who have worked on this.
> 
> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>> Hi all,
>>
>> I've been looking into how to implement the more advanced Shader Model
>> 6 reduction operations in radv (and obviously most of the work would
>> be useful for radeonsi too). They're explained in the spec for
>> GL_AMD_shader_ballot at
>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>> but I'll summarize them here. There are two types of operations:
>> reductions that always return a uniform value, and prefix scan
>> operations. The reductions can be implemented in terms of the prefix
>> scan (although in practice I don't think we want to implement them in
>> exactly the same way), and the concerns are mostly the same, so I'll
>> focus on the prefix scan operations for now. Given an operation `op'
>> and an input value `a' (that's really a SIMD array with one value per
>> invocation, even though it's a scalar value in LLVM), the prefix scan
>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
>> work for non-uniform control flow: it simply skips inactive
>> invocations.
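
A minimal Python model of the semantics just described (purely illustrative, not GPU code): an inclusive prefix scan over a wavefront's lanes that skips inactive invocations, with the operation passed in as a parameter.

```python
def inclusive_scan(values, active, op):
    """Per-lane inclusive prefix scan over the active lanes only.

    values: one input value per lane/invocation.
    active: per-lane bool mask (the "exec" state for this sketch).
    op:     the binary reduction operation.

    Inactive lanes are simply skipped: they contribute nothing to the
    running value and their own slot is left unchanged.
    """
    result = list(values)
    acc = None
    for i, v in enumerate(values):
        if not active[i]:
            continue
        acc = v if acc is None else op(acc, v)
        result[i] = acc
    return result
```

For example, with addition and lane 1 inactive, `inclusive_scan([1, 2, 3, 4], [True, False, True, True], lambda a, b: a + b)` yields `[1, 2, 4, 8]`: lane 2 sees the partial sum from lane 0, not from lane 1.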
>>
>> On the LLVM side, I think that we have most of the AMD-specific
>> low-level shuffle intrinsics implemented that you need to do this, but
>> I can think of a few concerns/questions. First of all, to implement
>> the prefix scan, we'll need to do a code sequence that looks like
>> this, modified from
>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>> v_foo_f32 with the appropriate operation):
>>
>> ; v0 is the input register
>> v_mov_b32 v1, v0
>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
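
The sequence above is a log-step (Hillis-Steele style) scan built from DPP shifts and broadcasts. Ignoring the row/bank structure of real DPP, the core idea can be modeled in Python as shifts by growing powers of two, which is why six combining steps suffice for a 64-lane wavefront (this sketch assumes all lanes are active):

```python
def hillis_steele_scan(values, op):
    """Inclusive scan via log2(n) shift-and-combine steps.

    Each step shifts the partial results right by `shift` lanes and
    combines; lanes with no neighbor that far away keep their value,
    analogous to the masked-off lanes in the DPP sequence.
    """
    n = len(values)
    result = list(values)
    shift = 1
    while shift < n:
        prev = list(result)  # read the previous step's results
        for i in range(shift, n):
            result[i] = op(prev[i - shift], prev[i])
        shift *= 2
    return result
```

With 64 lanes this takes six steps (shifts 1, 2, 4, 8, 16, 32); the actual GCN sequence folds the first few steps into row-local shifts and finishes with `row_bcast` across rows.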
>>
>> The problem is that the way these instructions use the DPP word isn't
>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>> intrinsic, but it isn't enough. For example, take the first
>> instruction:
>>
>> v_foo_f32 v1, v0, v1 row_shr:1
>>
>> What it's doing is shifting v0 right by one within each row and adding
>> it to v1. v1 stays the same in the first lane of each row, however.
>> With llvm.amdgcn.mov_dpp, we could try to express it as something like
>> this, in LLVM-like pseudocode:
>>
>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>> %result = foo %tmp, %input
>>
>> but this is incorrect. If I'm reading the source correctly, this will
>> make %tmp garbage in lane 0 (since it just turns into a normal move
>> with the dpp modifier, and no restrictions on the destination). We
>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>> 0 in lane 0, but that won't work with operations whose identity is
>> non-0 like min and max. What we need is something like:
>>

Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl =0)
or %input (bound_ctrl = 1)?

-Tom


>> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
>>
>> where llvm.amdgcn.foo_dpp copies the first argument to the result,
>> then applies the DPP swizzling to the second argument and does 'foo'
>> to the second and third arguments. It would mean that we'd have a
>> separate intrinsic for every operation we care about, but I can't
>> think of a better way to express it. Is there a better way that
>> doesn't involve creating an intrinsic for each operation?
>>
>> Next, there's the fact that this code sequence only works when the
>> active lanes are densely-packed, but we have to make this work even
>> when control flow is non-uniform. Essentially, we need to "skip over"
>> the inactive lanes by setting them to the identity, and then we need
>> to enable them in the exec mask when doing the reduction to make sure
>> they pass along the correct result. That is, to handle non-uniform
>> control flow, we need something like:
>>
>> invert EXEC
>> result = identity
>> set EXEC to ~0
>> <original code sequence>
>> restore original EXEC
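
The exec-mask dance above can be sketched in Python (again purely a model, not the real mechanism): preload the inactive lanes with the operation's identity, run the dense scan with every lane enabled, then restore the original mask so inactive lanes keep their old values.

```python
def scan_nonuniform(values, exec_mask, op, identity):
    """Prefix scan under non-uniform control flow, modeled in software.

    exec_mask: per-lane bool, the lanes active at the call site.
    identity:  the operation's identity (0 for add, +inf for min, ...).
    """
    # "invert EXEC; result = identity": inactive lanes become no-ops
    padded = [v if active else identity
              for v, active in zip(values, exec_mask)]
    # "set EXEC to ~0": run the dense scan over every lane
    acc = identity
    dense = []
    for v in padded:
        acc = op(acc, v)
        dense.append(acc)
    # "restore original EXEC": only originally-active lanes see results
    return [d if active else v
            for d, v, active in zip(dense, values, exec_mask)]
```

Note that the identity is operation-specific, which is exactly why bound_ctrl's hardwired 0 is insufficient for min/max: min needs +infinity and max needs -infinity as the padding value.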
>>
>> I imagine we'd need to add some special llvm.amdgcn.set_inactive_lanes
>> intrinsic that returns the first argument with inactive lanes set to
>> the second argument. We'd also need something like WQM to make all the
>> lanes active during the sequence. But that raises some hairy
>> requirements for register allocation. For example, in something like:
>>
>> foo = ...
>> if (...) {
>>     bar = minInvocationsInclusiveScanAMD(...)
>> } else {
>>     ... = foo;
>> }
>>
>> we have to make sure that foo isn't allocated to the same register as
>> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
>> though they don't interfere. That's because the implementation of
>> minInvocationsInclusiveScanAMD() will do funny things with the exec
>> mask, possibly overwriting foo, if the condition is non-uniform. Or
>> consider the following:
>>
>> do {
>>    bar = minInvocationsInclusiveScanAMD(...);
>>    // ...
>>    ... = bar; // last use of bar
>>    foo = ...;
>> } while (...);
>>
>> ... = foo;
>>
>> again, foo and the temporaries used to compute bar can't be assigned
>> to the same register, even though their live ranges don't intersect,
>> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
>> in a previous iteration if the loop exit condition isn't uniform. How
>> can we express this in the backend? I don't know much about the LLVM
>> infrastructure, so I'm not sure if it's relatively easy or really hard.
>>
>> Thanks,
>>
>> Connor
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
> 
> 
