[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Mon Jun 12 16:15:35 PDT 2017

cc some people who have worked on this.

On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
> Hi all,
> 
> I've been looking into how to implement the more advanced Shader Model
> 6 reduction operations in radv (and obviously most of the work would
> be useful for radeonsi too). They're explained in the spec for
> GL_AMD_shader_ballot at
> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
> but I'll summarize them here. There are two types of operations:
> reductions that always return a uniform value, and prefix scan
> operations. The reductions can be implemented in terms of the prefix
> scan (although in practice I don't think we want to implement them in
> exactly the same way), and the concerns are mostly the same, so I'll
> focus on the prefix scan operations for now. Given an operation `op'
> and an input value `a' (that's really a SIMD array with one value per
> invocation, even though it's a scalar value in LLVM), the prefix scan
> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
> work for non-uniform control flow: it simply skips inactive
> invocations.
> 
> On the LLVM side, I think that we have most of the AMD-specific
> low-level shuffle intrinsics implemented that you need to do this, but
> I can think of a few concerns/questions. First of all, to implement
> the prefix scan, we'll need to do a code sequence that looks like
> this, modified from
> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
> v_foo_f32 with the appropriate operation):
> 
> ; v0 is the input register
> v_mov_b32 v1, v0
> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
> v_foo_f32 v1, v0, v1 row_shr:3/ / Instruction 3
> v_nop // Add two independent instructions to avoid a data hazard
> v_nop
> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
> v_nop // Add two independent instructions to avoid a data hazard
> v_nop
> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
> v_nop // Add two independent instructions to avoid a data hazard
> v_nop
> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
> v_nop // Add two independent instructions to avoid a data hazard
> v_nop
> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
> 
> The problem is that the way these instructions use the DPP word isn't
> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
> intrinsic, but it isn't enough. For example, take the first
> instruction:
> 
> v_foo_f32 v1, v0, v1 row_shr:1
> 
> What it's doing is shifting v0 right by one within each row and adding
> it to v1. v1 stays the same in the first lane of each row, however.
> With llvm.amdgcn.mov_dpp, we could try to express it as something like
> this, in LLVM-like pseduocode:
> 
> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
> %result = foo %tmp, %input
> 
> but this is incorrect. If I'm reading the source correctly, this will
> make %tmp garbage in lane 0 (since it just turns into a normal move
> with the dpp modifier, and no restrictions on the destination). We
> could set bound_ctrl to 0 to work around this, since it will make %tmp
> 0 in lane 0, but that won't work with operations whose identity is
> non-0 like min and max. What we need is something like:
> 
> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
> 
> where llvm.amdgcn.foo_dpp copies the first argument to the result,
> then applies the DPP swizzling to the second argument and does 'foo'
> to the second and third arguments. It would mean that we'd have a
> separate intrinsic for every operation we care about, but I can't
> think of a better way to express it. Is there a better way that
> doesn't involve creating an intrinsic for each operation?
> 
> Next, there's the fact that this code sequence only works when the
> active lanes are densely-packed, but we have to make this work even
> when control flow is non-uniform. Essentially, we need to "skip over"
> the inactive lanes by setting them to the identity, and then we need
> to enable them in the exec mask when doing the reduction to make sure
> they pass along the correct result. That is, to handle non-uniform
> control flow, we need something like:
> 
> invert EXEC
> result = identity
> set EXEC to ~0
> <original code sequence>
> restore original EXEC
> 
> I imagine we'd need to add some special llvm.amdcgn.set_inactive_lanes
> intrinsic that returns the first argument with inactive lanes set to
> the second argument. We'd also need something like WQM to make all the
> lanes active during the sequence. But that raises some hairy
> requirements for register allocation. For example, in something like:
> 
> foo = ...
> if (...) {
>     bar = minInvocationsInclusiveScanAMD(...)
> } else {
>     ... = foo;
> }
> 
> we have to make sure that foo isn't allocated to the same register as
> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
> though they don't interfere. That's because the implementation of
> minInvocationsInclusiveScanAMD() will do funny things with the exec
> mask, possibly overwriting foo, if the condition is non-uniform. Or
> consider the following:
> 
> do {
>    bar = minInvocationsInclusiveScanAMD(...);
>    // ...
>    ... = bar; // last use of bar
>   foo = ...;
> } while (...);
> 
> ... = foo;
> 
> again, foo and the temporaries used to compute bar can't be assigned
> to the same register, even though their live ranges don't intersect,
> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
> in a previous iteration if the loop exit condition isn't uniform. How
> can we express this in the backend? I don't know much about the LLVM
> infrastucture, so I'm not sure if it's relatively easy or really hard.
> 
> Thanks,
> 
> Connor
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>