[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Connor Abbott via llvm-dev llvm-dev at lists.llvm.org
Mon Jun 12 14:58:07 PDT 2017


Hi all,

I've been looking into how to implement the more advanced Shader Model
6 reduction operations in radv (and obviously most of the work would
be useful for radeonsi too). They're explained in the spec for
GL_AMD_shader_ballot at
https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
but I'll summarize them here. There are two types of operations:
reductions that always return a uniform value, and prefix scan
operations. The reductions can be implemented in terms of the prefix
scan (although in practice I don't think we want to implement them in
exactly the same way), and the concerns are mostly the same, so I'll
focus on the prefix scan operations for now. Given an operation `op'
and an input value `a' (that's really a SIMD array with one value per
invocation, even though it's a scalar value in LLVM), the prefix scan
returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
`op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
work for non-uniform control flow: it simply skips inactive
invocations.
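
To make the semantics concrete, here's a scalar reference model of the
inclusive scan (just a sketch in C; the names are made up, and a 64-wide
wave is assumed):

#include <stdbool.h>

/* Lane `lane' receives `op' folded across a[0..lane], skipping inactive
   invocations. The result only matters for lanes that are themselves
   active. */
float prefix_scan_lane(const float a[64], const bool active[64],
                       float (*op)(float, float), int lane)
{
    float acc = 0.0f;   /* placeholder; overwritten at the first active lane */
    bool seen = false;
    for (int i = 0; i <= lane; i++) {
        if (!active[i])
            continue;   /* inactive invocations are simply skipped */
        acc = seen ? op(acc, a[i]) : a[i];
        seen = true;
    }
    return acc;
}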

On the LLVM side, I think that we have most of the AMD-specific
low-level shuffle intrinsics implemented that you need to do this, but
I can think of a few concerns/questions. First of all, to implement
the prefix scan, we'll need to do a code sequence that looks like
this, modified from
http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
v_foo_f32 with the appropriate operation):

; v0 is the input register
v_mov_b32 v1, v0
v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
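
Ignoring the row/bank masking details, the sequence above is essentially a
log-step (Hillis-Steele style) inclusive scan: each step folds in the
partial result of a lane some fixed distance below. Here's a scalar sketch
of that idea in C, using plain power-of-two offsets rather than the exact
row_shr/row_bcast schedule, and assuming all 64 lanes are active:

/* After the final step, v[i] holds op(v[0], ..., v[i]). */
void log_step_scan(float v[64], float (*op)(float, float))
{
    for (int offset = 1; offset < 64; offset *= 2) {
        float next[64];
        for (int i = 0; i < 64; i++)
            next[i] = (i >= offset) ? op(v[i - offset], v[i]) : v[i];
        for (int i = 0; i < 64; i++)
            v[i] = next[i];
    }
}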

The problem is that the way these instructions use the DPP word isn't
currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
intrinsic, but it isn't enough. For example, take the first
instruction:

v_foo_f32 v1, v0, v1 row_shr:1

What it's doing is shifting v0 right by one lane within each row and
combining it with v1 using `foo'. v1 stays the same in the first lane of
each row, however.
With llvm.amdgcn.mov_dpp, we could try to express it as something like
this, in LLVM-like pseudocode:

%tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
%result = foo %tmp, %input

but this is incorrect. If I'm reading the source correctly, this will
make %tmp garbage in the first lane of each row (since it just turns into
a normal move with the dpp modifier, and no restrictions on the
destination). We could set bound_ctrl to 0 to work around this, since it
will make %tmp 0 in those lanes, but that won't work with operations
whose identity is non-0 like min and max. What we need is something like:

%result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1

where llvm.amdgcn.foo_dpp copies the first argument to the result,
then applies the DPP swizzling to the second argument and does 'foo'
to the second and third arguments. It would mean that we'd have a
separate intrinsic for every operation we care about, but I can't
think of a better way to express it. Is there a better way that
doesn't involve creating an intrinsic for each operation?
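
To pin down the intended semantics, here's a per-lane scalar sketch of
what such a foo_dpp intrinsic would compute, specialized to row_shr:1 for
concreteness (the function name and layout are purely illustrative, and
the exec mask is ignored here):

/* row_shr:1 reads the lane one below within a 16-lane row; the first lane
   of each row has no valid source, so the first operand is passed through.
   (A real implementation would also fall back to `old' when the source
   lane is inactive; that's omitted for brevity.) */
float foo_dpp_row_shr1_lane(const float old[64], const float src[64],
                            const float src2[64], int lane,
                            float (*foo)(float, float))
{
    if (lane % 16 == 0)
        return old[lane];
    return foo(src[lane - 1], src2[lane]);
}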

Next, there's the fact that this code sequence only works when the
active lanes are densely-packed, but we have to make this work even
when control flow is non-uniform. Essentially, we need to "skip over"
the inactive lanes by setting them to the identity, and then we need
to enable them in the exec mask when doing the reduction to make sure
they pass along the correct result. That is, to handle non-uniform
control flow, we need something like:

invert EXEC
result = identity
set EXEC to ~0
<original code sequence>
restore original EXEC
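
In scalar terms, that transformation amounts to something like the
following sketch (reusing the log_step_scan() model from above; the names
are illustrative only):

/* Inactive lanes are seeded with the identity so they pass partial results
   through unchanged, the scan runs as if every lane were active (EXEC = ~0),
   and only the originally active lanes consume the result. */
void scan_nonuniform(const float a[64], const bool active[64],
                     float (*op)(float, float), float identity,
                     float result[64])
{
    float v[64];
    for (int i = 0; i < 64; i++)
        v[i] = active[i] ? a[i] : identity;  /* result = identity in inactive lanes */

    log_step_scan(v, op);                    /* <original code sequence>, all lanes enabled */

    for (int i = 0; i < 64; i++)
        if (active[i])                       /* restore original EXEC */
            result[i] = v[i];
}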

I imagine we'd need to add some special llvm.amdgcn.set_inactive_lanes
intrinsic that returns the first argument with inactive lanes set to
the second argument. We'd also need something like WQM to make all the
lanes active during the sequence. But that raises some hairy
requirements for register allocation. For example, in something like:

foo = ...
if (...) {
    bar = minInvocationsInclusiveScanAMD(...)
} else {
    ... = foo;
}

we have to make sure that foo isn't allocated to the same register as
one of the temporaries used inside minInvocationsInclusiveScanAMD(),
though they don't interfere. That's because the implementation of
minInvocationsInclusiveScanAMD() will do funny things with the exec
mask, possibly overwriting foo, if the condition is non-uniform. Or
consider the following:

do {
    bar = minInvocationsInclusiveScanAMD(...);
    // ...
    ... = bar; // last use of bar
    foo = ...;
} while (...);

... = foo;

again, foo and the temporaries used to compute bar can't be assigned
to the same register, even though their live ranges don't intersect,
since minInvocationsInclusiveScanAMD() may overwrite the value of foo
in a previous iteration if the loop exit condition isn't uniform. How
can we express this in the backend? I don't know much about the LLVM
infrastructure, so I'm not sure if it's relatively easy or really hard.

Thanks,

Connor

