[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

Connor Abbott via llvm-dev llvm-dev at lists.llvm.org
Tue Jun 13 11:48:24 PDT 2017


On Mon, Jun 12, 2017 at 11:23 PM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> On 12.06.2017 23:58, Connor Abbott wrote:
>>
>> Next, there's the fact that this code sequence only works when the
>> active lanes are densely-packed, but we have to make this work even
>> when control flow is non-uniform. Essentially, we need to "skip over"
>> the inactive lanes by setting them to the identity, and then we need
>> to enable them in the exec mask when doing the reduction to make sure
>> they pass along the correct result. That is, to handle non-uniform
>> control flow, we need something like:
>>
>> invert EXEC
>> result = identity
>> set EXEC to ~0
>> <original code sequence>
>> restore original EXEC
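
As a rough illustration of the sequence quoted above, here is a minimal Python sketch that models a wavefront as a list of lanes. Everything in it is an assumption made for illustration: the 8-lane wave, the helper names, and the log-step scan standing in for `<original code sequence>` (the real sequence would use hardware cross-lane operations). It only shows why filling inactive lanes with the identity lets a dense scan produce correct results for the active lanes.

```python
WAVE_SIZE = 8            # assumption: real wavefronts are wider; 8 keeps the example small
IDENTITY = float("inf")  # identity element for min


def dense_inclusive_min_scan(values):
    """Stand-in for <original code sequence>: assumes every lane holds valid data."""
    result = list(values)
    offset = 1
    while offset < len(result):
        prev = list(result)  # all lanes read before any lane writes
        for lane in range(offset, len(result)):
            result[lane] = min(prev[lane], prev[lane - offset])
        offset *= 2
    return result


def masked_inclusive_min_scan(values, exec_mask):
    # "invert EXEC; result = identity": inactive lanes get the identity value.
    filled = [v if active else IDENTITY for v, active in zip(values, exec_mask)]
    # "set EXEC to ~0; <original code sequence>": run the dense scan on all lanes.
    scanned = dense_inclusive_min_scan(filled)
    # "restore original EXEC": only the originally active lanes keep a result.
    return [s if active else None for s, active in zip(scanned, exec_mask)]


values = [5, 3, 9, 1, 7, 2, 8, 4]
exec_mask = [True, False, True, True, False, False, True, True]
print(masked_inclusive_min_scan(values, exec_mask))
# [5, None, 5, 1, None, None, 1, 1]
```

Lane 1 holds 3, but it is inactive, so it must not affect lane 2; the identity fill is what keeps it out of the result.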
>
>
> Yeah, this is going to be a pain, mostly in how it could interact with
> register spilling.
>
>
>
>> I imagine we'd need to add some special llvm.amdgcn.set_inactive_lanes
>> intrinsic that returns the first argument with inactive lanes set to
>> the second argument. We'd also need something like WQM to make all the
>> lanes active during the sequence. But that raises some hairy
>> requirements for register allocation. For example, in something like:
>>
>> foo = ...
>> if (...) {
>>      bar = minInvocationsInclusiveScanAMD(...)
>> } else {
>>      ... = foo;
>> }
>>
>> we have to make sure that foo isn't allocated to the same register as
>> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
>> though they don't interfere. That's because the implementation of
>> minInvocationsInclusiveScanAMD() will do funny things with the exec
>> mask, possibly overwriting foo, if the condition is non-uniform. Or
>> consider the following:
>>
>> do {
>>     bar = minInvocationsInclusiveScanAMD(...);
>>     // ...
>>     ... = bar; // last use of bar
>>     foo = ...;
>> } while (...);
>>
>> ... = foo;
>>
>> again, foo and the temporaries used to compute bar can't be assigned
>> to the same register, even though their live ranges don't intersect,
>> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
>> in a previous iteration if the loop exit condition isn't uniform. How
>> can we express this in the backend? I don't know much about the LLVM
>> infrastructure, so I'm not sure if it's relatively easy or really hard.
>
>
> The actual register allocation is probably comparatively harmless. The
> register allocator runs long after control flow structurization, which means
> that in your examples above, the register allocator actually sees that the
> lifetimes of the relevant variables overlap. Specifically, the basic-block
> structure in your first if-based example is actually:
>
>    foo = ...
>    cbranch merge
>    ; fall-through
>
> if:
>    bar = ...
>    ; fall-through
>
> merge:
>    cbranch endif
>    ; fall-through
>
> else:
>    ... = foo
>    ; fall-through
>
> endif:
>
> ... and so foo and bar will not be assigned the same register.
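
The liveness argument above can be checked with a small backward dataflow sketch. The block dictionaries below are a hypothetical encoding of the two control-flow graphs (the source-level if/else and the structurized layout quoted above), not anything taken from LLVM; the point is only that the extra `if -> merge` edge keeps `foo` live across the block that defines `bar`.

```python
def live_in_sets(blocks):
    """Iterate live_in = uses | (live_out - defs) to a fixed point."""
    live_in = {name: set() for name in blocks}
    changed = True
    while changed:
        changed = False
        for name, block in blocks.items():
            live_out = set()
            for succ in block["succs"]:
                live_out |= live_in[succ]
            new_in = block["uses"] | (live_out - block["defs"])
            if new_in != live_in[name]:
                live_in[name] = new_in
                changed = True
    return live_in


# Source-level CFG: the "if" and "else" blocks are siblings.
ORIGINAL = {
    "entry": {"defs": {"foo"}, "uses": set(), "succs": ["if", "else"]},
    "if":    {"defs": {"bar"}, "uses": set(), "succs": ["endif"]},
    "else":  {"defs": set(), "uses": {"foo"}, "succs": ["endif"]},
    "endif": {"defs": set(), "uses": set(), "succs": []},
}

# Structurized CFG: "if" falls through to "merge", which can still reach "else".
STRUCTURIZED = {
    "entry": {"defs": {"foo"}, "uses": set(), "succs": ["if", "merge"]},
    "if":    {"defs": {"bar"}, "uses": set(), "succs": ["merge"]},
    "merge": {"defs": set(), "uses": set(), "succs": ["else", "endif"]},
    "else":  {"defs": set(), "uses": {"foo"}, "succs": ["endif"]},
    "endif": {"defs": set(), "uses": set(), "succs": []},
}

live_original = live_in_sets(ORIGINAL)
live = live_in_sets(STRUCTURIZED)
print("foo" in live_original["if"])  # False: foo looks dead where bar is defined
print("foo" in live["if"])           # True: foo is live where bar is defined
```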

What about my second example? There, the expanded control flow should
be the same as the original control flow, yet we still have the
problem AFAICT.
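
To spell out the failure mode in the loop example, here is a small Python simulation. The lane count, the per-lane exit iterations, and the tuple values are all made up for illustration; a "register" is just a list with one slot per lane. The scan temporaries are written with every lane enabled, while `foo` is written only under the loop's EXEC mask.

```python
NUM_LANES = 4
EXIT_AFTER = [1, 3, 2, 3]  # hypothetical: iteration after which each lane leaves the loop


def run_loop(share_register):
    foo_reg = [None] * NUM_LANES
    # If the allocator reuses foo's register for the scan temporaries,
    # both names refer to the same per-lane storage.
    tmp_reg = foo_reg if share_register else [None] * NUM_LANES
    active = [True] * NUM_LANES
    iteration = 0
    while any(active):
        iteration += 1
        # Scan temporaries: written with EXEC = ~0, so every lane is touched,
        # including lanes that already left the loop.
        for lane in range(NUM_LANES):
            tmp_reg[lane] = ("tmp", iteration)
        # foo = ...: written only in lanes still executing the loop.
        for lane in range(NUM_LANES):
            if active[lane]:
                foo_reg[lane] = ("foo", lane, iteration)
        # Non-uniform loop exit condition.
        active = [a and iteration < EXIT_AFTER[lane] for lane, a in enumerate(active)]
    return foo_reg  # what "... = foo" reads after the loop


print(run_loop(share_register=False))
# [('foo', 0, 1), ('foo', 1, 3), ('foo', 2, 2), ('foo', 3, 3)]
print(run_loop(share_register=True))
# [('tmp', 3), ('foo', 1, 3), ('tmp', 3), ('foo', 3, 3)]
```

Lanes 0 and 2 exit early, and with a shared register their final `foo` is overwritten by a later iteration's temporaries.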

Also, does this mean that the registers will interfere even if they're
only ever written with non-overlapping exec masks? That seems like a
shame, since in the vast majority of cases you're going to add
artificial extra register pressure and constrain the register
allocator more than necessary...
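
For contrast, a sketch of the benign case: two values that are only ever written and read under disjoint exec masks can share one register without clobbering each other. The masks, values, and helper names are hypothetical, and this says nothing about whether the allocator can actually exploit the property; it only shows why treating the two values as interfering is conservative.

```python
def masked_write(reg, value_per_lane, mask):
    """Write only the lanes enabled in mask, like a VALU write under EXEC."""
    for lane, enabled in enumerate(mask):
        if enabled:
            reg[lane] = value_per_lane[lane]


def masked_read(reg, mask):
    """Read only the lanes enabled in mask; disabled lanes see nothing."""
    return [reg[lane] if enabled else None for lane, enabled in enumerate(mask)]


then_mask = [True, False, True, False]
else_mask = [not m for m in then_mask]  # disjoint from then_mask

shared = [None] * 4  # one physical register holding both values
masked_write(shared, ["a0", "a1", "a2", "a3"], then_mask)
masked_write(shared, ["b0", "b1", "b2", "b3"], else_mask)

a_seen = masked_read(shared, then_mask)
b_seen = masked_read(shared, else_mask)
print(a_seen)  # ['a0', None, 'a2', None]
print(b_seen)  # [None, 'b1', None, 'b3']
```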

>
> Again, where it gets hairy is with spilling, because the spiller is
> hopelessly lost when it comes to understanding EXEC masks...

Ok, I'll look into it. Thanks for the hints.

>
> Cheers,
> Nicolai
>
>
>> Thanks,
>>
>> Connor
>>
>
>
> --
> Learn how the world really is,
> But never forget how it should be.

