[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)

Wed Dec 19 11:31:41 PST 2018

Hi all,

LLVM needs a solution to the long-standing problem that the IR is unable 
to express certain semantics expected by high-level programming 
languages that target GPUs.

Solving this issue is necessary both for upstream use of LLVM as a 
compiler backend for GPUs and for correctly supporting LLVM IR <-> 
SPIR-V roundtrip translation. It may also be useful for compilers 
targeting SPMD-on-CPU a la ispc -- after all, some GPU hardware really 
just looks like a CPU with extra-wide SIMD.

After thinking and talking about the problem on and off for more than 
two years now, I'm convinced that it can only be solved by adding 
dedicated semantics to LLVM IR, which take the shape of:

- a new function (and call) attribute similar to `convergent`,
- explicitly adding language to the LangRef that talks about groups of 
threads and their communication with each other via special functions,
- including how these groups of threads are impacted (split and merged) 
by branches etc., and
- new target-independent intrinsic(s) which manipulate subgroups of 
threads, mostly by marking locations in the program where threads reconverge

Details to be determined, of course.

In this mail, I would mostly like to make initial contact with the 
larger community. First, to explain the problem to CPU-centric folks and 
maybe help get over the initial shock. Second, to reach out to other 
people thinking about GPUs: Has anybody else given this issue much 
thought? How can we proceed to get this solved?

The Problem
===========
Programming languages for GPUs have "cross-lane" or "subgroup" 
operations which allow fine-grained exchange of data between threads 
being executed together in a "wave" or "warp".

The poster-child is ballot, which takes a boolean argument and returns a 
bitmask of the value of the argument across the 
"subgroup"/"wave"/"warp", but more complex operations exist as well e.g. 
for reducing a value across all active lanes of a wave or for computing 
a prefix scan.

The key issue is that *the set of threads participating in the exchange 
of data is implicitly defined by control flow*.

Two examples to demonstrate the resulting problem and the limitation of 
the existing LLVM IR semantics. The first one:

       bool value = ...;

       if (condition) {
         bitmask0 = ballot(value);
         foo(bitmask0);
       } else {
         bitmask1 = ballot(value);
         bar(bitmask1);
       }

The semantics of high-level languages demand that `bitmask0` only 
contains set bits for threads (lanes) for which `condition` is true, and 
analogously for `bitmask1`. However, there is no reasonable way in LLVM 
IR to prevent the ballot call from being hoisted above the if-statement, 
which changes the behavior.

(Existing frontends for the AMDGPU target currently implement a gross 
hack where `value` is passed through a call to a unique chunk of no-op 
inline assembly that is marked as having unspecified side effects...)

The second example:

       uint64_t bitmask;
       for (;;) {
         ...
         if (...) {
           bool value = ...;
           bitmask = ballot(value);
           break;
         }
         ...
       }

The semantics of high-level languages expect that `bitmask` only 
contains set bits for threads (lanes) which break from the loop in the 
same iteration. However, the basic block containing the ballot call in 
the natural lowering to LLVM IR is not part of the loop at all. The 
information that it was intended to be run as part of the loop is 
currently lost forever.

The Design Space of Solutions
=============================
The relevant high-level languages are structured programming languages, 
where the behavior of subgroups falls out quite naturally. Needless to 
say, we cannot rely on structured control flow in LLVM IR.

HSAIL defines subgroups as forking at branches and joining at the 
immediate post-dominator. It also attempts to define restrictions on 
program transformations in terms of immediate dominators and 
post-dominators. I am not certain that this definition is sound in all 
cases, and it is too restrictive in places.

Its main disadvantage is that describing restrictions on transformations 
purely in terms of dominators and post-dominators causes non-local 
effects. For example, jump threading can change the 
dominator/post-dominator tree, but verifying whether the corresponding 
transform is legal in the face of subgroup operations requires 
inspecting distant parts of the code that jump threading would not have 
to look at for a CPU target.

So I reject HSAIL-like approaches based on the fact that they would 
require invasive changes in generic middle-end passes.

There is a type of approach that most people who come into contact with 
this problem eventually at least think about, which suggests replacing 
the implicit dependence on control flow by an explicit one. That is, 
augment subgroup intrinsics with an additional argument that represents 
the subgroup of threads which participate in the exchange of data 
described by this intrinsic, which results in code that looks similar to 
the co-operative groups in new versions of Cuda.

This kind of approach can be a valid solution to the problem of 
preserving the correct semantics, although it imposes annoying 
restrictions on function call ABIs.

The major problem with this kind of approach is that it does not 
actually restrict the transforms made by middle-end passes, and so the 
final IR before code generation might end up containing code patterns 
that cannot be natively expressed on a target that implements SPMD with 
lock-step execution. The burden on the backend to reliably catch and 
implement the rare cases when this happens is excessive. Even for 
targets without lock-step execution, these transforms may have unwanted 
performance impacts. So I reject this type of proposal as well.

The literature from practicioners on SPMD/SIMT control flow (a lot of it 
targeting a hardware audience rather than a compiler audience) does not 
concern itself with this problem to my knowledge, but there is a 
commonly recurring theme of reconverging or rejoining threads at 
explicit instructions and/or post-dominators.

This suggests a viable path towards a solution to me.

The SPIR-V spec has a notion of explicitly structured control flow with 
merge basic blocks. It also defines "dynamic instances" of instructions 
that are distinguished by control flow path, which provides a decent 
option for modeling the set of threads which participate in subgroup 
operations.

The SPIR-V spec itself is IMO quite lacking, in that the formalism is 
very incomplete, the concrete structured control flow constructs are 
very complex, and the details of how they are expressed lead to the same 
non-local effects you get with the HSAIL approach. Nevertheless, I think 
there's another viable path towards a solution hidden there.

Finally and for completeness, it should be noted that if we were to 
forbid irreducible control flow in LLVM IR entirely, that would open up 
more freedom in the design since we could properly define some things in 
terms of loop structure.

Final Thoughts
==============
Formalizing these semantics could also help put divergence analysis on a 
more solid foundation.

Mostly I'm interested in general feedback at this point. Is there an 
important part of the design space that I missed? What do people think 
about explicitly talking about thread groups and e.g. dynamic instances 
as in SPIR-V as part of the LLVM LangRef? Are people generally happy 
with the notion? If not, how can we get to a place where people are 
happy with it?

Thanks,
Nicolai
-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.