[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics
Nicolai Hähnle via llvm-dev
llvm-dev at lists.llvm.org
Sun Aug 9 08:03:42 PDT 2020
Hi all,
please see https://reviews.llvm.org/D85603 and its related changes for
our most recent and hopefully final attempt at putting the `convergent`
attribute on a solid theoretical foundation in a way that is useful for
modern GPU compiler use cases. We have a clear line of sight to enabling a
new control flow implementation in the AMDGPU backend that is built on
this foundation. I have recently started upstreaming some of the groundwork.
We expect this foundation to also be usable by non-GPU whole-program
vectorization environments, should they choose to adopt it.
What follows is a (not quite) brief overview of what this is all about,
but do see the linked review and its "stack" for a fuller story.
The Problem
-----------
`convergent` was originally introduced in LLVM to label "barrier"
instructions in GPU programming languages. Those put some unusual
constraints on program transforms that weren't particularly well
understood at the time. Since then, GPU languages have introduced
additional instructions such as wave shuffles or subgroup reductions
that made the problem harder and exposed some shortcomings of the
current definition of `convergent`. At the same time, those additional
instructions helped illuminate what the problem really is.
Briefly, in whole-program vectorization environments such as GPUs, we
can expose opportunities for performance optimizations to developers by
allowing data exchange between threads that are grouped together as
lanes in a vector. As long as control flow is uniform, i.e. all threads
of the vector follow the same path through the CFG, it is natural that
all threads assigned to the same vector participate in this data
exchange. When control flow diverges, some threads may be unavailable
for the data exchange, but we still need to allow data exchange among
the threads that _are_ available.
This means that those data exchange operations -- i.e., `convergent`
operations -- communicate among a set of threads that implicitly depends
on the position of the operation in control flow.
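As a minimal illustration (using the same @subgroupAdd reduction that
appears in the larger example below), consider a convergent call placed
under a divergent branch; the labels and %cc are just for this sketch:
br i1 %cc, label %then, label %exit
then:
; Only the threads for which %cc is true reach this block, so only they
; participate in this reduction; moving the call changes who participates.
%partial = call i32 @subgroupAdd(i32 %v) ; convergent
br label %exit
exit:
...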
The problem boils down to defining how the set of communicating threads
is determined.
This problem inherently has a large target- and environment-specific
component. Even if control flow is fully uniform, there is the question
of which threads are grouped together in the first place. Just as LLVM IR
has no built-in understanding of how threads come to exist (even in the
more familiar multi-threaded CPU programming world, where convergent
operations don't exist), we don't concern ourselves with these questions
here. For the
purpose of target-independent LLVM IR semantics, we simply assume that
they are answered _somehow_ and instead focus on the part of the problem
that relates to control flow.
When looking at high-level language source, it is often "intuitively
obvious" which threads should communicate. This is not the case in an
unstructured CFG. Here's an example of a non-trivial loop where a
reduction is used (the result of the reduction is the sum of the input
values of all participating threads):
A:
br label %B
B:
...
%sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
...
br i1 %cc1, label %B, label %C
C:
...
br i1 %cc2, label %B, label %D
D:
; loop exit
Suppose this code is executed by two threads grouped in a (very short)
vector, and the threads execute the following sequences of basic blocks:
> Thread 1: A B B C D
> Thread 2: A B C B C D
There are two different intuitive ways of "aligning" the threads for the
purpose of communication between convergent operations:
> Align based on natural loops:
> Thread 1: A B - B C D
> Thread 2: A B C B C D
> Align based on a nested loop interpretation:
> Thread 1: A B B C - - D
> Thread 2: A B - C B C D
(Explanation of the alignment: In the first alignment, %sum.b is always
the sum of the input values from both threads. In the second alignment,
the first computation of %sum.b is collaborative between the two
threads; the second one in each thread simply returns the input value
for that thread.)
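To make that concrete, suppose %v is 1 in thread 1 and 2 in thread 2 on
every execution of B. Under the natural-loop alignment, %sum.b is 3 in
every execution of B in both threads. Under the nested-loop alignment,
%sum.b is 3 in the first execution of B in each thread, but only 1 in
thread 1's second execution and only 2 in thread 2's second execution.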
The example only has a single natural loop, but it could result from a
high-level language source that has two nested do-while loops, so the
second interpretation may well be what the programmer intended.
It has often been proposed that the alignment should be defined purely
based on the IR shown above, such as by always aligning based on natural
loops. This doesn't work, even if we ignore the possibility of
irreducible control flow. In the example, we could in principle generate
IR as follows if the "nested loop alignment" is desired:
A:
br label %outer.hdr
outer.hdr:
br label %B
B:
...
%sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
...
br i1 %cc1, label %B, label %C
C:
...
br i1 %cc2, label %outer.hdr, label %D
D:
; loop exit
From a single-threaded perspective, it would be correct to short-cut
all paths through outer.hdr, but this results in the original IR and
therefore different behavior for the convergent operation (under such a
proposal). Worse, the part of the IR which makes the short-cutting
incorrect -- i.e., the affected convergent operation -- is potentially
far away from the paths that we're short-cutting. We want to avoid this
"spooky action at a distance" because it puts an undue burden on generic
code transforms.
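For concreteness: outer.hdr contains nothing but an unconditional branch,
so a generic CFG cleanup would normally fold it and redirect its
predecessors straight to %B, roughly like this:
A:
br label %B ; was: br label %outer.hdr
...
C:
...
br i1 %cc2, label %B, label %D ; was: br i1 %cc2, label %outer.hdr, label %D
That recreates the first version of the IR and, under such a proposal,
silently changes the behavior of %sum.b.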
The Solution
------------
We introduce explicit annotations in the IR using "convergence tokens"
that "anchor" a convergent operation with respect to some other point in
control flow or with respect to the function entry. The details are in
the linked Phabricator review, so I will simply illustrate here how this
solves the example above.
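For reference, the two intrinsics used below simply return a token value;
their declarations look roughly like this (the exact attribute lists are
spelled out in the review):
declare token @llvm.experimental.convergence.anchor() convergent
declare token @llvm.experimental.convergence.loop() convergent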
To obtain the original natural loop alignment, we augment the example as
follows:
A:
%anchor = call token @llvm.experimental.convergence.anchor()
br label %B
B:
%loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
...
%sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %loop) ]
...
br i1 %cc1, label %B, label %C
C:
...
br i1 %cc2, label %B, label %D
D:
; loop exit
To obtain the loop nest alignment, we augment it as follows:
A:
%anchor = call token @llvm.experimental.convergence.anchor()
br label %outer.hdr
outer.hdr:
%outer = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
br label %B
B:
%inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
...
%sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %inner) ]
...
br i1 %cc1, label %B, label %C
C:
...
br i1 %cc2, label %outer.hdr, label %D
D:
; loop exit
We end up with two nested natural loops as before, but short-cutting
through outer.hdr is easily seen as forbidden because of the "loop
heart" intrinsic along the path through outer.hdr. (We call it a "heart"
because intuitively that's where loop iterations are counted for the
purpose of convergent operations, i.e. we count "heartbeats" of the
loop.) Having the non-removable outer.hdr block may seem like an
inefficiency, but at least in the AMDGPU backend we end up needing that
basic block anyway to implement the desired behavior. We don't literally
count loop iterations in the AMDGPU backend, but we end up doing
something isomorphic that also requires the block. We suspect that other
implementations will have similar requirements. In any case, backends
are free to optimize the block away using target-specific handling.
Have you tried other solutions?
-------------------------------
Yes, we've tried many things over the years. I've personally been
thinking about and working on this problem on and off for a significant
amount of time over the last four years. During this time I supervised a
practical master's thesis on a related topic. Other people have thought
about the problem a lot as well. Discussions with many people in the
LLVM community, inside AMD, and elsewhere have all helped our
understanding of the problem. Many different ideas were considered and
ultimately rejected in the process.
All attempts to solve the problem _without_ some form of "convergence
token" have suffered from spooky action at a distance. This includes the
current definition of `convergent` in the LangRef.
We tried _very_ hard to find a workable definition of `convergent` that
doesn't talk about groups of threads in the LangRef, but ultimately came
to the conclusion that thinking about convergent operations inherently
involves threads of execution, and it's much easier to just accept that.
As I've already written above, convergent operations really are a form of
communication among threads (often modeled as accessing no memory, or only
inaccessible memory) where the set of communicating threads is affected by
the static position of the operation in control flow.
The proposal contains no lock-step execution, even though that's what
many people think of when they think about GPU programming models. It is
purely expressed in terms of communicating threads, which could in
theory just be threads on a CPU. We believe that this is more in line
with how LLVM works. Besides, we found lock-step execution semantics to
be undesirable from first principles, because they prevent certain
desirable program transforms or at least make them much more difficult
to achieve.
The most recent earlier proposal we brought to the LLVM community is
here: https://reviews.llvm.org/D68994. That one already used a form of
convergence token, but we ultimately rejected it for two main reasons.
First, the convergence control intrinsics described there create more
"noise" in the IR, i.e. there are more intrinsic calls than with what we
have now. Second, the particular formalism of dynamic instances used
there had some undesirable side effects that put an undue burden on
transforms that should be trivial, such as dead-code elimination.
The current proposal still talks about dynamic instances, but there is
no longer any automatic propagation of convergence: even if two threads
execute the same dynamic instance of an instruction, whether they
execute the same dynamic instance of the very next instruction in
straight-line code is implementation-defined in general (which means
that program transforms are free to make changes that could affect this).
We chose to use an operand bundle for the convergence token so that
generic code can identify the relevant token value easily. We considered
adding the convergence token as a regular function call argument, but
this makes it impractical for generic transforms to understand functions
that take both a convergence token and some other token.
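To illustrate the difference (the second form is hypothetical and not part
of the proposal; @subgroupAdd.tok is an invented name):
; Operand bundle form (proposed): generic code can pick out the convergence
; token without knowing anything about the callee's signature.
%sum = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %tok) ]
; Plain-argument form (rejected): the convergence token is just another
; argument and cannot be distinguished from any other token parameter.
%sum = call i32 @subgroupAdd.tok(token %tok, i32 %v)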
Cheers,
Nicolai
--
Learn what the world really is like,
but never forget how it ought to be.