[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Sun Aug 16 17:13:30 PDT 2020
Hi, Nicolai,
Thanks for sending this. What do you think that we should do with the
existing convergent attribute?
-Hal
On 8/9/20 10:03 AM, Nicolai Hähnle via llvm-dev wrote:
> Hi all,
>
> please see https://reviews.llvm.org/D85603 and its related changes for
> our most recent and hopefully final attempt at putting the
> `convergent` attribute on a solid theoretical foundation in a way that
> is useful for modern GPU compiler use cases. We have a clear line of
> sight to enabling a new control flow implementation in the AMDGPU
> backend that is built on this foundation. I have recently started
> upstreaming some of the groundwork. We expect this foundation to also
> be usable
> by non-GPU whole-program vectorization environments, if they choose to
> do so.
>
> What follows is a (not quite) brief overview of what this is all
> about, but do see the linked review and its "stack" for a fuller story.
>
>
> The Problem
> -----------
> `convergent` was originally introduced in LLVM to label "barrier"
> instructions in GPU programming languages. Those put some unusual
> constraints on program transforms, constraints that weren't
> particularly well understood at the time. Since then, GPU languages
> have introduced
> additional instructions such as wave shuffles or subgroup reductions
> that made the problem harder and exposed some shortcomings of the
> current definition of `convergent`. At the same time, those additional
> instructions helped illuminate what the problem really is.
>
> Briefly, in whole-program vectorization environments such as GPUs, we
> can expose opportunities for performance optimizations to developers
> by allowing data exchange between threads that are grouped together as
> lanes in a vector. As long as control flow is uniform, i.e. all
> threads of the vector follow the same path through the CFG, it is
> natural that all threads assigned to the same vector participate in
> this data exchange. When control flow diverges, some threads may be
> unavailable for the data exchange, but we still need to allow data
> exchange among the threads that _are_ available.
>
> This means that those data exchange operations -- i.e., `convergent`
> operations -- communicate among a set of threads that implicitly
> depends on the position of the operation in control flow.
>
> The problem boils down to defining how the set of communicating
> threads is determined.
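>
> As a minimal sketch of this dependence (reusing the hypothetical
> @subgroupAdd reduction from the example below): a convergent
> operation placed inside a divergent branch communicates only among
> the threads that take that branch.
>
> br i1 %cond, label %then, label %exit
>
> then:
> ; only the threads for which %cond is true participate here
> %s = call i32 @subgroupAdd(i32 %v) ; convergent
> br label %exit
>
> exit:
> ...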
>
> This problem inherently has a large target- and environment-specific
> component. Even if control flow is fully uniform, there is the
> question of which threads are grouped together in the first place. In
> the same way that LLVM IR has no built-in understanding of how
> threads come to be (even in the more familiar multi-threaded CPU
> programming world, where convergent operations don't exist), we don't
> concern ourselves with these questions here. For the purpose of
> target-independent LLVM IR
> semantics, we simply assume that they are answered _somehow_ and
> instead focus on the part of the problem that relates to control flow.
>
> When looking at high-level language source, it is often "intuitively
> obvious" which threads should communicate. This is not the case in an
> unstructured CFG. Here's an example of a non-trivial loop where a
> reduction is used (the result of the reduction is the sum of the input
> values of all participating threads):
>
> A:
> br label %B
>
> B:
> ...
> %sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
> ...
> br i1 %cc1, label %B, label %C
>
> C:
> ...
> br i1 %cc2, label %B, label %D
>
> D:
> ; loop exit
>
> Suppose this code is executed by two threads grouped in a (very short)
> vector, and the threads execute the following sequences of basic blocks:
>
> > Thread 1: A B B C D
> > Thread 2: A B C B C D
>
> There are two different intuitive ways of "aligning" the threads for
> the purpose of communication between convergent operations:
>
> > Align based on natural loops:
> > Thread 1: A B - B C D
> > Thread 2: A B C B C D
>
> > Align based on a nested loop interpretation:
> > Thread 1: A B B C - - D
> > Thread 2: A B - C B C D
>
> (Explanation of the alignment: In the first alignment, %sum.b is
> always the sum of the input values from both threads. In the second
> alignment, the first computation of %sum.b is collaborative between
> the two threads; the second one in each thread simply returns the
> input value for that thread.)
>
> The example only has a single natural loop, but it could result from a
> high-level language source that has two nested do-while loops, so the
> second interpretation may well be what the programmer intended.
>
> It has often been proposed that the alignment should be defined purely
> based on the IR shown above, such as by always aligning based on
> natural loops. This doesn't work, even if we ignore the possibility of
> irreducible control flow. In the example, we could in principle
> generate IR as follows if the "nested loop alignment" is desired:
>
> A:
> br label %outer.hdr
>
> outer.hdr:
> br label %B
>
> B:
> ...
> %sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
> ...
> br i1 %cc1, label %B, label %C
>
> C:
> ...
> br i1 %cc2, label %outer.hdr, label %D
>
> D:
> ; loop exit
>
> From a single-threaded perspective, it would be correct to short-cut
> all paths through outer.hdr, but this results in the original IR and
> therefore different behavior for the convergent operation (under such
> a proposal). Worse, the part of the IR which makes the short-cutting
> incorrect -- i.e., the affected convergent operation -- is potentially
> far away from the paths that we're short-cutting. We want to avoid
> this "spooky action at a distance" because it puts an undue burden on
> generic code transforms.
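>
> For concreteness, the short-cut in question is the following rewrite
> of the branch in C (a sketch; the analogous rewrite applies to the
> unconditional branch in A), which is perfectly correct from a
> single-threaded perspective and leaves outer.hdr dead:
>
> ; before
> br i1 %cc2, label %outer.hdr, label %D
>
> ; after short-cutting through the empty block outer.hdr
> br i1 %cc2, label %B, label %D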
>
>
> The Solution
> ------------
> We introduce explicit annotations in the IR using "convergence tokens"
> that "anchor" a convergent operation with respect to some other point
> in control flow or with respect to the function entry. The details are
> in the linked Phabricator review, so I will simply illustrate here how
> this solves the example above.
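>
> For reference, the review defines three control intrinsics; only
> anchor and loop appear in the examples below, and the exact
> signatures may still evolve:
>
> ; anchor: convergence is anchored to an implementation-defined set
> ; of threads at this point in control flow
> %tok = call token @llvm.experimental.convergence.anchor()
>
> ; entry: convergence is anchored to the threads that entered the
> ; function together
> %tok = call token @llvm.experimental.convergence.entry()
>
> ; loop "heart": placed in a loop header, it aligns threads iteration
> ; by iteration relative to a token defined outside the loop
> %tok = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outside) ]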
>
> To obtain the original natural loop alignment, we augment the example
> as follows:
>
> A:
> %anchor = call token @llvm.experimental.convergence.anchor()
> br label %B
>
> B:
> %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
> ...
> %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %loop) ]
> ...
> br i1 %cc1, label %B, label %C
>
> C:
> ...
> br i1 %cc2, label %B, label %D
>
> D:
> ; loop exit
>
> To obtain the loop nest alignment, we augment it as follows:
>
> A:
> %anchor = call token @llvm.experimental.convergence.anchor()
> br label %outer.hdr
>
> outer.hdr:
> %outer = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
> br label %B
>
> B:
> %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
> ...
> %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %inner) ]
> ...
> br i1 %cc1, label %B, label %C
>
> C:
> ...
> br i1 %cc2, label %outer.hdr, label %D
>
> D:
> ; loop exit
>
> We end up with two nested natural loops as before, but short-cutting
> through outer.hdr is now easily seen to be forbidden because of the
> "loop heart" intrinsic along the path through outer.hdr. (We call it a
> "heart" because intuitively that's where loop iterations are counted
> for the purpose of convergent operations, i.e. we count "heartbeats"
> of the loop.) Having the non-removable outer.hdr block may seem like
> an inefficiency, but at least in the AMDGPU backend we end up needing
> that basic block anyway to implement the desired behavior. We don't
> literally count loop iterations in the AMDGPU backend, but we end up
> doing something isomorphic that also requires the block. We suspect
> that other implementations will have similar requirements. In any
> case, backends are free to optimize the block away using
> target-specific handling.
>
>
> Have you tried other solutions?
> -------------------------------
> Yes, we've tried many things over the years. I've personally been
> thinking about and working on this problem on and off for a
> significant amount of time over the last four years, and during that
> time I supervised a master's thesis on a related topic. Other
> people have thought about the problem a lot as well. Discussions with
> many people in the LLVM community, inside AMD, and elsewhere have all
> helped our understanding of the problem. Many different ideas were
> considered and ultimately rejected in the process.
>
> All attempts to solve the problem _without_ some form of "convergence
> token" have suffered from spooky action at a distance. This includes
> the current definition of `convergent` in the LangRef.
>
> We tried _very_ hard to find a workable definition of `convergent`
> that doesn't talk about groups of threads in the LangRef, but
> ultimately came to the conclusion that thinking about convergent
> operations inherently involves threads of execution, and it's much
> easier to just accept that. As I've already written above, convergent
> operations really are a form of communication among threads (often
> accessing no memory, or only memory inaccessible to the program),
> where the set of communicating threads is affected by the static
> position of the operation in control flow.
>
> The proposal involves no lock-step execution, even though that's what
> many people think of when they think about GPU programming models. It
> is purely expressed in terms of communicating threads, which could in
> theory just be threads on a CPU. We believe that this is more in line
> with how LLVM works. Besides, we found lock-step execution semantics
> to be undesirable from first principles, because they prevent certain
> desirable program transforms or at least make them much more difficult
> to achieve.
>
> The most recent earlier proposal we brought to the LLVM community is
> here: https://reviews.llvm.org/D68994. That one already used a form of
> convergence token, but we ultimately rejected it for two main reasons.
> First, the convergence control intrinsics described there create more
> "noise" in the IR, i.e. there are more intrinsic calls than with what
> we have now. Second, the particular formalism of dynamic instances
> used there had some undesirable side effects that put an undue burden
> on transforms that should be trivial, such as dead-code elimination.
>
> The current proposal still talks about dynamic instances, but there is
> no longer any automatic propagation of convergence: even if two
> threads execute the same dynamic instance of an instruction, whether
> they execute the same dynamic instance of the very next instruction in
> straight-line code is implementation-defined in general (which means
> that program transforms are free to make changes that could affect this).
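>
> As a small sketch of the consequence, consider:
>
> %t = call token @llvm.experimental.convergence.anchor()
> %a = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %t) ]
> %b = call i32 @subgroupAdd(i32 %a) [ "convergencectrl"(token %t) ]
>
> The two reductions communicate among the same set of threads because
> both use the token %t, not because they happen to be adjacent in
> straight-line code, so transforms only need to preserve the token
> relation rather than the exact instruction sequence.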
>
> We chose to use an operand bundle for the convergence token so that
> generic code can identify the relevant token value easily. We
> considered adding the convergence token as a regular function call
> argument, but this makes it impractical for generic transforms to
> understand functions that take both a convergence token and some other
> token.
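>
> To illustrate with a hypothetical function @f that also takes an
> unrelated token argument:
>
> ; operand bundle: generic code can find the convergence token
> ; without knowing anything about @f
> %r1 = call i32 @f(i32 %v, token %other) [ "convergencectrl"(token %anchor) ]
>
> ; hypothetical alternative signature: as a plain argument, the
> ; convergence token would be indistinguishable from %other
> %r2 = call i32 @f2(i32 %v, token %other, token %anchor)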
>
> Cheers,
> Nicolai
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory