[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics
Nicolai Hähnle via llvm-dev
llvm-dev at lists.llvm.org
Tue Aug 18 01:44:00 PDT 2020
Hi Hal,
On Mon, Aug 17, 2020 at 8:30 PM Hal Finkel <hfinkel at anl.gov> wrote:
> I agree that the new scheme seems better for several reasons. So it
> seems like your recommendation is that we consider the existing
> attribute usage deprecated?
Yes.
Some form of auto-upgrade could be attempted via the
ConvergenceControlHeuristic of https://reviews.llvm.org/D85609, though
that's rather heavyweight for on-the-fly upgrading in the bitcode reader. So
while the groundwork for potentially doing it is there, I don't see a
strong need or desire for it at the moment.
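
Purely as an illustration of what such an upgrade might produce (this
assumes the @llvm.experimental.convergence.entry intrinsic from the
linked reviews, and the heuristic may well make different choices; the
@subgroupAdd placeholder is explained in the RFC below):

; before (deprecated attribute-only form):
%x = call i32 @subgroupAdd(i32 %v) ; convergent

; after (explicit convergence control):
%entry = call token @llvm.experimental.convergence.entry()
%x = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %entry) ]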
Cheers,
Nicolai
> Or should we auto-upgrade it to something?
>
> -Hal
>
> >>>> On 8/9/20 10:03 AM, Nicolai Hähnle via llvm-dev wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> please see https://reviews.llvm.org/D85603 and its related changes for
> >>>>> our most recent and hopefully final attempt at putting the
> >>>>> `convergent` attribute on a solid theoretical foundation in a way that
> >>>>> is useful for modern GPU compiler use cases. We have clear line of
> >>>>> sight to enabling a new control flow implementation in the AMDGPU
> >>>>> backend which is built on this foundation. I have started upstreaming
> >>>>> some ground work recently. We expect this foundation to also be usable
> >>>>> by non-GPU whole-program vectorization environments, should they
> >>>>> choose to adopt it.
> >>>>>
> >>>>> What follows is a (not quite) brief overview of what this is all
> >>>>> about, but do see the linked review and its "stack" for a fuller story.
> >>>>>
> >>>>>
> >>>>> The Problem
> >>>>> -----------
> >>>>> `convergent` was originally introduced in LLVM to label "barrier"
> >>>>> instructions in GPU programming languages. Those put some unusual
> >>>>> constraints on program transforms that weren't particularly well
> >>>>> understood at the time. Since then, GPU languages have introduced
> >>>>> additional instructions such as wave shuffles or subgroup reductions
> >>>>> that made the problem harder and exposed some shortcomings of the
> >>>>> current definition of `convergent`. At the same time, those additional
> >>>>> instructions helped illuminate what the problem really is.
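> >>>>>
> >>>>> (For reference, a convergent operation such as the @subgroupAdd
> >>>>> placeholder used throughout this mail is currently declared with
> >>>>> nothing more than the function attribute:
> >>>>>
> >>>>> declare i32 @subgroupAdd(i32) convergent
> >>>>>
> >>>>> @subgroupAdd stands in for whatever target intrinsic implements a
> >>>>> subgroup reduction; it is not itself an LLVM intrinsic.)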
> >>>>>
> >>>>> Briefly, in whole-program vectorization environments such as GPUs, we
> >>>>> can expose opportunities for performance optimizations to developers
> >>>>> by allowing data exchange between threads that are grouped together as
> >>>>> lanes in a vector. As long as control flow is uniform, i.e. all
> >>>>> threads of the vector follow the same path through the CFG, it is
> >>>>> natural that all threads assigned to the same vector participate in
> >>>>> this data exchange. When control flow diverges, some threads may be
> >>>>> unavailable for the data exchange, but we still need to allow data
> >>>>> exchange among the threads that _are_ available.
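> >>>>>
> >>>>> As a minimal sketch of that last point, using the same placeholder
> >>>>> @subgroupAdd: only the threads that actually reach the %then block
> >>>>> participate in the reduction there:
> >>>>>
> >>>>> entry:
> >>>>> br i1 %cond, label %then, label %exit
> >>>>>
> >>>>> then:
> >>>>> ; communicates only among the threads for which %cond was true
> >>>>> %sum = call i32 @subgroupAdd(i32 %v) ; convergent
> >>>>> br label %exit
> >>>>>
> >>>>> exit:
> >>>>> ...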
> >>>>>
> >>>>> This means that those data exchange operations -- i.e., `convergent`
> >>>>> operations -- communicate among a set of threads that implicitly
> >>>>> depends on the position of the operation in control flow.
> >>>>>
> >>>>> The problem boils down to defining how the set of communicating
> >>>>> threads is determined.
> >>>>>
> >>>>> This problem inherently has a large target- and environment-specific
> >>>>> component. Even if control flow is fully uniform, there is the
> >>>>> question of which threads are grouped together in the first place. In
> >>>>> the same way that LLVM IR has no built-in understanding of how threads
> >>>>> are created (even in the more familiar multi-threaded CPU programming
> >>>>> world, where convergent operations don't exist), we don't concern
> >>>>> ourselves with these questions here. For the purpose of
> >>>>> target-independent LLVM IR
> >>>>> semantics, we simply assume that they are answered _somehow_ and
> >>>>> instead focus on the part of the problem that relates to control flow.
> >>>>>
> >>>>> When looking at high-level language source, it is often "intuitively
> >>>>> obvious" which threads should communicate. This is not the case in an
> >>>>> unstructured CFG. Here's an example of a non-trivial loop where a
> >>>>> reduction is used (the result of the reduction is the sum of the input
> >>>>> values of all participating threads):
> >>>>>
> >>>>> A:
> >>>>> br label %B
> >>>>>
> >>>>> B:
> >>>>> ...
> >>>>> %sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
> >>>>> ...
> >>>>> br i1 %cc1, label %B, label %C
> >>>>>
> >>>>> C:
> >>>>> ...
> >>>>> br i1 %cc2, label %B, label %D
> >>>>>
> >>>>> D:
> >>>>> ; loop exit
> >>>>>
> >>>>> Suppose this code is executed by two threads grouped in a (very short)
> >>>>> vector, and the threads execute the following sequences of basic blocks:
> >>>>>
> >>>>>> Thread 1: A B B C D
> >>>>>> Thread 2: A B C B C D
> >>>>>
> >>>>> There are two different intuitive ways of "aligning" the threads for
> >>>>> the purpose of communication between convergent operations:
> >>>>>
> >>>>>> Align based on natural loops:
> >>>>>> Thread 1: A B - B C D
> >>>>>> Thread 2: A B C B C D
> >>>>>>
> >>>>>> Align based on a nested loop interpretation:
> >>>>>> Thread 1: A B B C - - D
> >>>>>> Thread 2: A B - C B C D
> >>>>>
> >>>>> (Explanation of the alignment: In the first alignment, %sum.b is
> >>>>> always the sum of the input values from both threads. In the second
> >>>>> alignment, the first computation of %sum.b is collaborative between
> >>>>> the two threads; the second one in each thread simply returns the
> >>>>> input value for that thread.)
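> >>>>>
> >>>>> (A concrete worked example: suppose thread 1 passes %v = 1 and
> >>>>> thread 2 passes %v = 2. Under the natural loop alignment, %sum.b is
> >>>>> 1 + 2 = 3 in every execution of B in both threads. Under the nested
> >>>>> loop interpretation, the first execution of B yields %sum.b = 3 in
> >>>>> both threads, while the second execution of B is solitary in each
> >>>>> thread, yielding %sum.b = 1 in thread 1 and %sum.b = 2 in thread 2.)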
> >>>>>
> >>>>> The example only has a single natural loop, but it could result from a
> >>>>> high-level language source that has two nested do-while loops, so the
> >>>>> second interpretation may well be what the programmer intended.
> >>>>>
> >>>>> It has often been proposed that the alignment should be defined purely
> >>>>> based on the IR shown above, such as by always aligning based on
> >>>>> natural loops. This doesn't work, even if we ignore the possibility of
> >>>>> irreducible control flow. In the example, we could in principle
> >>>>> generate IR as follows if the "nested loop alignment" is desired:
> >>>>>
> >>>>> A:
> >>>>> br label %outer.hdr
> >>>>>
> >>>>> outer.hdr:
> >>>>> br label %B
> >>>>>
> >>>>> B:
> >>>>> ...
> >>>>> %sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
> >>>>> ...
> >>>>> br i1 %cc1, label %B, label %C
> >>>>>
> >>>>> C:
> >>>>> ...
> >>>>> br i1 %cc2, label %outer.hdr, label %D
> >>>>>
> >>>>> D:
> >>>>> ; loop exit
> >>>>>
> >>>>> From a single-threaded perspective, it would be correct to short-cut
> >>>>> all paths through outer.hdr, but this results in the original IR and
> >>>>> therefore different behavior for the convergent operation (under such
> >>>>> a proposal). Worse, the part of the IR which makes the short-cutting
> >>>>> incorrect -- i.e., the affected convergent operation -- is potentially
> >>>>> far away from the paths that we're short-cutting. We want to avoid
> >>>>> this "spooky action at a distance" because it puts an undue burden on
> >>>>> generic code transforms.
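> >>>>>
> >>>>> Concretely, the short-cut would retarget the two branches that pass
> >>>>> through outer.hdr, which reproduces exactly the original single-loop
> >>>>> IR shown before:
> >>>>>
> >>>>> A:
> >>>>> br label %B ; was: br label %outer.hdr
> >>>>>
> >>>>> C:
> >>>>> ...
> >>>>> br i1 %cc2, label %B, label %D ; was: label %outer.hdr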
> >>>>>
> >>>>>
> >>>>> The Solution
> >>>>> ------------
> >>>>> We introduce explicit annotations in the IR using "convergence tokens"
> >>>>> that "anchor" a convergent operation with respect to some other point
> >>>>> in control flow or with respect to the function entry. The details are
> >>>>> in the linked Phabricator review, so I will simply illustrate here how
> >>>>> this solves the example above.
> >>>>>
> >>>>> To obtain the original natural loop alignment, we augment the example
> >>>>> as follows:
> >>>>>
> >>>>> A:
> >>>>> %anchor = call token @llvm.experimental.convergence.anchor()
> >>>>> br label %B
> >>>>>
> >>>>> B:
> >>>>> %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
> >>>>> ...
> >>>>> %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %loop) ]
> >>>>> ...
> >>>>> br i1 %cc1, label %B, label %C
> >>>>>
> >>>>> C:
> >>>>> ...
> >>>>> br i1 %cc2, label %B, label %D
> >>>>>
> >>>>> D:
> >>>>> ; loop exit
> >>>>>
> >>>>> To obtain the loop nest alignment, we augment it as follows:
> >>>>>
> >>>>> A:
> >>>>> %anchor = call token @llvm.experimental.convergence.anchor()
> >>>>> br label %outer.hdr
> >>>>>
> >>>>> outer.hdr:
> >>>>> %outer = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
> >>>>> br label %B
> >>>>>
> >>>>> B:
> >>>>> %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
> >>>>> ...
> >>>>> %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %inner) ]
> >>>>> ...
> >>>>> br i1 %cc1, label %B, label %C
> >>>>>
> >>>>> C:
> >>>>> ...
> >>>>> br i1 %cc2, label %outer.hdr, label %D
> >>>>>
> >>>>> D:
> >>>>> ; loop exit
> >>>>>
> >>>>> We end up with two nested natural loops as before, but short-cutting
> >>>>> through outer.hdr is easily seen as forbidden because of the "loop
> >>>>> heart" intrinsic along the path through outer.hdr. (We call it a
> >>>>> "heart" because intuitively that's where loop iterations are counted
> >>>>> for the purpose of convergent operations, i.e. we count "heartbeats"
> >>>>> of the loop.) Having the non-removable outer.hdr block may seem like
> >>>>> an inefficiency, but at least in the AMDGPU backend we end up needing
> >>>>> that basic block anyway to implement the desired behavior. We don't
> >>>>> literally count loop iterations in the AMDGPU backend, but we end up
> >>>>> doing something isomorphic that also requires the block. We suspect
> >>>>> that other implementations will have similar requirements. In any
> >>>>> case, backends are free to optimize the block away using
> >>>>> target-specific handling.
> >>>>>
> >>>>>
> >>>>> Have you tried other solutions?
> >>>>> -------------------------------
> >>>>> Yes, we've tried many things over the years. I've personally been
> >>>>> thinking about and working on this problem on and off for a
> >>>>> significant amount of time over the last four years. I supervised a
> >>>>> master's thesis on a related topic during that time. Other
> >>>>> people have thought about the problem a lot as well. Discussions with
> >>>>> many people in the LLVM community, inside AMD, and elsewhere have all
> >>>>> helped our understanding of the problem. Many different ideas were
> >>>>> considered and ultimately rejected in the process.
> >>>>>
> >>>>> All attempts to solve the problem _without_ some form of "convergence
> >>>>> token" have suffered from spooky action at a distance. This includes
> >>>>> the current definition of `convergent` in the LangRef.
> >>>>>
> >>>>> We tried _very_ hard to find a workable definition of `convergent`
> >>>>> that doesn't talk about groups of threads in the LangRef, but
> >>>>> ultimately came to the conclusion that thinking about convergent
> >>>>> operations inherently involves threads of execution, and it's much
> >>>>> easier to just accept that. As I've already written above, convergent
> >>>>> operations really are a form of communication among threads (often not
> >>>>> through memory, or only through memory that is inaccessible to the
> >>>>> program), where the set of communicating threads is affected by the
> >>>>> static position of the operation in control flow.
> >>>>>
> >>>>> The proposal does not assume lock-step execution, even though that's
> >>>>> what many people think of when they think about GPU programming models. It
> >>>>> is purely expressed in terms of communicating threads, which could in
> >>>>> theory just be threads on a CPU. We believe that this is more in line
> >>>>> with how LLVM works. Besides, we found lock-step execution semantics
> >>>>> to be undesirable from first principles, because they prevent certain
> >>>>> desirable program transforms or at least make them much more difficult
> >>>>> to achieve.
> >>>>>
> >>>>> The most recent earlier proposal we brought to the LLVM community is
> >>>>> here: https://reviews.llvm.org/D68994. That one already used a form of
> >>>>> convergence token, but we ultimately rejected it for two main reasons.
> >>>>> First, the convergence control intrinsics described there create more
> >>>>> "noise" in the IR, i.e. there are more intrinsic calls than with the
> >>>>> current proposal. Second, the particular formalism of dynamic instances
> >>>>> used there had some undesirable side effects that put an undue burden on
> >>>>> transforms that should be trivial, such as dead-code elimination.
> >>>>>
> >>>>> The current proposal still talks about dynamic instances, but there is
> >>>>> no longer any automatic propagation of convergence: even if two
> >>>>> threads execute the same dynamic instance of an instruction, whether
> >>>>> they execute the same dynamic instance of the very next instruction in
> >>>>> straight-line code is implementation-defined in general (which means
> >>>>> that program transforms are free to make changes that could affect this).
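> >>>>>
> >>>>> As a sketch of what this means under the proposed semantics: in the
> >>>>> snippet below, nothing ties the two convergent calls together, so
> >>>>> they need not communicate among the same set of threads; making them
> >>>>> use a single shared token would be the way to demand that:
> >>>>>
> >>>>> %a = call token @llvm.experimental.convergence.anchor()
> >>>>> %x = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %a) ]
> >>>>> %b = call token @llvm.experimental.convergence.anchor()
> >>>>> ; %y may communicate among a different set of threads than %x,
> >>>>> ; because %b is a fresh anchor rather than the token %a
> >>>>> %y = call i32 @subgroupAdd(i32 %w) [ "convergencectrl"(token %b) ]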
> >>>>>
> >>>>> We chose to use an operand bundle for the convergence token so that
> >>>>> generic code can identify the relevant token value easily. We
> >>>>> considered adding the convergence token as a regular function call
> >>>>> argument, but that makes it impractical for generic transforms to
> >>>>> handle functions that take both a convergence token and some other
> >>>>> token argument.
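> >>>>>
> >>>>> For comparison, a sketch of both shapes (the second form, with a
> >>>>> hypothetical @subgroupAdd2, is the rejected one):
> >>>>>
> >>>>> ; proposed: the token travels in an operand bundle, so generic code
> >>>>> ; can find it without knowing anything about the callee's signature
> >>>>> %x = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %tok) ]
> >>>>>
> >>>>> ; rejected: as an ordinary argument, the convergence token is
> >>>>> ; indistinguishable from any other token parameter
> >>>>> %y = call i32 @subgroupAdd2(i32 %v, token %tok)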
> >>>>>
> >>>>> Cheers,
> >>>>> Nicolai
--
Learn what the world really is like,
but never forget how it ought to be.