[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics

Mon Aug 17 10:52:07 PDT 2020

On Mon, Aug 17, 2020 at 7:14 PM Hal Finkel <hfinkel at anl.gov> wrote:
> On 8/17/20 11:51 AM, Nicolai Hähnle wrote:
> > Hi Hal,
> >
> > On Mon, Aug 17, 2020 at 2:13 AM Hal Finkel <hfinkel at anl.gov> wrote:
> >> Thanks for sending this. What do you think that we should do with the
> >> existing convergent attribute?
> > My preference, which is implicitly expressed in the review, is to use
> > `convergent` both for the new and the old thing. They are implicitly
> > distinguished via the "convergencectrl" operand bundle.
> >
> > The main reason for going down that route is that `convergent` is
> > really an odd duck in that it is the only attribute in LLVM that
> > _adds_ new constraints on program transforms. All other attributes
> > _remove_ constraints on program transforms by expressing constraints
> > on what the function does (e.g. readnone, nosync, willreturn, ...).
> > Matt tried to push for changing that, making the "convergent"
> > restrictions the default and adding `noconvergent` on most things, but
> > that effort got stuck. In the meantime, let's avoid adding yet another
> > odd duck like this :)
> >
> > Cheers,
> > Nicolai
>
>
> Understood. Does the existing attribute have a sound definition, just
> not one that's useful for (all) GPU use cases? Or is the existing
> attribute not really sound and all use cases eventually need to be replaced?

The existing definition is insufficient for some GPU use cases, but it
also has a problem that I like to call spooky action at a distance:
some program transforms may have to look at code which they'd normally
not look at in order to correctly determine whether they're correct
for the current definition of `convergent`. The really problematic one
that I'm aware of in practice is jump threading -- the implementation
of that is incorrect wrt `convergent` -- but perhaps others are
affected as well.

Cheers,
Nicolai

>
>   -Hal
>
>
> >
> >
> >>    -Hal
> >>
> >> On 8/9/20 10:03 AM, Nicolai Hähnle via llvm-dev wrote:
> >>> Hi all,
> >>>
> >>> please see https://reviews.llvm.org/D85603 and its related changes for
> >>> our most recent and hopefully final attempt at putting the
> >>> `convergent` attribute on a solid theoretical foundation in a way that
> >>> is useful for modern GPU compiler use cases. We have clear line of
> >>> sight to enabling a new control flow implementation in the AMDGPU
> >>> backend which is built on this foundation. I have started upstreaming
> >>> some ground work recently. We expect this foundation to also be usable
> >>> by non-GPU whole-program vectorization environments, if they choose to
> >>> do so.
> >>>
> >>> What follows is a (not quite) brief overview of what this is all
> >>> about, but do see the linked review and its "stack" for a fuller story.
> >>>
> >>>
> >>> The Problem
> >>> -----------
> >>> `convergent` was originally introduced in LLVM to label "barrier"
> >>> instructions in GPU programming languages. Those put some unusual
> >>> constraints on program transforms that weren't particularly well
> >>> understood at the time. Since then, GPU languages have introduced
> >>> additional instructions such as wave shuffles or subgroup reductions
> >>> that made the problem harder and exposed some shortcomings of the
> >>> current definition of `convergent`. At the same time, those additional
> >>> instructions helped illuminate what the problem really is.
> >>>
> >>> Briefly, in whole-program vectorization environments such as GPUs, we
> >>> can expose opportunities for performance optimizations to developers
> >>> by allowing data exchange between threads that are grouped together as
> >>> lanes in a vector. As long as control flow is uniform, i.e. all
> >>> threads of the vector follow the same path through the CFG, it is
> >>> natural that all threads assigned to the same vector participate in
> >>> this data exchange. When control flow diverges, some threads may be
> >>> unavailable for the data exchange, but we still need to allow data
> >>> exchange among the threads that _are_ available.
> >>>
> >>> This means that those data exchange operations -- i.e., `convergent`
> >>> operations -- communicate among a set of threads that implicitly
> >>> depends on the position of the operation in control flow.
> >>>
> >>> The problem boils down to defining how the set of communicating
> >>> threads is determined.
> >>>
> >>> This problem inherently has a large target- and environment-specific
> >>> component. Even if control flow is fully uniform, there is the
> >>> question of which threads are grouped together in the first place. In
> >>> the same way that LLVM IR has no built-in understanding of how threads
> >>> come to be in the first place (even in the more well-known
> >>> multi-threaded CPU programming world where convergent operations don't
> >>> exist), we don't concern ourselves at all with these types of
> >>> questions here. For the purpose of target-independent LLVM IR
> >>> semantics, we simply assume that they are answered _somehow_ and
> >>> instead focus on the part of the problem that relates to control flow.
> >>>
> >>> When looking at high-level language source, it is often "intuitively
> >>> obvious" which threads should communicate. This is not the case in an
> >>> unstructured CFG. Here's an example of a non-trivial loop where a
> >>> reduction is used (the result of the reduction is the sum of the input
> >>> values of all participating threads):
> >>>
> >>>    A:
> >>>      br label %B
> >>>
> >>>    B:
> >>>      ...
> >>>      %sum.b = call i32 @subgroupAdd(i32 %v)  ; convergent
> >>>      ...
> >>>      br i1 %cc1, label %B, label %C
> >>>
> >>>    C:
> >>>      ...
> >>>      br i1 %cc2, label %B, label %D
> >>>
> >>>    D:
> >>>      ; loop exit
> >>>
> >>> Suppose this code is executed by two threads grouped in a (very short)
> >>> vector, and the threads execute the following sequences of basic blocks:
> >>>
> >>>>   Thread 1: A B B C D
> >>>>   Thread 2: A B C B C D
> >>> There are two different intuitive ways of "aligning" the threads for
> >>> the purpose of communication between convergent operations:
> >>>
> >>>>   Align based on natural loops:
> >>>>     Thread 1: A B - B C D
> >>>>     Thread 2: A B C B C D
> >>>>   Align based on a nested loop interpretation:
> >>>>     Thread 1: A B B C - - D
> >>>>     Thread 2: A B - C B C D
> >>> (Explanation of the alignment: In the first alignment, %sum.b is
> >>> always the sum of the input values from both threads. In the second
> >>> alignment, the first computation of %sum.b is collaborative between
> >>> the two threads; the second one in each thread simply returns the
> >>> input value for that thread.)
> >>>
> >>> The example only has a single natural loop, but it could result from a
> >>> high-level language source that has two nested do-while loops, so the
> >>> second interpretation may well be what the programmer intended.
> >>>
> >>> It has often been proposed that the alignment should be defined purely
> >>> based on the IR shown above, such as by always aligning based on
> >>> natural loops. This doesn't work, even if we ignore the possibility of
> >>> irreducible control flow. In the example, we could in principle
> >>> generate IR as follows if the "nested loop alignment" is desired:
> >>>
> >>>    A:
> >>>      br label %outer.hdr
> >>>
> >>>    outer.hdr:
> >>>      br label %B
> >>>
> >>>    B:
> >>>      ...
> >>>      %sum.b = call i32 @subgroupAdd(i32 %v)  ; convergent
> >>>      ...
> >>>      br i1 %cc1, label %B, label %C
> >>>
> >>>    C:
> >>>      ...
> >>>      br i1 %cc2, label %outer.hdr, label %D
> >>>
> >>>    D:
> >>>      ; loop exit
> >>>
> >>>  From a single-threaded perspective, it would be correct to short-cut
> >>> all paths through outer.hdr, but this results in the original IR and
> >>> therefore different behavior for the convergent operation (under such
> >>> a proposal). Worse, the part of the IR which makes the short-cutting
> >>> incorrect -- i.e., the affected convergent operation -- is potentially
> >>> far away from the paths that we're short-cutting. We want to avoid
> >>> this "spooky action at a distance" because it puts an undue burden on
> >>> generic code transforms.
> >>>
> >>>
> >>> The Solution
> >>> ------------
> >>> We introduce explicit annotations in the IR using "convergence tokens"
> >>> that "anchor" a convergent operation with respect to some other point
> >>> in control flow or with respect to the function entry. The details are
> >>> in the linked Phabricator review, so I will simply illustrate here how
> >>> this solves the example above.
> >>>
> >>> To obtain the original natural loop alignment, we augment the example
> >>> as follows:
> >>>
> >>>    A:
> >>>      %anchor = call token @llvm.experimental.convergence.anchor()
> >>>      br label %B
> >>>
> >>>    B:
> >>>      %loop = call token @llvm.experimental.convergence.loop() [
> >>> "convergencectrl"(token %anchor) ]
> >>>      ...
> >>>      %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token
> >>> %loop) ]
> >>>      ...
> >>>      br i1 %cc1, label %B, label %C
> >>>
> >>>    C:
> >>>      ...
> >>>      br i1 %cc2, label %B, label %D
> >>>
> >>>    D:
> >>>      ; loop exit
> >>>
> >>> To obtain the loop nest alignment, we augment it as follows:
> >>>
> >>>    A:
> >>>      %anchor = call token @llvm.experimental.convergence.anchor()
> >>>      br label %outer.hdr
> >>>
> >>>    outer.hdr:
> >>>      %outer = call token @llvm.experimental.convergence.loop() [
> >>> "convergencectrl"(token %anchor) ]
> >>>      br label %B
> >>>
> >>>    B:
> >>>      %inner = call token @llvm.experimental.convergence.loop() [
> >>> "convergencectrl"(token %outer) ]
> >>>      ...
> >>>      %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token
> >>> %inner) ]
> >>>      ...
> >>>      br i1 %cc1, label %B, label %C
> >>>
> >>>    C:
> >>>      ...
> >>>      br i1 %cc2, label %outer.hdr, label %D
> >>>
> >>>    D:
> >>>      ; loop exit
> >>>
> >>> We end up with two nested natural loops as before, but short-cutting
> >>> through outer.hdr is easily seen as forbidden because of the "loop
> >>> heart" intrinsic along the path through outer.hdr. (We call it a
> >>> "heart" because intuitively that's where loop iterations are counted
> >>> for the purpose of convergent operations, i.e. we count "heartbeats"
> >>> of the loop.) Having the non-removable outer.hdr block may seem like
> >>> an inefficiency, but at least in the AMDGPU backend we end up needing
> >>> that basic block anyway to implement the desired behavior. We don't
> >>> literally count loop iterations in the AMDGPU backend, but we end up
> >>> doing something isomorphic that also requires the block. We suspect
> >>> that other implementations will have similar requirements. In any
> >>> case, backends are free to optimize the block away using
> >>> target-specific handling.
> >>>
> >>>
> >>> Have you tried other solutions?
> >>> -------------------------------
> >>> Yes, we've tried many things over the years. I've personally been
> >>> thinking about and working on this problem on and off for a
> >>> significant amount of time over the last four years. I supervised a
> >>> practical master thesis on a related topic during this time. Other
> >>> people have thought about the problem a lot as well. Discussions with
> >>> many people in the LLVM community, inside AMD, and elsewhere have all
> >>> helped our understanding of the problem. Many different ideas were
> >>> considered and ultimately rejected in the process.
> >>>
> >>> All attempts to solve the problem _without_ some form of "convergence
> >>> token" have suffered from spooky action at a distance. This includes
> >>> the current definition of `convergent` in the LangRef.
> >>>
> >>> We tried _very_ hard to find a workable definition of `convergent`
> >>> that doesn't talk about groups of threads in the LangRef, but
> >>> ultimately came to the conclusion that thinking about convergent
> >>> operations inherently involves threads of execution, and it's much
> >>> easier to just accept that. As I've already written above, convergent
> >>> operations really are a form of (often non- or inaccessible-memory)
> >>> communication among threads where the set of communicating threads is
> >>> affected by the static position of the operation in control flow.
> >>>
> >>> The proposal contains no lock-step execution, even though that's what
> >>> many people think of when they think about GPU programming models. It
> >>> is purely expressed in terms of communicating threads, which could in
> >>> theory just be threads on a CPU. We believe that this is more in line
> >>> with how LLVM works. Besides, we found lock-step execution semantics
> >>> to be undesirable from first principles, because they prevent certain
> >>> desirable program transforms or at least make them much more difficult
> >>> to achieve.
> >>>
> >>> The most recent earlier proposal we brought to the LLVM community is
> >>> here: https://reviews.llvm.org/D68994. That one already used a form of
> >>> convergence token, but we ultimately rejected it for two main reasons.
> >>> First, the convergence control intrinsics described there create more
> >>> "noise" in the IR, i.e. there are more intrinsic calls than with what
> >>> we have now. Second, the particular formalism of dynamic instances
> >>> used there had some undesirable side effects that put undue burden on
> >>> transforms that should be trivial like dead-code elimination.
> >>>
> >>> The current proposal still talks about dynamic instances, but there is
> >>> no longer any automatic propagation of convergence: even if two
> >>> threads execute the same dynamic instances of an instruction, whether
> >>> they execute the same dynamic instance of the very next instruction in
> >>> straight-line code is implementation-defined in general (which means
> >>> that program transforms are free to make changes that could affect this).
> >>>
> >>> We chose to use an operand bundle for the convergence token so that
> >>> generic code can identify the relevant token value easily. We
> >>> considered adding the convergence token as a regular function call
> >>> argument, but this makes it impractical for generic transforms to
> >>> understand functions that take both a convergence token and some other
> >>> token.
> >>>
> >>> Cheers,
> >>> Nicolai
> >> --
> >> Hal Finkel
> >> Lead, Compiler Technology and Programming Languages
> >> Leadership Computing Facility
> >> Argonne National Laboratory
> >>
> >
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>

-- 
Lerne, wie die Welt wirklich ist,
aber vergiss niemals, wie sie sein sollte.