[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics

Nicolai Hähnle via llvm-dev llvm-dev at lists.llvm.org
Fri Oct 30 05:08:57 PDT 2020


On 29.10.20 23:33, Reid Kleckner wrote:
> I watched the talk on the design, and it all made sense to me. I can't 
> claim to have a deep knowledge of the requirements of GPU architectures, 
> but I can say that this is basically the kind of stuff we had in mind 
> when the token type was designed. What you are saying about modelling 
> these special GPU operations as accessing inaccessible memory makes 
> sense to me, but again, I am not an expert.
> 
> One of the challenges we've faced when trying to create regions for 
> unpacked call sequences[1] is that unreachable code elimination can 
> often "break" a region by deleting the token consumer. It's not clear if 
> your proposal suffers from this problem, but it's something to keep in 
> mind, since optimizers can often discover that a plain store is storing 
> to a null pointer; that store can then become unreachable, and there 
> goes the entire rest of the function, leaving you with a half-open 
> region. I 
> can't imagine a plausible series of transforms on a reasonable GPU 
> program would end up in this situation, though.
> [1] http://lists.llvm.org/pipermail/llvm-dev/2020-January/138627.html

Thanks for the feedback, Reid. I try not to make assumptions about what 
is a "reasonable" GPU program ;)

One advantage that we have here is that there is no need to explicitly 
"end" a token region. Rather, we have a single token producer and zero 
or more consumers. Deleting a consumer doesn't change the semantic 
guarantees we make for any of the non-deleted code. At least, I've 
thought about this a fair bit, including some admittedly incomplete 
proof sketches, and I'm fairly confident about it at this point.
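
For illustration, here is a minimal sketch of that producer/consumer 
structure (@convergent_a and @convergent_b are hypothetical convergent 
functions, not part of the proposal):

   ; single token producer
   %tok = call token @llvm.experimental.convergence.anchor()
   ...
   ; any number of consumers, each referring to the same token
   call void @convergent_a() [ "convergencectrl"(token %tok) ]
   ...
   call void @convergent_b() [ "convergencectrl"(token %tok) ]

Deleting either consumer leaves the guarantees for the remaining one 
intact; there is no "closing" instruction that dead-code elimination 
could accidentally remove.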

For what it's worth, `unreachable` and multiple return statements are 
things we see in GPU code in practice, and straightforward support for 
dead-code elimination was an explicit design goal.

One interesting effect can occur and needs to be kept in mind when the 
`anchor` intrinsic is used, because this intrinsic is largely 
implementation-defined. Suppose you have:

   %tok = call token @llvm.experimental.convergence.anchor()

   ... lots of code with control flow ...

   call void @convergent_op() [ "convergencectrl"(token %tok) ]

   ...

   %tok2 = call token @llvm.experimental.convergence.anchor()
   call void @second_op() [ "convergencectrl"(token %tok2) ]

Deleting the call to @convergent_op could cause a different set of 
threads to arrive together at the second anchor, and hence at 
@second_op, during actual execution. Whether the deletion makes a 
difference in practice depends on the details of the underlying 
divergence / reconvergence mechanisms.
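
For concreteness, a sketch of what this could look like after the 
deletion (assuming the then-unused %tok also gets cleaned up):

   ... lots of code with control flow ...

   %tok2 = call token @llvm.experimental.convergence.anchor()
   call void @second_op() [ "convergencectrl"(token %tok2) ]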

That's okay though, because it just ends up being a different refinement 
of the original IR semantics. Allowing this kind of freedom (even 
including non-determinism) is exactly the point of the `anchor` 
intrinsic, and if the programmer/frontend didn't want that, some 
alternative structure (typically based on `entry`) would have to be used 
instead.
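
For reference, a minimal sketch of what such an entry-based structure 
might look like (@kernel is a placeholder name; the `entry` intrinsic 
from the proposal ties the convergent operation to the set of threads 
that entered the function together):

   define void @kernel() convergent {
   entry:
     %tok = call token @llvm.experimental.convergence.entry()
     ...
     call void @convergent_op() [ "convergencectrl"(token %tok) ]
     ...
   }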

Cheers,
Nicolai


> 
> On Wed, Oct 28, 2020 at 3:14 PM Nicolai Hähnle via llvm-dev 
> <llvm-dev at lists.llvm.org> wrote:
> 
>     Hi all,
> 
>     As there has been quiet on the topic for quite some time now, I'm
>     pinging this thread to see if ideally we can just go ahead with the
>     current proposal, or if/how more discussion is required.
> 
>     Quick pointer to the proposed LangRef changes for the `convergent`
>     attribute is here: https://reviews.llvm.org/D85603
> 
>     The presentation from the dev meeting three weeks ago is here:
>     https://www.youtube.com/watch?v=_Z5DuiVCFAw
> 
>     Thanks,
>     Nicolai
> 
> 
>     On Tue, Aug 18, 2020 at 10:44 AM Nicolai Hähnle 
>     <nhaehnle at gmail.com> wrote:
>      >
>      > Hi Hal,
>      >
>      > On Mon, Aug 17, 2020 at 8:30 PM Hal Finkel 
>      > <hfinkel at anl.gov> wrote:
>      > > I agree that the new scheme seems better for several reasons. So it
>      > > seems like your recommendation is that we consider the existing
>      > > attribute usage deprecated?
>      >
>      > Yes.
>      >
>      > Some form of auto-upgrade could be attempted via the
>      > ConvergenceControlHeuristic of https://reviews.llvm.org/D85609,
>     though
>      > that's slightly big for on-the-fly upgrading in the bitcode
>     reader. So
>      > while the groundwork for potentially doing it is there, I don't see a
>      > strong need or desire for it at the moment.
>      >
>      > Cheers,
>      > Nicolai
>      >
>      >
>      > > Or should we auto-upgrade it to something?
>      > >
>      > >   -Hal
>      > >
>      > >
>      > > >
>      > > >
>      > > >>    -Hal
>      > > >>
>      > > >>
>      > > >>>
>      > > >>>>     -Hal
>      > > >>>>
>      > > >>>> On 8/9/20 10:03 AM, Nicolai Hähnle via llvm-dev wrote:
>      > > >>>>> Hi all,
>      > > >>>>>
>      > > >>>>> please see https://reviews.llvm.org/D85603 and its
>     related changes for
>      > > >>>>> our most recent and hopefully final attempt at putting the
>      > > >>>>> `convergent` attribute on a solid theoretical foundation
>     in a way that
>      > > >>>>> is useful for modern GPU compiler use cases. We have
>     clear line of
>      > > >>>>> sight to enabling a new control flow implementation in
>     the AMDGPU
>      > > >>>>> backend which is built on this foundation. I have started
>     upstreaming
>      > > >>>>> some ground work recently. We expect this foundation to
>     also be usable
>      > > >>>>> by non-GPU whole-program vectorization environments, if
>     they choose to
>      > > >>>>> do so.
>      > > >>>>>
>      > > >>>>> What follows is a (not quite) brief overview of what this
>     is all
>      > > >>>>> about, but do see the linked review and its "stack" for a
>     fuller story.
>      > > >>>>>
>      > > >>>>>
>      > > >>>>> The Problem
>      > > >>>>> -----------
>      > > >>>>> `convergent` was originally introduced in LLVM to label
>     "barrier"
>      > > >>>>> instructions in GPU programming languages. Those put some
>     unusual
>      > > >>>>> constraints on program transforms that weren't
>     particularly well
>      > > >>>>> understood at the time. Since then, GPU languages have
>     introduced
>      > > >>>>> additional instructions such as wave shuffles or subgroup
>     reductions
>      > > >>>>> that made the problem harder and exposed some
>     shortcomings of the
>      > > >>>>> current definition of `convergent`. At the same time,
>     those additional
>      > > >>>>> instructions helped illuminate what the problem really is.
>      > > >>>>>
>      > > >>>>> Briefly, in whole-program vectorization environments such
>     as GPUs, we
>      > > >>>>> can expose opportunities for performance optimizations to
>     developers
>      > > >>>>> by allowing data exchange between threads that are
>     grouped together as
>      > > >>>>> lanes in a vector. As long as control flow is uniform,
>     i.e. all
>      > > >>>>> threads of the vector follow the same path through the
>     CFG, it is
>      > > >>>>> natural that all threads assigned to the same vector
>     participate in
>      > > >>>>> this data exchange. When control flow diverges, some
>     threads may be
>      > > >>>>> unavailable for the data exchange, but we still need to
>     allow data
>      > > >>>>> exchange among the threads that _are_ available.
>      > > >>>>>
>      > > >>>>> This means that those data exchange operations -- i.e.,
>     `convergent`
>      > > >>>>> operations -- communicate among a set of threads that
>     implicitly
>      > > >>>>> depends on the position of the operation in control flow.
>      > > >>>>>
>      > > >>>>> The problem boils down to defining how the set of
>     communicating
>      > > >>>>> threads is determined.
>      > > >>>>>
>      > > >>>>> This problem inherently has a large target- and
>     environment-specific
>      > > >>>>> component. Even if control flow is fully uniform, there
>     is the
>      > > >>>>> question of which threads are grouped together in the
>     first place. In
>      > > >>>>> the same way that LLVM IR has no built-in understanding
>     of how threads
>      > > >>>>> come to be in the first place (even in the more well-known
>      > > >>>>> multi-threaded CPU programming world where convergent
>     operations don't
>      > > >>>>> exist), we don't concern ourselves at all with these types of
>      > > >>>>> questions here. For the purpose of target-independent LLVM IR
>      > > >>>>> semantics, we simply assume that they are answered
>     _somehow_ and
>      > > >>>>> instead focus on the part of the problem that relates to
>     control flow.
>      > > >>>>>
>      > > >>>>> When looking at high-level language source, it is often
>     "intuitively
>      > > >>>>> obvious" which threads should communicate. This is not
>     the case in an
>      > > >>>>> unstructured CFG. Here's an example of a non-trivial loop
>     where a
>      > > >>>>> reduction is used (the result of the reduction is the sum
>     of the input
>      > > >>>>> values of all participating threads):
>      > > >>>>>
>      > > >>>>>     A:
>      > > >>>>>       br label %B
>      > > >>>>>
>      > > >>>>>     B:
>      > > >>>>>       ...
>      > > >>>>>       %sum.b = call i32 @subgroupAdd(i32 %v)  ; convergent
>      > > >>>>>       ...
>      > > >>>>>       br i1 %cc1, label %B, label %C
>      > > >>>>>
>      > > >>>>>     C:
>      > > >>>>>       ...
>      > > >>>>>       br i1 %cc2, label %B, label %D
>      > > >>>>>
>      > > >>>>>     D:
>      > > >>>>>       ; loop exit
>      > > >>>>>
>      > > >>>>> Suppose this code is executed by two threads grouped in a
>     (very short)
>      > > >>>>> vector, and the threads execute the following sequences
>     of basic blocks:
>      > > >>>>>
>      > > >>>>>>    Thread 1: A B B C D
>      > > >>>>>>    Thread 2: A B C B C D
>      > > >>>>> There are two different intuitive ways of "aligning" the
>     threads for
>      > > >>>>> the purpose of communication between convergent operations:
>      > > >>>>>
>      > > >>>>>>    Align based on natural loops:
>      > > >>>>>>      Thread 1: A B - B C D
>      > > >>>>>>      Thread 2: A B C B C D
>      > > >>>>>>    Align based on a nested loop interpretation:
>      > > >>>>>>      Thread 1: A B B C - - D
>      > > >>>>>>      Thread 2: A B - C B C D
>      > > >>>>> (Explanation of the alignment: In the first alignment,
>     %sum.b is
>      > > >>>>> always the sum of the input values from both threads. In
>     the second
>      > > >>>>> alignment, the first computation of %sum.b is
>     collaborative between
>      > > >>>>> the two threads; the second one in each thread simply
>     returns the
>      > > >>>>> input value for that thread.)
>      > > >>>>>
>      > > >>>>> The example only has a single natural loop, but it could
>     result from a
>      > > >>>>> high-level language source that has two nested do-while
>     loops, so the
>      > > >>>>> second interpretation may well be what the programmer
>     intended.
>      > > >>>>>
>      > > >>>>> It has often been proposed that the alignment should be
>     defined purely
>      > > >>>>> based on the IR shown above, such as by always aligning
>     based on
>      > > >>>>> natural loops. This doesn't work, even if we ignore the
>     possibility of
>      > > >>>>> irreducible control flow. In the example, we could in
>     principle
>      > > >>>>> generate IR as follows if the "nested loop alignment" is
>     desired:
>      > > >>>>>
>      > > >>>>>     A:
>      > > >>>>>       br label %outer.hdr
>      > > >>>>>
>      > > >>>>>     outer.hdr:
>      > > >>>>>       br label %B
>      > > >>>>>
>      > > >>>>>     B:
>      > > >>>>>       ...
>      > > >>>>>       %sum.b = call i32 @subgroupAdd(i32 %v)  ; convergent
>      > > >>>>>       ...
>      > > >>>>>       br i1 %cc1, label %B, label %C
>      > > >>>>>
>      > > >>>>>     C:
>      > > >>>>>       ...
>      > > >>>>>       br i1 %cc2, label %outer.hdr, label %D
>      > > >>>>>
>      > > >>>>>     D:
>      > > >>>>>       ; loop exit
>      > > >>>>>
>      > > >>>>> From a single-threaded perspective, it would be correct 
>     to short-cut
>      > > >>>>> all paths through outer.hdr, but this results in the
>     original IR and
>      > > >>>>> therefore different behavior for the convergent operation
>     (under such
>      > > >>>>> a proposal). Worse, the part of the IR which makes the
>     short-cutting
>      > > >>>>> incorrect -- i.e., the affected convergent operation --
>     is potentially
>      > > >>>>> far away from the paths that we're short-cutting. We want
>     to avoid
>      > > >>>>> this "spooky action at a distance" because it puts an
>     undue burden on
>      > > >>>>> generic code transforms.
>      > > >>>>>
>      > > >>>>>
>      > > >>>>> The Solution
>      > > >>>>> ------------
>      > > >>>>> We introduce explicit annotations in the IR using
>     "convergence tokens"
>      > > >>>>> that "anchor" a convergent operation with respect to some
>     other point
>      > > >>>>> in control flow or with respect to the function entry.
>     The details are
>      > > >>>>> in the linked Phabricator review, so I will simply
>     illustrate here how
>      > > >>>>> this solves the example above.
>      > > >>>>>
>      > > >>>>> To obtain the original natural loop alignment, we augment
>     the example
>      > > >>>>> as follows:
>      > > >>>>>
>      > > >>>>>     A:
>      > > >>>>>       %anchor = call token
>     @llvm.experimental.convergence.anchor()
>      > > >>>>>       br label %B
>      > > >>>>>
>      > > >>>>>     B:
>      > > >>>>>       %loop = call token
>     @llvm.experimental.convergence.loop() [
>      > > >>>>> "convergencectrl"(token %anchor) ]
>      > > >>>>>       ...
>      > > >>>>>       %sum.b = call i32 @subgroupAdd(i32 %v) [
>     "convergencectrl"(token
>      > > >>>>> %loop) ]
>      > > >>>>>       ...
>      > > >>>>>       br i1 %cc1, label %B, label %C
>      > > >>>>>
>      > > >>>>>     C:
>      > > >>>>>       ...
>      > > >>>>>       br i1 %cc2, label %B, label %D
>      > > >>>>>
>      > > >>>>>     D:
>      > > >>>>>       ; loop exit
>      > > >>>>>
>      > > >>>>> To obtain the loop nest alignment, we augment it as follows:
>      > > >>>>>
>      > > >>>>>     A:
>      > > >>>>>       %anchor = call token
>     @llvm.experimental.convergence.anchor()
>      > > >>>>>       br label %outer.hdr
>      > > >>>>>
>      > > >>>>>     outer.hdr:
>      > > >>>>>       %outer = call token
>     @llvm.experimental.convergence.loop() [
>      > > >>>>> "convergencectrl"(token %anchor) ]
>      > > >>>>>       br label %B
>      > > >>>>>
>      > > >>>>>     B:
>      > > >>>>>       %inner = call token
>     @llvm.experimental.convergence.loop() [
>      > > >>>>> "convergencectrl"(token %outer) ]
>      > > >>>>>       ...
>      > > >>>>>       %sum.b = call i32 @subgroupAdd(i32 %v) [
>     "convergencectrl"(token
>      > > >>>>> %inner) ]
>      > > >>>>>       ...
>      > > >>>>>       br i1 %cc1, label %B, label %C
>      > > >>>>>
>      > > >>>>>     C:
>      > > >>>>>       ...
>      > > >>>>>       br i1 %cc2, label %outer.hdr, label %D
>      > > >>>>>
>      > > >>>>>     D:
>      > > >>>>>       ; loop exit
>      > > >>>>>
>      > > >>>>> We end up with two nested natural loops as before, but
>     short-cutting
>      > > >>>>> through outer.hdr is easily seen as forbidden because of
>     the "loop
>      > > >>>>> heart" intrinsic along the path through outer.hdr. (We
>     call it a
>      > > >>>>> "heart" because intuitively that's where loop iterations
>     are counted
>      > > >>>>> for the purpose of convergent operations, i.e. we count
>     "heartbeats"
>      > > >>>>> of the loop.) Having the non-removable outer.hdr block
>     may seem like
>      > > >>>>> an inefficiency, but at least in the AMDGPU backend we
>     end up needing
>      > > >>>>> that basic block anyway to implement the desired
>     behavior. We don't
>      > > >>>>> literally count loop iterations in the AMDGPU backend,
>     but we end up
>      > > >>>>> doing something isomorphic that also requires the block.
>     We suspect
>      > > >>>>> that other implementations will have similar
>     requirements. In any
>      > > >>>>> case, backends are free to optimize the block away using
>      > > >>>>> target-specific handling.
>      > > >>>>>
>      > > >>>>>
>      > > >>>>> Have you tried other solutions?
>      > > >>>>> -------------------------------
>      > > >>>>> Yes, we've tried many things over the years. I've
>     personally been
>      > > >>>>> thinking about and working on this problem on and off for a
>      > > >>>>> significant amount of time over the last four years. I
>     supervised a
>      > > >>>>> practical master's thesis on a related topic during this 
>     time. Other
>      > > >>>>> people have thought about the problem a lot as well.
>     Discussions with
>      > > >>>>> many people in the LLVM community, inside AMD, and
>     elsewhere have all
>      > > >>>>> helped our understanding of the problem. Many different
>     ideas were
>      > > >>>>> considered and ultimately rejected in the process.
>      > > >>>>>
>      > > >>>>> All attempts to solve the problem _without_ some form of
>     "convergence
>      > > >>>>> token" have suffered from spooky action at a distance.
>     This includes
>      > > >>>>> the current definition of `convergent` in the LangRef.
>      > > >>>>>
>      > > >>>>> We tried _very_ hard to find a workable definition of
>     `convergent`
>      > > >>>>> that doesn't talk about groups of threads in the LangRef, but
>      > > >>>>> ultimately came to the conclusion that thinking about
>     convergent
>      > > >>>>> operations inherently involves threads of execution, and
>     it's much
>      > > >>>>> easier to just accept that. As I've already written
>     above, convergent
>      > > >>>>> operations really are a form of (often non- or
>     inaccessible-memory)
>      > > >>>>> communication among threads where the set of
>     communicating threads is
>      > > >>>>> affected by the static position of the operation in
>     control flow.
>      > > >>>>>
>      > > >>>>> The proposal contains no lock-step execution, even though
>     that's what
>      > > >>>>> many people think of when they think about GPU
>     programming models. It
>      > > >>>>> is purely expressed in terms of communicating threads,
>     which could in
>      > > >>>>> theory just be threads on a CPU. We believe that this is
>     more in line
>      > > >>>>> with how LLVM works. Besides, we found lock-step
>     execution semantics
>      > > >>>>> to be undesirable from first principles, because they
>     prevent certain
>      > > >>>>> desirable program transforms or at least make them much
>     more difficult
>      > > >>>>> to achieve.
>      > > >>>>>
>      > > >>>>> The most recent earlier proposal we brought to the LLVM
>     community is
>      > > >>>>> here: https://reviews.llvm.org/D68994. That one already
>     used a form of
>      > > >>>>> convergence token, but we ultimately rejected it for two
>     main reasons.
>      > > >>>>> First, the convergence control intrinsics described there
>     create more
>      > > >>>>> "noise" in the IR, i.e. there are more intrinsic calls
>     than with what
>      > > >>>>> we have now. Second, the particular formalism of dynamic
>     instances
>      > > >>>>> used there had some undesirable side effects that put
>     undue burden on
>      > > >>>>> transforms that should be trivial like dead-code elimination.
>      > > >>>>>
>      > > >>>>> The current proposal still talks about dynamic instances,
>     but there is
>      > > >>>>> no longer any automatic propagation of convergence: even
>     if two
>      > > >>>>> threads execute the same dynamic instances of an
>     instruction, whether
>      > > >>>>> they execute the same dynamic instance of the very next
>     instruction in
>      > > >>>>> straight-line code is implementation-defined in general
>     (which means
>      > > >>>>> that program transforms are free to make changes that
>     could affect this).
>      > > >>>>>
>      > > >>>>> We chose to use an operand bundle for the convergence
>     token so that
>      > > >>>>> generic code can identify the relevant token value easily. We
>      > > >>>>> considered adding the convergence token as a regular
>     function call
>      > > >>>>> argument, but this makes it impractical for generic
>     transforms to
>      > > >>>>> understand functions that take both a convergence token
>     and some other
>      > > >>>>> token.
>      > > >>>>>
>      > > >>>>> Cheers,
>      > > >>>>> Nicolai
>      > > >>>> --
>      > > >>>> Hal Finkel
>      > > >>>> Lead, Compiler Technology and Programming Languages
>      > > >>>> Leadership Computing Facility
>      > > >>>> Argonne National Laboratory
>      > > >>>>
>      > > >> --
>      > > >> Hal Finkel
>      > > >> Lead, Compiler Technology and Programming Languages
>      > > >> Leadership Computing Facility
>      > > >> Argonne National Laboratory
>      > > >>
>      > > >
>      > > --
>      > > Hal Finkel
>      > > Lead, Compiler Technology and Programming Languages
>      > > Leadership Computing Facility
>      > > Argonne National Laboratory
>      > >
>      >
>      >
>      > --
>      > Learn how the world really is,
>      > but never forget how it should be.
> 
> 
> 
>     -- 
>     Learn how the world really is,
>     but never forget how it should be.
> 

-- 
Learn how the world really is,
but never forget how it should be.

