[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics
Nicolai Hähnle via llvm-dev
llvm-dev at lists.llvm.org
Fri Oct 30 05:08:57 PDT 2020
On 29.10.20 23:33, Reid Kleckner wrote:
> I watched the talk on the design, and it all made sense to me. I can't
> claim to have a deep knowledge of the requirements of GPU architectures,
> but I can say that this is basically the kind of stuff we had in mind
> when the token type was designed. What you are saying about modelling
> these special GPU operations as accessing inaccessible memory makes
> sense to me, but again, I am not an expert.
>
> One of the challenges we've faced when trying to create regions for
> unpacked call sequences[1] is that unreachable code elimination can
> often "break" a region by deleting the token consumer. It's not clear if
> your proposal suffers from this problem, but it's something to keep in
> mind, since optimizers can often discover that a plain store is storing
> to a null pointer, which can then become unreachable, and there goes the
> entire rest of the function, leaving you with a half-open region. I
> can't imagine a plausible series of transforms on a reasonable GPU
> program would end up in this situation, though.
> [1] http://lists.llvm.org/pipermail/llvm-dev/2020-January/138627.html
Thanks for the feedback, Reid. I try not to make assumptions about what
is a "reasonable" GPU program ;)
One advantage that we have here is that there is no need to explicitly
"end" a token region. Rather, we have a single token producer and 0 to
any number of consumers. Deleting a consumer doesn't make a difference
for the semantic guarantees we make for any of the non-deleted code. At
least, I've thought about this quite a bit, including some very
incomplete proof sketches, and I'm pretty confident about it at this point.
For what it's worth, `unreachable` and multiple return statements are
things we see in GPU code in practice, and straightforward support of
dead-code elimination was an explicit design goal.
One interesting effect can occur and needs to be kept in mind when
the `anchor` intrinsic is used, because this intrinsic is largely
implementation-defined. If you have:
%tok = call token @llvm.experimental.convergence.anchor()
... lots of code with control flow ...
call void @convergent_op() [ "convergencectrl"(token %tok) ]
...
%tok2 = call token @llvm.experimental.convergence.anchor()
call void @second_op() [ "convergencectrl"(token %tok2) ]
Deleting the @convergent_op call could cause a different set of
threads to arrive together at the second anchor, and hence at
@second_op, during actual execution. Whether the deletion makes a
difference depends on the details of the underlying divergence /
reconvergence mechanisms.
That's okay though, because it just ends up being a different refinement
of the original IR semantics. Allowing this kind of freedom (even
including non-determinism) is exactly the point of the `anchor`
intrinsic, and if the programmer/frontend didn't want that, some
alternative structure (typically based on `entry`) would have to be used
instead.
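To illustrate, here is a minimal sketch of what such an entry-based structure could look like. This is an assumption-laden example rather than code from the proposal: @kernel and @convergent_op are hypothetical placeholders, and the intrinsic name follows the LangRef changes in https://reviews.llvm.org/D85603.

```llvm
; Sketch: tie the convergent operation to the set of threads that
; entered the function together, rather than leaving the communicating
; set implementation-defined as with anchor.
define void @kernel(i1 %cc) convergent {
entry:
  %tok = call token @llvm.experimental.convergence.entry()
  br i1 %cc, label %then, label %exit

then:
  ; Communicates among exactly those threads that entered @kernel
  ; together and then branched to %then.
  call void @convergent_op() [ "convergencectrl"(token %tok) ]
  br label %exit

exit:
  ret void
}

declare token @llvm.experimental.convergence.entry()
declare void @convergent_op() convergent
```

With a structure like this, deleting some other convergent operation cannot change which threads are meant to communicate at @convergent_op, because the token is pinned to function entry rather than to an implementation-defined anchor point.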
Cheers,
Nicolai
>
> On Wed, Oct 28, 2020 at 3:14 PM Nicolai Hähnle via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> Hi all,
>
> As this topic has been quiet for quite some time now, I'm pinging
> this thread to see whether we can go ahead with the current
> proposal as-is, or whether and how more discussion is needed.
>
> A quick pointer to the proposed LangRef changes for the `convergent`
> attribute is here: https://reviews.llvm.org/D85603
>
> The presentation from the dev meeting three weeks ago is here:
> https://www.youtube.com/watch?v=_Z5DuiVCFAw
>
> Thanks,
> Nicolai
>
>
> On Tue, Aug 18, 2020 at 10:44 AM Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> >
> > Hi Hal,
> >
> > On Mon, Aug 17, 2020 at 8:30 PM Hal Finkel <hfinkel at anl.gov> wrote:
> > > I agree that the new scheme seems better for several reasons. So it
> > > seems like your recommendation is that we consider the existing
> > > attribute usage deprecated?
> >
> > Yes.
> >
> > Some form of auto-upgrade could be attempted via the
> > ConvergenceControlHeuristic of https://reviews.llvm.org/D85609,
> > though that's slightly big for on-the-fly upgrading in the bitcode
> > reader. So while the groundwork for potentially doing it is there, I
> > don't see a strong need or desire for it at the moment.
> >
> > Cheers,
> > Nicolai
> >
> >
> > > Or should we auto-upgrade it to something?
> > >
> > > -Hal
> > >
> > >
> > > >
> > > >
> > > >> -Hal
> > > >>
> > > >>
> > > >>>
> > > >>>> -Hal
> > > >>>>
> > > >>>> On 8/9/20 10:03 AM, Nicolai Hähnle via llvm-dev wrote:
> > > >>>>> Hi all,
> > > >>>>>
> > > >>>>> please see https://reviews.llvm.org/D85603 and its related
> > > >>>>> changes for our most recent and hopefully final attempt at
> > > >>>>> putting the `convergent` attribute on a solid theoretical
> > > >>>>> foundation in a way that is useful for modern GPU compiler
> > > >>>>> use cases. We have a clear line of sight to enabling a new
> > > >>>>> control flow implementation in the AMDGPU backend which is
> > > >>>>> built on this foundation. I have started upstreaming some
> > > >>>>> groundwork recently. We expect this foundation to also be
> > > >>>>> usable by non-GPU whole-program vectorization environments,
> > > >>>>> if they choose to do so.
> > > >>>>>
> > > >>>>> What follows is a (not quite) brief overview of what this is
> > > >>>>> all about, but do see the linked review and its "stack" for a
> > > >>>>> fuller story.
> > > >>>>>
> > > >>>>>
> > > >>>>> The Problem
> > > >>>>> -----------
> > > >>>>> `convergent` was originally introduced in LLVM to label
> > > >>>>> "barrier" instructions in GPU programming languages. Those
> > > >>>>> put some unusual constraints on program transforms that
> > > >>>>> weren't particularly well understood at the time. Since then,
> > > >>>>> GPU languages have introduced additional instructions such as
> > > >>>>> wave shuffles or subgroup reductions that made the problem
> > > >>>>> harder and exposed some shortcomings of the current
> > > >>>>> definition of `convergent`. At the same time, those
> > > >>>>> additional instructions helped illuminate what the problem
> > > >>>>> really is.
> > > >>>>>
> > > >>>>> Briefly, in whole-program vectorization environments such as
> > > >>>>> GPUs, we can expose opportunities for performance
> > > >>>>> optimizations to developers by allowing data exchange between
> > > >>>>> threads that are grouped together as lanes in a vector. As
> > > >>>>> long as control flow is uniform, i.e. all threads of the
> > > >>>>> vector follow the same path through the CFG, it is natural
> > > >>>>> that all threads assigned to the same vector participate in
> > > >>>>> this data exchange. When control flow diverges, some threads
> > > >>>>> may be unavailable for the data exchange, but we still need
> > > >>>>> to allow data exchange among the threads that _are_
> > > >>>>> available.
> > > >>>>>
> > > >>>>> This means that those data exchange operations -- i.e.,
> > > >>>>> `convergent` operations -- communicate among a set of threads
> > > >>>>> that implicitly depends on the position of the operation in
> > > >>>>> control flow.
> > > >>>>>
> > > >>>>> The problem boils down to defining how the set of
> > > >>>>> communicating threads is determined.
> > > >>>>>
> > > >>>>> This problem inherently has a large target- and
> > > >>>>> environment-specific component. Even if control flow is fully
> > > >>>>> uniform, there is the question of which threads are grouped
> > > >>>>> together in the first place. In the same way that LLVM IR has
> > > >>>>> no built-in understanding of how threads come to be in the
> > > >>>>> first place (even in the more well-known multi-threaded CPU
> > > >>>>> programming world where convergent operations don't exist),
> > > >>>>> we don't concern ourselves at all with these types of
> > > >>>>> questions here. For the purpose of target-independent LLVM IR
> > > >>>>> semantics, we simply assume that they are answered _somehow_
> > > >>>>> and instead focus on the part of the problem that relates to
> > > >>>>> control flow.
> > > >>>>>
> > > >>>>> When looking at high-level language source, it is often
> > > >>>>> "intuitively obvious" which threads should communicate. This
> > > >>>>> is not the case in an unstructured CFG. Here's an example of
> > > >>>>> a non-trivial loop where a reduction is used (the result of
> > > >>>>> the reduction is the sum of the input values of all
> > > >>>>> participating threads):
> > > >>>>>
> > > >>>>> A:
> > > >>>>> br label %B
> > > >>>>>
> > > >>>>> B:
> > > >>>>> ...
> > > >>>>> %sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
> > > >>>>> ...
> > > >>>>> br i1 %cc1, label %B, label %C
> > > >>>>>
> > > >>>>> C:
> > > >>>>> ...
> > > >>>>> br i1 %cc2, label %B, label %D
> > > >>>>>
> > > >>>>> D:
> > > >>>>> ; loop exit
> > > >>>>>
> > > >>>>> Suppose this code is executed by two threads grouped in a
> > > >>>>> (very short) vector, and the threads execute the following
> > > >>>>> sequences of basic blocks:
> > > >>>>>
> > > >>>>>> Thread 1: A B B C D
> > > >>>>>> Thread 2: A B C B C D
> > > >>>>> There are two different intuitive ways of "aligning" the
> > > >>>>> threads for the purpose of communication between convergent
> > > >>>>> operations:
> > > >>>>>
> > > >>>>>> Align based on natural loops:
> > > >>>>>> Thread 1: A B - B C D
> > > >>>>>> Thread 2: A B C B C D
> > > >>>>>> Align based on a nested loop interpretation:
> > > >>>>>> Thread 1: A B B C - - D
> > > >>>>>> Thread 2: A B - C B C D
> > > >>>>> (Explanation of the alignment: In the first alignment,
> > > >>>>> %sum.b is always the sum of the input values from both
> > > >>>>> threads. In the second alignment, the first computation of
> > > >>>>> %sum.b is collaborative between the two threads; the second
> > > >>>>> one in each thread simply returns the input value for that
> > > >>>>> thread.)
> > > >>>>>
> > > >>>>> The example only has a single natural loop, but it could
> > > >>>>> result from a high-level language source that has two nested
> > > >>>>> do-while loops, so the second interpretation may well be what
> > > >>>>> the programmer intended.
> > > >>>>>
> > > >>>>> It has often been proposed that the alignment should be
> > > >>>>> defined purely based on the IR shown above, such as by always
> > > >>>>> aligning based on natural loops. This doesn't work, even if
> > > >>>>> we ignore the possibility of irreducible control flow. In the
> > > >>>>> example, we could in principle generate IR as follows if the
> > > >>>>> "nested loop alignment" is desired:
> > > >>>>>
> > > >>>>> A:
> > > >>>>> br label %outer.hdr
> > > >>>>>
> > > >>>>> outer.hdr:
> > > >>>>> br label %B
> > > >>>>>
> > > >>>>> B:
> > > >>>>> ...
> > > >>>>> %sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
> > > >>>>> ...
> > > >>>>> br i1 %cc1, label %B, label %C
> > > >>>>>
> > > >>>>> C:
> > > >>>>> ...
> > > >>>>> br i1 %cc2, label %outer.hdr, label %D
> > > >>>>>
> > > >>>>> D:
> > > >>>>> ; loop exit
> > > >>>>>
> > > >>>>> From a single-threaded perspective, it would be correct to
> > > >>>>> short-cut all paths through outer.hdr, but this results in
> > > >>>>> the original IR and therefore different behavior for the
> > > >>>>> convergent operation (under such a proposal). Worse, the part
> > > >>>>> of the IR which makes the short-cutting incorrect -- i.e.,
> > > >>>>> the affected convergent operation -- is potentially far away
> > > >>>>> from the paths that we're short-cutting. We want to avoid
> > > >>>>> this "spooky action at a distance" because it puts an undue
> > > >>>>> burden on generic code transforms.
> > > >>>>>
> > > >>>>>
> > > >>>>> The Solution
> > > >>>>> ------------
> > > >>>>> We introduce explicit annotations in the IR using
> > > >>>>> "convergence tokens" that "anchor" a convergent operation
> > > >>>>> with respect to some other point in control flow or with
> > > >>>>> respect to the function entry. The details are in the linked
> > > >>>>> Phabricator review, so I will simply illustrate here how this
> > > >>>>> solves the example above.
> > > >>>>>
> > > >>>>> To obtain the original natural loop alignment, we augment
> > > >>>>> the example as follows:
> > > >>>>>
> > > >>>>> A:
> > > >>>>> %anchor = call token @llvm.experimental.convergence.anchor()
> > > >>>>> br label %B
> > > >>>>>
> > > >>>>> B:
> > > >>>>> %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
> > > >>>>> ...
> > > >>>>> %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %loop) ]
> > > >>>>> ...
> > > >>>>> br i1 %cc1, label %B, label %C
> > > >>>>>
> > > >>>>> C:
> > > >>>>> ...
> > > >>>>> br i1 %cc2, label %B, label %D
> > > >>>>>
> > > >>>>> D:
> > > >>>>> ; loop exit
> > > >>>>>
> > > >>>>> To obtain the loop nest alignment, we augment it as follows:
> > > >>>>>
> > > >>>>> A:
> > > >>>>> %anchor = call token @llvm.experimental.convergence.anchor()
> > > >>>>> br label %outer.hdr
> > > >>>>>
> > > >>>>> outer.hdr:
> > > >>>>> %outer = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
> > > >>>>> br label %B
> > > >>>>>
> > > >>>>> B:
> > > >>>>> %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
> > > >>>>> ...
> > > >>>>> %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %inner) ]
> > > >>>>> ...
> > > >>>>> br i1 %cc1, label %B, label %C
> > > >>>>>
> > > >>>>> C:
> > > >>>>> ...
> > > >>>>> br i1 %cc2, label %outer.hdr, label %D
> > > >>>>>
> > > >>>>> D:
> > > >>>>> ; loop exit
> > > >>>>>
> > > >>>>> We end up with two nested natural loops as before, but
> > > >>>>> short-cutting through outer.hdr is easily seen as forbidden
> > > >>>>> because of the "loop heart" intrinsic along the path through
> > > >>>>> outer.hdr. (We call it a "heart" because intuitively that's
> > > >>>>> where loop iterations are counted for the purpose of
> > > >>>>> convergent operations, i.e. we count "heartbeats" of the
> > > >>>>> loop.) Having the non-removable outer.hdr block may seem like
> > > >>>>> an inefficiency, but at least in the AMDGPU backend we end up
> > > >>>>> needing that basic block anyway to implement the desired
> > > >>>>> behavior. We don't literally count loop iterations in the
> > > >>>>> AMDGPU backend, but we end up doing something isomorphic that
> > > >>>>> also requires the block. We suspect that other
> > > >>>>> implementations will have similar requirements. In any case,
> > > >>>>> backends are free to optimize the block away using
> > > >>>>> target-specific handling.
> > > >>>>>
> > > >>>>>
> > > >>>>> Have you tried other solutions?
> > > >>>>> -------------------------------
> > > >>>>> Yes, we've tried many things over the years. I've personally
> > > >>>>> been thinking about and working on this problem on and off
> > > >>>>> for a significant amount of time over the last four years. I
> > > >>>>> supervised a practical master's thesis on a related topic
> > > >>>>> during this time. Other people have thought about the problem
> > > >>>>> a lot as well. Discussions with many people in the LLVM
> > > >>>>> community, inside AMD, and elsewhere have all helped our
> > > >>>>> understanding of the problem. Many different ideas were
> > > >>>>> considered and ultimately rejected in the process.
> > > >>>>>
> > > >>>>> All attempts to solve the problem _without_ some form of
> > > >>>>> "convergence token" have suffered from spooky action at a
> > > >>>>> distance. This includes the current definition of
> > > >>>>> `convergent` in the LangRef.
> > > >>>>>
> > > >>>>> We tried _very_ hard to find a workable definition of
> > > >>>>> `convergent` that doesn't talk about groups of threads in the
> > > >>>>> LangRef, but ultimately came to the conclusion that thinking
> > > >>>>> about convergent operations inherently involves threads of
> > > >>>>> execution, and it's much easier to just accept that. As I've
> > > >>>>> already written above, convergent operations really are a
> > > >>>>> form of (often non- or inaccessible-memory) communication
> > > >>>>> among threads where the set of communicating threads is
> > > >>>>> affected by the static position of the operation in control
> > > >>>>> flow.
> > > >>>>>
> > > >>>>> The proposal contains no lock-step execution, even though
> > > >>>>> that's what many people think of when they think about GPU
> > > >>>>> programming models. It is purely expressed in terms of
> > > >>>>> communicating threads, which could in theory just be threads
> > > >>>>> on a CPU. We believe that this is more in line with how LLVM
> > > >>>>> works. Besides, we found lock-step execution semantics to be
> > > >>>>> undesirable from first principles, because they prevent
> > > >>>>> certain desirable program transforms or at least make them
> > > >>>>> much more difficult to achieve.
> > > >>>>>
> > > >>>>> The most recent earlier proposal we brought to the LLVM
> > > >>>>> community is here: https://reviews.llvm.org/D68994. That one
> > > >>>>> already used a form of convergence token, but we ultimately
> > > >>>>> rejected it for two main reasons. First, the convergence
> > > >>>>> control intrinsics described there create more "noise" in the
> > > >>>>> IR, i.e. there are more intrinsic calls than with what we
> > > >>>>> have now. Second, the particular formalism of dynamic
> > > >>>>> instances used there had some undesirable side effects that
> > > >>>>> put an undue burden on transforms that should be trivial,
> > > >>>>> like dead-code elimination.
> > > >>>>>
> > > >>>>> The current proposal still talks about dynamic instances, but
> > > >>>>> there is no longer any automatic propagation of convergence:
> > > >>>>> even if two threads execute the same dynamic instance of an
> > > >>>>> instruction, whether they execute the same dynamic instance
> > > >>>>> of the very next instruction in straight-line code is
> > > >>>>> implementation-defined in general (which means that program
> > > >>>>> transforms are free to make changes that could affect this).
> > > >>>>>
> > > >>>>> We chose to use an operand bundle for the convergence token
> > > >>>>> so that generic code can identify the relevant token value
> > > >>>>> easily. We considered adding the convergence token as a
> > > >>>>> regular function call argument, but this makes it impractical
> > > >>>>> for generic transforms to understand functions that take both
> > > >>>>> a convergence token and some other token.
> > > >>>>>
> > > >>>>> Cheers,
> > > >>>>> Nicolai
> > > >>>> --
> > > >>>> Hal Finkel
> > > >>>> Lead, Compiler Technology and Programming Languages
> > > >>>> Leadership Computing Facility
> > > >>>> Argonne National Laboratory
> > > >>>>
> > > >> --
> > > >> Hal Finkel
> > > >> Lead, Compiler Technology and Programming Languages
> > > >> Leadership Computing Facility
> > > >> Argonne National Laboratory
> > > >>
> > > >
> > > --
> > > Hal Finkel
> > > Lead, Compiler Technology and Programming Languages
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > >
> >
> >
> > --
> > Lerne, wie die Welt wirklich ist,
> > aber vergiss niemals, wie sie sein sollte.
>
>
>
> --
> Lerne, wie die Welt wirklich ist,
> aber vergiss niemals, wie sie sein sollte.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
--
Lerne, wie die Welt wirklich ist,
aber vergiss niemals, wie sie sein sollte.