[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Mon Aug 17 10:14:19 PDT 2020


On 8/17/20 11:51 AM, Nicolai Hähnle wrote:
> Hi Hal,
>
> On Mon, Aug 17, 2020 at 2:13 AM Hal Finkel <hfinkel at anl.gov> wrote:
>> Thanks for sending this. What do you think that we should do with the
>> existing convergent attribute?
> My preference, which is implicitly expressed in the review, is to use
> `convergent` both for the new and the old thing. They are implicitly
> distinguished via the "convergencectrl" operand bundle.
>
> The main reason for going down that route is that `convergent` is
> really an odd duck in that it is the only attribute in LLVM that
> _adds_ new constraints on program transforms. All other attributes
> _remove_ constraints on program transforms by expressing constraints
> on what the function does (e.g. readnone, nosync, willreturn, ...).
> Matt tried to push for changing that, making the "convergent"
> restrictions the default and adding `noconvergent` on most things, but
> that effort got stuck. In the meantime, let's avoid adding yet another
> odd duck like this :)
>
> Cheers,
> Nicolai


Understood. Does the existing attribute have a sound definition, just 
not one that's useful for (all) GPU use cases? Or is the existing 
attribute not really sound, so that all use cases will eventually need to be replaced?

  -Hal


>
>
>>    -Hal
>>
>> On 8/9/20 10:03 AM, Nicolai Hähnle via llvm-dev wrote:
>>> Hi all,
>>>
>>> please see https://reviews.llvm.org/D85603 and its related changes for
>>> our most recent and hopefully final attempt at putting the
>>> `convergent` attribute on a solid theoretical foundation in a way that
>>> is useful for modern GPU compiler use cases. We have a clear line of
>>> sight to enabling a new control flow implementation in the AMDGPU
>>> backend which is built on this foundation. I have recently started
>>> upstreaming some of the groundwork. We expect this foundation to also be usable
>>> by non-GPU whole-program vectorization environments, if they choose to
>>> do so.
>>>
>>> What follows is a (not quite) brief overview of what this is all
>>> about, but do see the linked review and its "stack" for a fuller story.
>>>
>>>
>>> The Problem
>>> -----------
>>> `convergent` was originally introduced in LLVM to label "barrier"
>>> instructions in GPU programming languages. Those put some unusual
>>> constraints on program transforms that weren't particularly well
>>> understood at the time. Since then, GPU languages have introduced
>>> additional instructions such as wave shuffles or subgroup reductions
>>> that made the problem harder and exposed some shortcomings of the
>>> current definition of `convergent`. At the same time, those additional
>>> instructions helped illuminate what the problem really is.
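>>>
>>> In IR terms, such operations are simply calls to functions carrying
>>> the attribute; for instance (illustrative declarations only, not
>>> actual intrinsic names):
>>>
>>>    declare void @barrier() convergent
>>>    declare i32 @waveShuffle(i32, i32) convergent
>>>    declare i32 @subgroupAdd(i32) convergent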
>>>
>>> Briefly, in whole-program vectorization environments such as GPUs, we
>>> can expose opportunities for performance optimizations to developers
>>> by allowing data exchange between threads that are grouped together as
>>> lanes in a vector. As long as control flow is uniform, i.e. all
>>> threads of the vector follow the same path through the CFG, it is
>>> natural that all threads assigned to the same vector participate in
>>> this data exchange. When control flow diverges, some threads may be
>>> unavailable for the data exchange, but we still need to allow data
>>> exchange among the threads that _are_ available.
>>>
>>> This means that those data exchange operations -- i.e., `convergent`
>>> operations -- communicate among a set of threads that implicitly
>>> depends on the position of the operation in control flow.
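>>>
>>> As a rough illustration (reusing the @subgroupAdd reduction that
>>> appears in the example further below), divergent control flow shrinks
>>> the set of threads that take part in such an operation:
>>>
>>>    entry:
>>>      br i1 %cond, label %then, label %exit
>>>
>>>    then:
>>>      ; only the threads for which %cond is true reach this call, so
>>>      ; only they contribute to and receive the reduction result
>>>      %sum = call i32 @subgroupAdd(i32 %v)  ; convergent
>>>      br label %exit
>>>
>>>    exit:
>>>      ...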
>>>
>>> The problem boils down to defining how the set of communicating
>>> threads is determined.
>>>
>>> This problem inherently has a large target- and environment-specific
>>> component. Even if control flow is fully uniform, there is the
>>> question of which threads are grouped together in the first place. In
>>> the same way that LLVM IR has no built-in understanding of how threads
>>> come to be in the first place (even in the more well-known
>>> multi-threaded CPU programming world where convergent operations don't
>>> exist), we don't concern ourselves at all with these types of
>>> questions here. For the purpose of target-independent LLVM IR
>>> semantics, we simply assume that they are answered _somehow_ and
>>> instead focus on the part of the problem that relates to control flow.
>>>
>>> When looking at high-level language source, it is often "intuitively
>>> obvious" which threads should communicate. This is not the case in an
>>> unstructured CFG. Here's an example of a non-trivial loop where a
>>> reduction is used (the result of the reduction is the sum of the input
>>> values of all participating threads):
>>>
>>>    A:
>>>      br label %B
>>>
>>>    B:
>>>      ...
>>>      %sum.b = call i32 @subgroupAdd(i32 %v)  ; convergent
>>>      ...
>>>      br i1 %cc1, label %B, label %C
>>>
>>>    C:
>>>      ...
>>>      br i1 %cc2, label %B, label %D
>>>
>>>    D:
>>>      ; loop exit
>>>
>>> Suppose this code is executed by two threads grouped in a (very short)
>>> vector, and the threads execute the following sequences of basic blocks:
>>>
>>>>   Thread 1: A B B C D
>>>>   Thread 2: A B C B C D
>>> There are two different intuitive ways of "aligning" the threads for
>>> the purpose of communication between convergent operations:
>>>
>>>>   Align based on natural loops:
>>>>     Thread 1: A B - B C D
>>>>     Thread 2: A B C B C D
>>>>   Align based on a nested loop interpretation:
>>>>     Thread 1: A B B C - - D
>>>>     Thread 2: A B - C B C D
>>> (Explanation of the alignment: In the first alignment, %sum.b is
>>> always the sum of the input values from both threads. In the second
>>> alignment, the first computation of %sum.b is collaborative between
>>> the two threads; the second one in each thread simply returns the
>>> input value for that thread.)
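>>>
>>> (For instance, if %v is 1 in thread 1 and 2 in thread 2, the first
>>> alignment yields %sum.b = 3 in both threads on both trips through B;
>>> in the second alignment, only the first %sum.b is 3, while the second
>>> %sum.b is 1 in thread 1 and 2 in thread 2.)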
>>>
>>> The example only has a single natural loop, but it could result from a
>>> high-level language source that has two nested do-while loops, so the
>>> second interpretation may well be what the programmer intended.
>>>
>>> It has often been proposed that the alignment should be defined purely
>>> based on the IR shown above, such as by always aligning based on
>>> natural loops. This doesn't work, even if we ignore the possibility of
>>> irreducible control flow. In the example, we could in principle
>>> generate IR as follows if the "nested loop alignment" is desired:
>>>
>>>    A:
>>>      br label %outer.hdr
>>>
>>>    outer.hdr:
>>>      br label %B
>>>
>>>    B:
>>>      ...
>>>      %sum.b = call i32 @subgroupAdd(i32 %v)  ; convergent
>>>      ...
>>>      br i1 %cc1, label %B, label %C
>>>
>>>    C:
>>>      ...
>>>      br i1 %cc2, label %outer.hdr, label %D
>>>
>>>    D:
>>>      ; loop exit
>>>
>>> From a single-threaded perspective, it would be correct to short-cut
>>> all paths through outer.hdr, but this results in the original IR and
>>> therefore different behavior for the convergent operation (under such
>>> a proposal). Worse, the part of the IR which makes the short-cutting
>>> incorrect -- i.e., the affected convergent operation -- is potentially
>>> far away from the paths that we're short-cutting. We want to avoid
>>> this "spooky action at a distance" because it puts an undue burden on
>>> generic code transforms.
>>>
>>>
>>> The Solution
>>> ------------
>>> We introduce explicit annotations in the IR using "convergence tokens"
>>> that "anchor" a convergent operation with respect to some other point
>>> in control flow or with respect to the function entry. The details are
>>> in the linked Phabricator review, so I will simply illustrate here how
>>> this solves the example above.
>>>
>>> To obtain the original natural loop alignment, we augment the example
>>> as follows:
>>>
>>>    A:
>>>      %anchor = call token @llvm.experimental.convergence.anchor()
>>>      br label %B
>>>
>>>    B:
>>>      %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
>>>      ...
>>>      %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %loop) ]
>>>      ...
>>>      br i1 %cc1, label %B, label %C
>>>
>>>    C:
>>>      ...
>>>      br i1 %cc2, label %B, label %D
>>>
>>>    D:
>>>      ; loop exit
>>>
>>> To obtain the loop nest alignment, we augment it as follows:
>>>
>>>    A:
>>>      %anchor = call token @llvm.experimental.convergence.anchor()
>>>      br label %outer.hdr
>>>
>>>    outer.hdr:
>>>      %outer = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
>>>      br label %B
>>>
>>>    B:
>>>      %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
>>>      ...
>>>      %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %inner) ]
>>>      ...
>>>      br i1 %cc1, label %B, label %C
>>>
>>>    C:
>>>      ...
>>>      br i1 %cc2, label %outer.hdr, label %D
>>>
>>>    D:
>>>      ; loop exit
>>>
>>> We end up with two nested natural loops as before, but short-cutting
>>> through outer.hdr is easily seen as forbidden because of the "loop
>>> heart" intrinsic along the path through outer.hdr. (We call it a
>>> "heart" because intuitively that's where loop iterations are counted
>>> for the purpose of convergent operations, i.e. we count "heartbeats"
>>> of the loop.) Having the non-removable outer.hdr block may seem like
>>> an inefficiency, but at least in the AMDGPU backend we end up needing
>>> that basic block anyway to implement the desired behavior. We don't
>>> literally count loop iterations in the AMDGPU backend, but we end up
>>> doing something isomorphic that also requires the block. We suspect
>>> that other implementations will have similar requirements. In any
>>> case, backends are free to optimize the block away using
>>> target-specific handling.
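>>>
>>> (To make the forbidden transform concrete: the short-cut would amount
>>> to rewriting the branch in C so that it bypasses the loop heart in
>>> outer.hdr, e.g.
>>>
>>>    C:
>>>      ...
>>>      br i1 %cc2, label %B, label %D   ; forbidden: skips the heart
>>>
>>> which would collapse the IR back to the single-natural-loop form and
>>> change which threads communicate in %sum.b.)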
>>>
>>>
>>> Have you tried other solutions?
>>> -------------------------------
>>> Yes, we've tried many things over the years. I've personally been
>>> thinking about and working on this problem on and off for a
>>> significant amount of time over the last four years. I supervised a
>>> practical master's thesis on a related topic during this time. Other
>>> people have thought about the problem a lot as well. Discussions with
>>> many people in the LLVM community, inside AMD, and elsewhere have all
>>> helped our understanding of the problem. Many different ideas were
>>> considered and ultimately rejected in the process.
>>>
>>> All attempts to solve the problem _without_ some form of "convergence
>>> token" have suffered from spooky action at a distance. This includes
>>> the current definition of `convergent` in the LangRef.
>>>
>>> We tried _very_ hard to find a workable definition of `convergent`
>>> that doesn't talk about groups of threads in the LangRef, but
>>> ultimately came to the conclusion that thinking about convergent
>>> operations inherently involves threads of execution, and it's much
>>> easier to just accept that. As I've already written above, convergent
>>> operations really are a form of communication among threads (often
>>> modeled as touching no memory, or only inaccessible memory), where the
>>> set of communicating threads is
>>> affected by the static position of the operation in control flow.
>>>
>>> The proposal does not assume lock-step execution, even though that's what
>>> many people think of when they think about GPU programming models. It
>>> is purely expressed in terms of communicating threads, which could in
>>> theory just be threads on a CPU. We believe that this is more in line
>>> with how LLVM works. Besides, we found lock-step execution semantics
>>> to be undesirable from first principles, because they prevent certain
>>> desirable program transforms or at least make them much more difficult
>>> to achieve.
>>>
>>> The most recent earlier proposal we brought to the LLVM community is
>>> here: https://reviews.llvm.org/D68994. That one already used a form of
>>> convergence token, but we ultimately rejected it for two main reasons.
>>> First, the convergence control intrinsics described there create more
>>> "noise" in the IR, i.e. there are more intrinsic calls than with what
>>> we have now. Second, the particular formalism of dynamic instances
>>> used there had some undesirable side effects that put undue burden on
>>> transforms that should be trivial like dead-code elimination.
>>>
>>> The current proposal still talks about dynamic instances, but there is
>>> no longer any automatic propagation of convergence: even if two
>>> threads execute the same dynamic instance of an instruction, whether
>>> they execute the same dynamic instance of the very next instruction in
>>> straight-line code is implementation-defined in general (which means
>>> that program transforms are free to make changes that could affect this).
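>>>
>>> Instead, it is the token that ties convergent operations together. A
>>> small sketch (purely illustrative, building on the anchor intrinsic
>>> shown above):
>>>
>>>    %tok = call token @llvm.experimental.convergence.anchor()
>>>    %a = call i32 @subgroupAdd(i32 %x) [ "convergencectrl"(token %tok) ]
>>>    %b = call i32 @subgroupAdd(i32 %y) [ "convergencectrl"(token %tok) ]
>>>
>>> Both calls refer to the same token, so the sets of threads they
>>> communicate with are both derived from %tok rather than from any
>>> implicit propagation along the straight-line code.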
>>>
>>> We chose to use an operand bundle for the convergence token so that
>>> generic code can identify the relevant token value easily. We
>>> considered adding the convergence token as a regular function call
>>> argument, but this makes it impractical for generic transforms to
>>> understand functions that take both a convergence token and some other
>>> token.
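>>>
>>> To illustrate that last point (the function name here is
>>> hypothetical): with an operand bundle, the convergence token remains
>>> recognizable even when the callee also takes an unrelated token
>>> argument,
>>>
>>>    %r = call i32 @f(token %other, i32 %v) [ "convergencectrl"(token %anchor) ]
>>>
>>> whereas if the convergence token were passed as just another
>>> argument, say
>>>
>>>    %r = call i32 @f(token %anchor, token %other, i32 %v)
>>>
>>> a generic transform could not easily tell which of the two token
>>> operands carries the convergence semantics.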
>>>
>>> Cheers,
>>> Nicolai
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


