[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Mon Aug 17 11:29:58 PDT 2020


On 8/17/20 12:52 PM, Nicolai Hähnle wrote:
> On Mon, Aug 17, 2020 at 7:14 PM Hal Finkel <hfinkel at anl.gov> wrote:
>> On 8/17/20 11:51 AM, Nicolai Hähnle wrote:
>>> Hi Hal,
>>>
>>> On Mon, Aug 17, 2020 at 2:13 AM Hal Finkel <hfinkel at anl.gov> wrote:
>>>> Thanks for sending this. What do you think that we should do with the
>>>> existing convergent attribute?
>>> My preference, which is implicitly expressed in the review, is to use
>>> `convergent` both for the new and the old thing. They are implicitly
>>> distinguished via the "convergencectrl" operand bundle.
>>>
>>> The main reason for going down that route is that `convergent` is
>>> really an odd duck in that it is the only attribute in LLVM that
>>> _adds_ new constraints on program transforms. All other attributes
>>> _remove_ constraints on program transforms by expressing constraints
>>> on what the function does (e.g. readnone, nosync, willreturn, ...).
>>> Matt tried to push for changing that, making the "convergent"
>>> restrictions the default and adding `noconvergent` on most things, but
>>> that effort got stuck. In the meantime, let's avoid adding yet another
>>> odd duck like this :)
>>>
>>> Cheers,
>>> Nicolai
>>
>> Understood. Does the existing attribute have a sound definition, just
>> not one that's useful for (all) GPU use cases? Or is the existing
>> attribute not really sound, so that all use cases eventually need to be
>> replaced?
> The existing definition is insufficient for some GPU use cases, but it
> also has a problem that I like to call spooky action at a distance:
> some program transforms may have to examine code that they would
> normally not look at in order to determine whether they are correct
> under the current definition of `convergent`. The really problematic
> one that I'm aware of in practice is jump threading -- its
> implementation is incorrect with respect to `convergent` -- but
> perhaps others are affected as well.
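>
> To illustrate the kind of hazard I mean, here's a minimal sketch (not
> actual JumpThreading output; @subgroupAdd stands in for any convergent
> operation, and %v and %cc are placeholders):
>
>     P1:
>       br label %bb
>
>     P2:
>       br label %bb
>
>     bb:
>       %sum = call i32 @subgroupAdd(i32 %v)  ; convergent
>       br i1 %cc, label %t, label %f
>
> If %cc is known to be true for threads arriving from P1, jump
> threading will happily give P1 its own copy of %bb that branches
> directly to %t. Threads coming from P1 and threads coming from P2 then
> execute distinct copies of the convergent call and no longer
> communicate with each other, which can change %sum.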
>
> Cheers,
> Nicolai


I agree that the new scheme is better for several reasons. So it
seems like your recommendation is that we consider the existing 
attribute usage deprecated? Or should we auto-upgrade it to something?

  -Hal


>
>
>>    -Hal
>>
>>
>>>
>>>>     -Hal
>>>>
>>>> On 8/9/20 10:03 AM, Nicolai Hähnle via llvm-dev wrote:
>>>>> Hi all,
>>>>>
>>>>> please see https://reviews.llvm.org/D85603 and its related changes for
>>>>> our most recent and hopefully final attempt at putting the
>>>>> `convergent` attribute on a solid theoretical foundation in a way that
>>>>> is useful for modern GPU compiler use cases. We have a clear line of
>>>>> sight to enabling a new control flow implementation in the AMDGPU
>>>>> backend, built on this foundation. I have recently started upstreaming
>>>>> some of the groundwork. We also expect this foundation to be usable by
>>>>> non-GPU whole-program vectorization environments, if they choose to
>>>>> adopt it.
>>>>>
>>>>> What follows is a (not quite) brief overview of what this is all
>>>>> about, but do see the linked review and its "stack" for a fuller story.
>>>>>
>>>>>
>>>>> The Problem
>>>>> -----------
>>>>> `convergent` was originally introduced in LLVM to label "barrier"
>>>>> instructions in GPU programming languages. Those put some unusual
>>>>> constraints on program transforms that weren't particularly well
>>>>> understood at the time. Since then, GPU languages have introduced
>>>>> additional instructions such as wave shuffles or subgroup reductions
>>>>> that made the problem harder and exposed some shortcomings of the
>>>>> current definition of `convergent`. At the same time, those additional
>>>>> instructions helped illuminate what the problem really is.
>>>>>
>>>>> Briefly, in whole-program vectorization environments such as GPUs, we
>>>>> can expose opportunities for performance optimizations to developers
>>>>> by allowing data exchange between threads that are grouped together as
>>>>> lanes in a vector. As long as control flow is uniform, i.e. all
>>>>> threads of the vector follow the same path through the CFG, it is
>>>>> natural that all threads assigned to the same vector participate in
>>>>> this data exchange. When control flow diverges, some threads may be
>>>>> unavailable for the data exchange, but we still need to allow data
>>>>> exchange among the threads that _are_ available.
>>>>>
>>>>> This means that those data exchange operations -- i.e., `convergent`
>>>>> operations -- communicate among a set of threads that implicitly
>>>>> depends on the position of the operation in control flow.
>>>>>
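>>>>> For example, in the following sketch (with @subgroupAdd standing in
>>>>> for any convergent reduction):
>>>>>
>>>>>     entry:
>>>>>       br i1 %cond, label %then, label %exit
>>>>>
>>>>>     then:
>>>>>       %sum = call i32 @subgroupAdd(i32 %v)  ; convergent
>>>>>       br label %exit
>>>>>
>>>>>     exit:
>>>>>       ...
>>>>>
>>>>> only the threads for which %cond is true participate in computing
>>>>> %sum; hoisting the call into %entry would change the set of
>>>>> communicating threads.
>>>>>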
>>>>> The problem boils down to defining how the set of communicating
>>>>> threads is determined.
>>>>>
>>>>> This problem inherently has a large target- and environment-specific
>>>>> component. Even if control flow is fully uniform, there is the
>>>>> question of which threads are grouped together in the first place. In
>>>>> the same way that LLVM IR has no built-in understanding of how threads
>>>>> come to be (even in the better-known multi-threaded CPU programming
>>>>> world, where convergent operations don't exist), we don't concern
>>>>> ourselves with these kinds of questions here. For the purpose of
>>>>> target-independent LLVM IR semantics, we simply assume that they are
>>>>> answered _somehow_, and instead focus on the part of the problem that
>>>>> relates to control flow.
>>>>>
>>>>> When looking at high-level language source, it is often "intuitively
>>>>> obvious" which threads should communicate. This is not the case in an
>>>>> unstructured CFG. Here's an example of a non-trivial loop where a
>>>>> reduction is used (the result of the reduction is the sum of the input
>>>>> values of all participating threads):
>>>>>
>>>>>     A:
>>>>>       br label %B
>>>>>
>>>>>     B:
>>>>>       ...
>>>>>       %sum.b = call i32 @subgroupAdd(i32 %v)  ; convergent
>>>>>       ...
>>>>>       br i1 %cc1, label %B, label %C
>>>>>
>>>>>     C:
>>>>>       ...
>>>>>       br i1 %cc2, label %B, label %D
>>>>>
>>>>>     D:
>>>>>       ; loop exit
>>>>>
>>>>> Suppose this code is executed by two threads grouped in a (very short)
>>>>> vector, and the threads execute the following sequences of basic blocks:
>>>>>
>>>>>>    Thread 1: A B B C D
>>>>>>    Thread 2: A B C B C D
>>>>>
>>>>> There are two different intuitive ways of "aligning" the threads for
>>>>> the purpose of communication between convergent operations:
>>>>>
>>>>>>    Align based on natural loops:
>>>>>>      Thread 1: A B - B C D
>>>>>>      Thread 2: A B C B C D
>>>>>>    Align based on a nested loop interpretation:
>>>>>>      Thread 1: A B B C - - D
>>>>>>      Thread 2: A B - C B C D
>>>>>
>>>>> (Explanation of the alignment: In the first alignment, %sum.b is
>>>>> always the sum of the input values from both threads. In the second
>>>>> alignment, the first computation of %sum.b is collaborative between
>>>>> the two threads; the second one in each thread simply returns the
>>>>> input value for that thread.)
>>>>>
>>>>> The example only has a single natural loop, but it could result from a
>>>>> high-level language source that has two nested do-while loops, so the
>>>>> second interpretation may well be what the programmer intended.
>>>>>
>>>>> It has often been proposed that the alignment should be defined purely
>>>>> based on the IR shown above, such as by always aligning based on
>>>>> natural loops. This doesn't work, even if we ignore the possibility of
>>>>> irreducible control flow. In the example, we could in principle
>>>>> generate IR as follows if the "nested loop alignment" is desired:
>>>>>
>>>>>     A:
>>>>>       br label %outer.hdr
>>>>>
>>>>>     outer.hdr:
>>>>>       br label %B
>>>>>
>>>>>     B:
>>>>>       ...
>>>>>       %sum.b = call i32 @subgroupAdd(i32 %v)  ; convergent
>>>>>       ...
>>>>>       br i1 %cc1, label %B, label %C
>>>>>
>>>>>     C:
>>>>>       ...
>>>>>       br i1 %cc2, label %outer.hdr, label %D
>>>>>
>>>>>     D:
>>>>>       ; loop exit
>>>>>
>>>>> From a single-threaded perspective, it would be correct to short-cut
>>>>> all paths through outer.hdr, but this results in the original IR and
>>>>> therefore different behavior for the convergent operation (under such
>>>>> a proposal). Worse, the part of the IR which makes the short-cutting
>>>>> incorrect -- i.e., the affected convergent operation -- is potentially
>>>>> far away from the paths that we're short-cutting. We want to avoid
>>>>> this "spooky action at a distance" because it puts an undue burden on
>>>>> generic code transforms.
>>>>>
>>>>>
>>>>> The Solution
>>>>> ------------
>>>>> We introduce explicit annotations in the IR using "convergence tokens"
>>>>> that "anchor" a convergent operation with respect to some other point
>>>>> in control flow or with respect to the function entry. The details are
>>>>> in the linked Phabricator review, so I will simply illustrate here how
>>>>> this solves the example above.
>>>>>
>>>>> To obtain the original natural loop alignment, we augment the example
>>>>> as follows:
>>>>>
>>>>>     A:
>>>>>       %anchor = call token @llvm.experimental.convergence.anchor()
>>>>>       br label %B
>>>>>
>>>>>     B:
>>>>>       %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
>>>>>       ...
>>>>>       %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %loop) ]
>>>>>       ...
>>>>>       br i1 %cc1, label %B, label %C
>>>>>
>>>>>     C:
>>>>>       ...
>>>>>       br i1 %cc2, label %B, label %D
>>>>>
>>>>>     D:
>>>>>       ; loop exit
>>>>>
>>>>> To obtain the loop nest alignment, we augment it as follows:
>>>>>
>>>>>     A:
>>>>>       %anchor = call token @llvm.experimental.convergence.anchor()
>>>>>       br label %outer.hdr
>>>>>
>>>>>     outer.hdr:
>>>>>       %outer = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
>>>>>       br label %B
>>>>>
>>>>>     B:
>>>>>       %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
>>>>>       ...
>>>>>       %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %inner) ]
>>>>>       ...
>>>>>       br i1 %cc1, label %B, label %C
>>>>>
>>>>>     C:
>>>>>       ...
>>>>>       br i1 %cc2, label %outer.hdr, label %D
>>>>>
>>>>>     D:
>>>>>       ; loop exit
>>>>>
>>>>> We end up with two nested natural loops as before, but short-cutting
>>>>> through outer.hdr is easily seen as forbidden because of the "loop
>>>>> heart" intrinsic along the path through outer.hdr. (We call it a
>>>>> "heart" because intuitively that's where loop iterations are counted
>>>>> for the purpose of convergent operations, i.e. we count "heartbeats"
>>>>> of the loop.) Having the non-removable outer.hdr block may seem like
>>>>> an inefficiency, but at least in the AMDGPU backend we end up needing
>>>>> that basic block anyway to implement the desired behavior. We don't
>>>>> literally count loop iterations in the AMDGPU backend, but we end up
>>>>> doing something isomorphic that also requires the block. We suspect
>>>>> that other implementations will have similar requirements. In any
>>>>> case, backends are free to optimize the block away using
>>>>> target-specific handling.
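>>>>>
>>>>> (For completeness: besides the anchor and loop intrinsics used above,
>>>>> the proposal also includes @llvm.experimental.convergence.entry(),
>>>>> which ties the convergent operations inside a convergent function to
>>>>> the convergence token at its call sites. A minimal sketch, with a
>>>>> hypothetical wrapper function -- see the review for the precise rules:
>>>>>
>>>>>     define i32 @wrappedAdd(i32 %v) convergent {
>>>>>       %entry = call token @llvm.experimental.convergence.entry()
>>>>>       %sum = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %entry) ]
>>>>>       ret i32 %sum
>>>>>     }
>>>>>
>>>>> When such a function is inlined, uses of %entry can simply be replaced
>>>>> by the token operand of the inlined call site.)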
>>>>>
>>>>>
>>>>> Have you tried other solutions?
>>>>> -------------------------------
>>>>> Yes, we've tried many things over the years. I've personally been
>>>>> thinking about and working on this problem on and off for a
>>>>> significant amount of time over the last four years. I supervised a
>>>>> practical master's thesis on a related topic during this time. Other
>>>>> people have thought about the problem a lot as well. Discussions with
>>>>> many people in the LLVM community, inside AMD, and elsewhere have all
>>>>> helped our understanding of the problem. Many different ideas were
>>>>> considered and ultimately rejected in the process.
>>>>>
>>>>> All attempts to solve the problem _without_ some form of "convergence
>>>>> token" have suffered from spooky action at a distance. This includes
>>>>> the current definition of `convergent` in the LangRef.
>>>>>
>>>>> We tried _very_ hard to find a workable definition of `convergent`
>>>>> that doesn't talk about groups of threads in the LangRef, but
>>>>> ultimately came to the conclusion that thinking about convergent
>>>>> operations inherently involves threads of execution, and it's much
>>>>> easier to just accept that. As I've already written above, convergent
>>>>> operations really are a form of (often non- or inaccessible-memory)
>>>>> communication among threads where the set of communicating threads is
>>>>> affected by the static position of the operation in control flow.
>>>>>
>>>>> The proposal contains no lock-step execution, even though that's what
>>>>> many people think of when they think about GPU programming models. It
>>>>> is purely expressed in terms of communicating threads, which could in
>>>>> theory just be threads on a CPU. We believe that this is more in line
>>>>> with how LLVM works. Besides, we found lock-step execution semantics
>>>>> to be undesirable from first principles, because they prevent certain
>>>>> desirable program transforms or at least make them much more difficult
>>>>> to achieve.
>>>>>
>>>>> The most recent previous proposal we brought to the LLVM community is
>>>>> here: https://reviews.llvm.org/D68994. That one already used a form of
>>>>> convergence token, but we ultimately rejected it for two main reasons.
>>>>> First, the convergence control intrinsics described there create more
>>>>> "noise" in the IR, i.e. there are more intrinsic calls than with what
>>>>> we have now. Second, the particular formalism of dynamic instances
>>>>> used there had some undesirable side effects that put an undue burden
>>>>> on transforms that should be trivial, like dead-code elimination.
>>>>>
>>>>> The current proposal still talks about dynamic instances, but there is
>>>>> no longer any automatic propagation of convergence: even if two
>>>>> threads execute the same dynamic instance of an instruction, whether
>>>>> they execute the same dynamic instance of the very next instruction in
>>>>> straight-line code is implementation-defined in general (which means
>>>>> that program transforms are free to make changes that could affect this).
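>>>>>
>>>>> For example, in the following sketch, the two convergent calls
>>>>> communicate among the same set of threads because they refer to the
>>>>> same token %t; without the tokens, nothing would relate the second
>>>>> call to the first:
>>>>>
>>>>>       %t = call token @llvm.experimental.convergence.anchor()
>>>>>       %a = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %t) ]
>>>>>       %b = call i32 @subgroupAdd(i32 %w) [ "convergencectrl"(token %t) ]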
>>>>>
>>>>> We chose to use an operand bundle for the convergence token so that
>>>>> generic code can identify the relevant token value easily. We
>>>>> considered adding the convergence token as a regular function call
>>>>> argument, but this makes it impractical for generic transforms to
>>>>> understand functions that take both a convergence token and some other
>>>>> token.
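>>>>>
>>>>> Concretely, with the bundle, a transform can find the convergence
>>>>> token without knowing anything about the callee's signature:
>>>>>
>>>>>       %x = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %t) ]
>>>>>
>>>>> whereas with a hypothetical variant that takes the token as a plain
>>>>> argument, e.g.
>>>>>
>>>>>       %y = call i32 @subgroupAdd2(token %t, token %other, i32 %v)
>>>>>
>>>>> there would be no generic way to tell which token operand, if any, is
>>>>> the convergence token.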
>>>>>
>>>>> Cheers,
>>>>> Nicolai
>>>> --
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


