[llvm-dev] [RFC] Introducing convergence control bundles and intrinsics
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Sun Aug 16 17:13:30 PDT 2020
Hi, Nicolai,
Thanks for sending this. What do you think that we should do with the
existing convergent attribute?
-Hal
On 8/9/20 10:03 AM, Nicolai Hähnle via llvm-dev wrote:
> Hi all,
>
> please see https://reviews.llvm.org/D85603 and its related changes for
> our most recent and hopefully final attempt at putting the
> `convergent` attribute on a solid theoretical foundation in a way that
> is useful for modern GPU compiler use cases. We have a clear line of
> sight to enabling a new control flow implementation in the AMDGPU
> backend that is built on this foundation. I have recently started
> upstreaming some of the groundwork. We expect this foundation to also
> be usable
> by non-GPU whole-program vectorization environments, if they choose to
> do so.
>
> What follows is a (not quite) brief overview of what this is all
> about, but do see the linked review and its "stack" for a fuller story.
>
>
> The Problem
> -----------
> `convergent` was originally introduced in LLVM to label "barrier"
> instructions in GPU programming languages. Those put some unusual
> constraints on program transforms, constraints that weren't
> particularly well understood at the time. Since then, GPU languages
> have introduced
> additional instructions such as wave shuffles or subgroup reductions
> that made the problem harder and exposed some shortcomings of the
> current definition of `convergent`. At the same time, those additional
> instructions helped illuminate what the problem really is.
>
> Briefly, in whole-program vectorization environments such as GPUs, we
> can expose opportunities for performance optimizations to developers
> by allowing data exchange between threads that are grouped together as
> lanes in a vector. As long as control flow is uniform, i.e. all
> threads of the vector follow the same path through the CFG, it is
> natural that all threads assigned to the same vector participate in
> this data exchange. When control flow diverges, some threads may be
> unavailable for the data exchange, but we still need to allow data
> exchange among the threads that _are_ available.
>
> This means that those data exchange operations -- i.e., `convergent`
> operations -- communicate among a set of threads that implicitly
> depends on the position of the operation in control flow.
>
> The problem boils down to defining how the set of communicating
> threads is determined.
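>
> As a minimal sketch of this dependence (reusing the hypothetical
> @subgroupAdd reduction from the example below): a convergent
> operation placed inside a divergent branch communicates only among
> the threads that take that branch.
>
> br i1 %cond, label %then, label %exit
>
> then:
> ; only the threads for which %cond is true participate here
> %s = call i32 @subgroupAdd(i32 %v) ; convergent
> br label %exit
>
> exit:
> ...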
>
> This problem inherently has a large target- and environment-specific
> component. Even if control flow is fully uniform, there is the
> question of which threads are grouped together in the first place. In
> the same way that LLVM IR has no built-in understanding of how
> threads come to be (even in the more familiar multi-threaded CPU
> programming world, where convergent operations don't exist), we don't
> concern ourselves with these questions here. For the purpose of
> target-independent LLVM IR
> semantics, we simply assume that they are answered _somehow_ and
> instead focus on the part of the problem that relates to control flow.
>
> When looking at high-level language source, it is often "intuitively
> obvious" which threads should communicate. This is not the case in an
> unstructured CFG. Here's an example of a non-trivial loop where a
> reduction is used (the result of the reduction is the sum of the input
> values of all participating threads):
>
> A:
> br label %B
>
> B:
> ...
> %sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
> ...
> br i1 %cc1, label %B, label %C
>
> C:
> ...
> br i1 %cc2, label %B, label %D
>
> D:
> ; loop exit
>
> Suppose this code is executed by two threads grouped in a (very short)
> vector, and the threads execute the following sequences of basic blocks:
>
> > Thread 1: A B B C D
> > Thread 2: A B C B C D
>
> There are two different intuitive ways of "aligning" the threads for
> the purpose of communication between convergent operations:
>
> > Align based on natural loops:
> > Thread 1: A B - B C D
> > Thread 2: A B C B C D
>
> > Align based on a nested loop interpretation:
> > Thread 1: A B B C - - D
> > Thread 2: A B - C B C D
>
> (Explanation of the alignment: In the first alignment, %sum.b is
> always the sum of the input values from both threads. In the second
> alignment, the first computation of %sum.b is collaborative between
> the two threads; the second one in each thread simply returns the
> input value for that thread.)
>
> The example only has a single natural loop, but it could result from a
> high-level language source that has two nested do-while loops, so the
> second interpretation may well be what the programmer intended.
>
> It has often been proposed that the alignment should be defined purely
> based on the IR shown above, such as by always aligning based on
> natural loops. This doesn't work, even if we ignore the possibility of
> irreducible control flow. In the example, we could in principle
> generate IR as follows if the "nested loop alignment" is desired:
>
> A:
> br label %outer.hdr
>
> outer.hdr:
> br label %B
>
> B:
> ...
> %sum.b = call i32 @subgroupAdd(i32 %v) ; convergent
> ...
> br i1 %cc1, label %B, label %C
>
> C:
> ...
> br i1 %cc2, label %outer.hdr, label %D
>
> D:
> ; loop exit
>
> From a single-threaded perspective, it would be correct to short-cut
> all paths through outer.hdr, but this results in the original IR and
> therefore different behavior for the convergent operation (under such
> a proposal). Worse, the part of the IR which makes the short-cutting
> incorrect -- i.e., the affected convergent operation -- is potentially
> far away from the paths that we're short-cutting. We want to avoid
> this "spooky action at a distance" because it puts an undue burden on
> generic code transforms.
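>
> For concreteness, the short-cut in question is the following rewrite
> of the branch in C (a sketch; the analogous rewrite applies to the
> unconditional branch in A), which is perfectly correct from a
> single-threaded perspective and leaves outer.hdr dead:
>
> ; before
> br i1 %cc2, label %outer.hdr, label %D
>
> ; after short-cutting through the empty block outer.hdr
> br i1 %cc2, label %B, label %D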
>
>
> The Solution
> ------------
> We introduce explicit annotations in the IR using "convergence tokens"
> that "anchor" a convergent operation with respect to some other point
> in control flow or with respect to the function entry. The details are
> in the linked Phabricator review, so I will simply illustrate here how
> this solves the example above.
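>
> For reference, the review defines three control intrinsics; only
> anchor and loop appear in the examples below, and the exact
> signatures may still evolve:
>
> ; anchor: convergence is anchored to an implementation-defined set
> ; of threads at this point in control flow
> %tok = call token @llvm.experimental.convergence.anchor()
>
> ; entry: convergence is anchored to the threads that entered the
> ; function together
> %tok = call token @llvm.experimental.convergence.entry()
>
> ; loop "heart": placed in a loop header, it aligns threads iteration
> ; by iteration relative to a token defined outside the loop
> %tok = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outside) ]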
>
> To obtain the original natural loop alignment, we augment the example
> as follows:
>
> A:
> %anchor = call token @llvm.experimental.convergence.anchor()
> br label %B
>
> B:
> %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
> ...
> %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %loop) ]
> ...
> br i1 %cc1, label %B, label %C
>
> C:
> ...
> br i1 %cc2, label %B, label %D
>
> D:
> ; loop exit
>
> To obtain the loop nest alignment, we augment it as follows:
>
> A:
> %anchor = call token @llvm.experimental.convergence.anchor()
> br label %outer.hdr
>
> outer.hdr:
> %outer = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
> br label %B
>
> B:
> %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
> ...
> %sum.b = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %inner) ]
> ...
> br i1 %cc1, label %B, label %C
>
> C:
> ...
> br i1 %cc2, label %outer.hdr, label %D
>
> D:
> ; loop exit
>
> We end up with two nested natural loops as before, but short-cutting
> through outer.hdr is now easily seen to be forbidden because of the
> "loop heart" intrinsic along the path through outer.hdr. (We call it a
> "heart" because intuitively that's where loop iterations are counted
> for the purpose of convergent operations, i.e. we count "heartbeats"
> of the loop.) Having the non-removable outer.hdr block may seem like
> an inefficiency, but at least in the AMDGPU backend we end up needing
> that basic block anyway to implement the desired behavior. We don't
> literally count loop iterations in the AMDGPU backend, but we end up
> doing something isomorphic that also requires the block. We suspect
> that other implementations will have similar requirements. In any
> case, backends are free to optimize the block away using
> target-specific handling.
>
>
> Have you tried other solutions?
> -------------------------------
> Yes, we've tried many things over the years. I've personally been
> thinking about and working on this problem on and off for a
> significant amount of time over the last four years, and during that
> time I supervised a master's thesis on a related topic. Other
> people have thought about the problem a lot as well. Discussions with
> many people in the LLVM community, inside AMD, and elsewhere have all
> helped our understanding of the problem. Many different ideas were
> considered and ultimately rejected in the process.
>
> All attempts to solve the problem _without_ some form of "convergence
> token" have suffered from spooky action at a distance. This includes
> the current definition of `convergent` in the LangRef.
>
> We tried _very_ hard to find a workable definition of `convergent`
> that doesn't talk about groups of threads in the LangRef, but
> ultimately came to the conclusion that thinking about convergent
> operations inherently involves threads of execution, and it's much
> easier to just accept that. As I've already written above, convergent
> operations really are a form of communication among threads (often
> accessing no memory, or only memory inaccessible to the program),
> where the set of communicating threads is affected by the static
> position of the operation in control flow.
>
> The proposal involves no lock-step execution, even though that's what
> many people think of when they think about GPU programming models. It
> is purely expressed in terms of communicating threads, which could in
> theory just be threads on a CPU. We believe that this is more in line
> with how LLVM works. Besides, we found lock-step execution semantics
> to be undesirable from first principles, because they prevent certain
> desirable program transforms or at least make them much more difficult
> to achieve.
>
> The most recent earlier proposal we brought to the LLVM community is
> here: https://reviews.llvm.org/D68994. That one already used a form of
> convergence token, but we ultimately rejected it for two main reasons.
> First, the convergence control intrinsics described there create more
> "noise" in the IR, i.e. there are more intrinsic calls than with what
> we have now. Second, the particular formalism of dynamic instances
> used there had some undesirable side effects that put an undue burden
> on transforms that should be trivial, such as dead-code elimination.
>
> The current proposal still talks about dynamic instances, but there is
> no longer any automatic propagation of convergence: even if two
> threads execute the same dynamic instance of an instruction, whether
> they execute the same dynamic instance of the very next instruction in
> straight-line code is implementation-defined in general (which means
> that program transforms are free to make changes that could affect this).
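>
> As a small sketch of the consequence, consider:
>
> %t = call token @llvm.experimental.convergence.anchor()
> %a = call i32 @subgroupAdd(i32 %v) [ "convergencectrl"(token %t) ]
> %b = call i32 @subgroupAdd(i32 %a) [ "convergencectrl"(token %t) ]
>
> The two reductions communicate among the same set of threads because
> both use the token %t, not because they happen to be adjacent in
> straight-line code, so transforms only need to preserve the token
> relation rather than the exact instruction sequence.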
>
> We chose to use an operand bundle for the convergence token so that
> generic code can identify the relevant token value easily. We
> considered adding the convergence token as a regular function call
> argument, but this makes it impractical for generic transforms to
> understand functions that take both a convergence token and some other
> token.
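>
> To illustrate with a hypothetical function @f that also takes an
> unrelated token argument:
>
> ; operand bundle: generic code can find the convergence token
> ; without knowing anything about @f
> %r1 = call i32 @f(i32 %v, token %other) [ "convergencectrl"(token %anchor) ]
>
> ; hypothetical alternative signature: as a plain argument, the
> ; convergence token would be indistinguishable from %other
> %r2 = call i32 @f2(i32 %v, token %other, token %anchor)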
>
> Cheers,
> Nicolai
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory