[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Nicolai Hähnle via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Aug 12 12:48:33 PDT 2020


nhaehnle added inline comments.


================
Comment at: llvm/docs/ConvergentOperations.rst:52-55
+The definitions in this document leave many details open, such as how groups of
+threads are formed in the first place. It focuses on the questions that are
+relevant for deciding the correctness of generic program transforms and
+convergence-related analyses such as divergence analysis.
----------------
sameerds wrote:
> I think I "get" it now, and it might be related to how this paragraph produces an expectation that is not actually intended. The entire time so far, I have been reading this document expecting a formal framework that completely captures convergence; something so complete that one can point at any place in the program and merely look at the convergence intrinsics to decide whether a transform is valid. But that is not the case. This document becomes a lot clearer if the intrinsics being introduced are only meant to augment control flow in the context of convergence, not replace it. These intrinsics are only meant to be introduced by the frontend to remove ambiguity about convergence. In particular:
> 
>   1. In the jump-threading example, the frontend inserts the convergence intrinsics to resolve the ambiguity in favour of maximal convergence.
>   2. In the loop-unroll example, the frontend disallows unrolling by inserting the anchor outside of the loop and using it inside.
>   3. In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case.
> 
> This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?

What ambiguities do you have in mind?

If you have a fully acyclic function, then the way you can think about it is: we determine "the" set of threads that execute the function at the entry. At every point in the function, the communication set is then the subset of threads that get to that point. It's easy to evaluate this if you just topologically sort the blocks and then evaluate them in that order.
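
To make that concrete, here is a minimal sketch of that evaluation (toy C++ with made-up data structures, not the LLVM API):

  #include <map>
  #include <set>
  #include <string>
  #include <vector>

  struct Block {
    std::string Name;
    std::vector<std::string> Succs; // outgoing CFG edges
  };

  // Which successor a given thread takes from B; modeled as a callback.
  using DecideFn = std::string (*)(int Thread, const Block &B);

  // Blocks must be given in topological order, entry first; the CFG must be
  // acyclic, so all predecessors of a block are processed before it.
  std::map<std::string, std::set<int>>
  communicationSets(const std::vector<Block> &TopoOrder,
                    const std::set<int> &EntryThreads, DecideFn Decide) {
    std::map<std::string, std::set<int>> Reaching;
    Reaching[TopoOrder.front().Name] = EntryThreads;
    for (const Block &B : TopoOrder)
      for (int T : Reaching[B.Name])
        if (!B.Succs.empty())
          Reaching[Decide(T, B)].insert(T); // T flows to exactly one successor
    return Reaching; // Reaching[B] is the communication set at block B
  }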


================
Comment at: llvm/docs/ConvergentOperations.rst:280
+We generally take the stance that reconvergence in acyclic control flow must
+be maximal. The compiler frontend could augment the original code as follows:
+
----------------
sameerds wrote:
> It was the optimizer that introduced the ambiguity ... should the optimizer be responsible for adding the necessary intrinsics that preserve the original convergence? 
No. The jump-threaded code could also come out of C(++) code with `goto`s, so this doesn't really work.
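
A hypothetical C sketch of that (not code from the patch):

  extern void convergent_op(void); /* stand-in for some convergent call */

  void example(int cond) {
    if (cond)
      goto dup;
    convergent_op(); /* one static copy of the communication... */
    return;
  dup:
    convergent_op(); /* ...and a second one, written by the programmer */
  }

The IR coming out of this has the same duplicated shape that jump threading produces, so only the frontend can know the intended convergence.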


================
Comment at: llvm/docs/ConvergentOperations.rst:551-553
+This intrinsic defines the *heart* of a loop, i.e. the place where an imaginary
+loop counter is incremented for the purpose of determining convergence
+semantics.
----------------
sameerds wrote:
> So the heart is not a property of the loop itself in LLVM IR. It is a place chosen by the frontend based on semantics external to LLVM IR, in a way that allows the frontend to express constraints about convergence in the loop.
Yes.


================
Comment at: llvm/docs/ConvergentOperations.rst:570
+
+  token @llvm.experimental.convergence.anchor() convergent readnone
+
----------------
sameerds wrote:
> Just like the loop intrinsic, this intrinsic occurs in a place chosen by the frontend based on semantics outside of LLVM IR, and used by the frontend to express constraints elsewhere in the IR.
I'd rephrase it slightly by saying that the place is chosen by the frontend in a way that preserves the semantics of the original language when lowering into LLVM IR. But I suspect that we're ultimately thinking of the same thing.


================
Comment at: llvm/docs/ConvergentOperations.rst:611-612
+        dynamic instance of the defining instruction, and
+     2. There is an *n* such that both threads execute U for the *n*'th time
+        with that same token operand value.
+
----------------
sameerds wrote:
> The older comments about this seem to have floated away. At risk of repeating the discussion, what is *n* capturing? Is it meant to relate copies of the call U created by unrolling the loop, for example?
It's really just a loop iteration counter. Every time a thread executes the `loop` intrinsic, it executes a new dynamic instance of it. You could think of this dynamic instance as being labeled by the iteration, and then whether a thread executes the same dynamic instance as another thread depends in part on whether they have the same loop iteration label.

Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the `loop` intrinsic. This means that if you have a natural loop where the `loop` intrinsic is not called in the header but in some other block that is conditional, the loop iterations will be counted in a way that seems funny (but this can actually be put to a potentially good use as I noted elsewhere).
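
For illustration, a sketch of such a conditional placement, in the same mixed pseudocode style as the document's examples (not code from the patch):

  %outer = call token @llvm.experimental.convergence.anchor()
  while (counter > 0) {
    if (condition) {
      ; A thread's virtual counter advances only when the thread actually
      ; reaches this call. Two threads that execute it for the n'th time
      ; share the label n -- and hence a dynamic instance -- even if that
      ; happens on different trips of the natural loop.
      %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
      call void @convergent.operation() [ "convergencectrl"(token %inner) ]
    }
    counter--;
  }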

Unrolling will actually not duplicate the `loop` intrinsic, but only keep the copy that corresponds to the first unrolled iteration.


================
Comment at: llvm/docs/ConvergentOperations.rst:653-656
+3. The *convergence region* of a convergence token T is the minimal region in
+   which T is live and used (i.e., the program points dominated by the
+   definition D of T from which a use of T can be reached without leaving the
+   region dominated by D).
----------------
sameerds wrote:
> This is not a rule; it's just a definition.
Fair enough. I'm going to split this up into rules about cycles and rules about convergence regions.


================
Comment at: llvm/docs/ConvergentOperations.rst:658-660
+4. If a convergence region contains a use of a convergence token, then it must
+   also contain its definition. (In other words, convergence regions must be
+   reasonably nested.)
----------------
sameerds wrote:
> Since a convergence region is defined for a token, this text needs to bring out the fact that two different tokens are being talked about at this point. Something like: "If the convergence region for token T1 contains a use of another token T2, then it must also contain the definition of T2."
It's needed from a formal point of view, but it does seem to trip people up, so I'm going to implement your suggestion :)
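
For concreteness, the kind of overlap the rule forbids looks roughly like this (an illustrative sketch, not taken from the patch):

  %a = call token @llvm.experimental.convergence.anchor()
  %b = call token @llvm.experimental.convergence.anchor()
  ; The convergence region of %b extends from its definition to its use in
  ; the last line, so it contains the use of %a below -- but not the
  ; definition of %a. The two regions overlap without nesting.
  call void @convergent.operation() [ "convergencectrl"(token %a) ]
  call void @convergent.operation() [ "convergencectrl"(token %b) ]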


================
Comment at: llvm/docs/ConvergentOperations.rst:749-754
+  %outer = call token @llvm.experimental.convergence.anchor()
+  while (counter > 0) {
+    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
+    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
+    counter--;
+  }
----------------
sameerds wrote:
> So unrolling is forbidden because it fails to preserve the set of threads that execute the same dynamic instance of loop() for n=0 and n=1?
Not sure what you mean by n=0 and n=1. The issue is that if some threads go through the remainder loop while others execute more iterations, then the set of threads is partitioned into those that take the remainder loop and those that don't.
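
To spell that out with a sketch (illustrative pseudocode, not from the patch), partially unrolling by 2 with a remainder would yield something like:

  %outer = call token @llvm.experimental.convergence.anchor()
  while (counter >= 2) {
    %main = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
    call void @convergent.operation() [ "convergencectrl"(token %main) ]
    call void @convergent.operation() [ "convergencectrl"(token %main) ]
    counter -= 2;
  }
  ; Only threads whose original counter was odd get here, so the convergent
  ; operation below communicates with a strict subset of the threads it
  ; would have communicated with in the original loop -- the partition that
  ; makes this transform invalid.
  if (counter > 0) {
    %rem = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
    call void @convergent.operation() [ "convergencectrl"(token %rem) ]
  }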


================
Comment at: llvm/docs/ConvergentOperations.rst:759
+if the loop counter is known to be a multiple, then unrolling is allowed,
+though care must be taken to correct the use of the loop intrinsic.
+For example, unrolling by 2:
----------------
sameerds wrote:
> Correcting the use of the loop intrinsic seems to be a delicate matter. There is a rule which talks about "two or more uses by loop()" inside a loop body, and this particular example seems to side-step exactly that by eliminating one call to loop().
Correct.

I did think about whether it was possible to eliminate that static rule, but it gets nasty really quickly, for example if you try to unroll loops with multiple exits. The way it's written, a modification to loop unrolling is required (D85605), but it's ultimately the less painful solution.
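
For reference, a corrected unroll-by-2 could look roughly like this (an illustrative sketch; the patch contains its own version of this example), with the loop intrinsic deliberately *not* duplicated:

  %outer = call token @llvm.experimental.convergence.anchor()
  while (counter > 0) {
    ; Only the copy of the loop intrinsic for the first unrolled iteration
    ; is kept, so each thread's virtual counter advances once per two
    ; convergent operations and the communicating sets match the original.
    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
    counter--;
    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
    counter--;
  }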


================
Comment at: llvm/docs/ConvergentOperations.rst:547-551
+  }
+
+The behavior is unchanged, since each of the static convergent operations only
+ever communicates with threads that have the same ``condition`` value.
+By contrast, hoisting the convergent operations themselves is forbidden.
----------------
sameerds wrote:
> nhaehnle wrote:
> > sameerds wrote:
> > > t-tye wrote:
> > > > So the convergent token is the set of threads, but any intervening conditional control flow may change which threads a nested convergent operation may be required to communicate with?
> > > > 
> > > > My understanding was that the tokens were intended to be explicit in denoting the involved threads, to avoid needing to crawl the LLVM IR to determine the control dependence, and to be explicit in preventing control-dependence changes. But these examples seem to contradict that understanding.
> > > > 
> > > > So when a convergent token is used in a dynamic instance of a static convergent operation, what set of threads is it mandating have to participate? Those defined by the dynamic instance of the static token definition that control dependence permits to execute?
> > > This is also the transform that CUDA (and potentially HIP) will disallow. Hoisting or sinking a conditional changes the set of threads executing each leg of the branch. In CUDA, the two programs have completely different meanings depending on whether the anchor is outside the branch or inside each leg. There seems to be an opportunity here to relate the notion of an anchor to language builtins that return the mask of currently executing threads.
> > CUDA is very different here: the builtins that take an explicit threadmask don't have an implicit dependence on control flow, so they shouldn't be modeled as convergent operations. They have other downsides, which is why we prefer to go down this path of convergent operations.
> Combined with my other comment about the introduction, I think the current formalism is compatible with CUDA. One can say that some convergent functions in CUDA have additional semantics about how different dynamic instances communicate with each other. That communication is outside the scope of this document, where the mask argument is used to relate the dynamic instances. The current framework seems to be sufficient to govern the effect of optimizations on the dynamic instances. For example, it is sufficient that a CUDA ballot is not hoisted/sunk across a condition; the ballot across the two branch legs is managed by the mask, which was created before the branch.
I don't understand what you're trying to get at here.

The semantics of modern CUDA builtins are fully captured by saying they're non-convergent, but they have a side effect. That side effect is communication with some set of other threads, but that set isn't affected by control flow; it's fully specified by an explicit argument. Because of this, there is no need to argue about dynamic instances.
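
For instance (a CUDA sketch for illustration; `__ballot_sync` takes the communicating set as its explicit mask argument):

  __global__ void kernel(int *out, int v) {
    bool cond = (threadIdx.x & 1) == 0;
    // Executed by the whole warp: every thread learns which threads will
    // take the branch, and that result becomes the mask.
    unsigned mask = __ballot_sync(0xffffffffu, cond);
    if (cond) {
      // Communicates exactly with the threads named in `mask`; the set is
      // fixed by the argument, not by where this call sits in the CFG.
      out[threadIdx.x] = (int)__ballot_sync(mask, v > 0);
    }
  }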

All program transforms that respect those constraints are then legal. There is no need to label these builtins as `convergent`. If you can think of a counter-example, I'd be curious to see it.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603


