[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Nicolai Hähnle via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 11 08:08:02 PDT 2020


nhaehnle added inline comments.


================
Comment at: llvm/docs/ConvergentOperations.rst:280
+:ref:`Formal Rules <convergence_formal_rules>` section for details.
+
+
----------------
jdoerfert wrote:
> The "heart" and the increment step are fairly vague. Maybe talk about something tangible, e.g., the target of a backedge?
When it comes to defining rules that are applicable to completely general IR, the loop intrinsic call site feels *more* tangible than the notion of backedge. For example, backedges don't really work as a concept when you have irreducible control flow.

The loop intrinsic call site also really doesn't have to be in the header block of a natural loop -- it could be inside of an if-statement in the loop, for example, which has interesting consequences but can still be defined (and can actually be useful: someone pointed me at a recent paper by Damani et al. -- Speculative Reconvergence for Improved SIMT Efficiency -- which proposes a certain "unnatural" way of controlling convergence in some kinds of loops for performance; the same kind of effect can be achieved by placing the loop heart inside of an if-statement).
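
For illustration, a rough sketch of what that shape could look like (this is not from the patch; the function name, block labels, and conditions are invented, and the intrinsic usage follows the rules as I read them in the proposed doc):

```llvm
declare token @llvm.experimental.convergence.entry() convergent
declare token @llvm.experimental.convergence.loop() convergent
declare void @convergent.operation() convergent

define void @heart_in_if(i1 %cc, i1 %cc2) convergent {
entry:
  %outer = call token @llvm.experimental.convergence.entry()
  br label %loop

loop:
  br i1 %cc, label %then, label %latch

then:
  ; the loop "heart" sits inside an if-statement, not in the header block
  %heart = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
  call void @convergent.operation() [ "convergencectrl"(token %heart) ]
  br label %latch

latch:
  br i1 %cc2, label %loop, label %exit

exit:
  ret void
}
```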


================
Comment at: llvm/docs/ConvergentOperations.rst:464-470
+  while (counter >= 2) {
+    %tok1 = call token @llvm.experimental.convergence.anchor()
+    call void @convergent.operation() [ "convergencectrl"(token %tok1) ]
+    %tok2 = call token @llvm.experimental.convergence.anchor()
+    call void @convergent.operation() [ "convergencectrl"(token %tok2) ]
+    counter -= 2;
+  }
----------------
sameerds wrote:
> Which part of the formal semantics shows that this is a valid translation? The rule for the execution of dynamic instances seems useful only for specifying which threads execute the convergent operations. But what relates them to the original loop? Is it because the set of dynamic instances produced by the second version has a one-to-one mapping with the set of dynamic instances produced by the first version?
The first version doesn't have a unique set of dynamic instances in the first place, because `anchor` is by design implementation-defined.

So the possible universes of dynamic instances in the transformed/unrolled version only need to form a subset. In a sense, the loop unroll with remainder picks a subset by saying: from now on, if you have two threads with e.g. iteration counts 3 and 4, then they will **never** communicate during the 3rd iteration.

In the original program, they may or may not have communicated during the 3rd iteration -- up to the implementation, and in this case, the implementation decided to do a form of loop unrolling which implicitly ends up making a choice.
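
For reference, the original (pre-unroll) version of that loop, as it appears earlier in the proposed doc, is roughly (reconstructed here, with "call token" spelled out):

```llvm
  while (counter > 0) {
    %tok = call token @llvm.experimental.convergence.anchor()
    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
    counter -= 1;
  }
```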


================
Comment at: llvm/docs/ConvergentOperations.rst:516
+:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+intrinsic outside of the loop header uses a token defined outside of the loop
+can generally not be unrolled.
----------------
sameerds wrote:
> I think this intends to say "block in the loop body other than the loop header", but the wording chosen is a little difficult to parse on a first read.
Going to try an improvement :)


================
Comment at: llvm/docs/ConvergentOperations.rst:560
+ever communicates with threads that have the same ``condition`` value.
+By contrast, hoisting the convergent operations themselves is forbidden.
+
----------------
sameerds wrote:
> What forbids the convergent operations from being hoisted? Isn't that the whole point of this new framework? In particular, what would the total_gains/total_losses example look like with appropriate use of  convergence tokens?
I'm going to add that example.


================
Comment at: llvm/docs/ConvergentOperations.rst:202-203
+
+2. Executions of different static instructions always occur in different
+   dynamic instances.
+
----------------
simoll wrote:
> sameerds wrote:
> > simoll wrote:
> > > I suppose this only refers to convergent instructions but it isn't clear to me from the wording: Does this constraint apply to all IR instructions or only those that are convergent?
> > > (Only 4. explicitly mentions convergent operations)
> > I think the notion of dynamic instances applies to all instructions. Continuing with #3 below, it seems to me that different threads can execute the same dynamic instance of any instruction. It's just that this notion is not very interesting in the case of non-communicating instructions. The ones that communicate need to be marked convergent, so that the effect of transformations on them is limited.
> I'm more concerned about the implications this constraint may have for transformation like branch fusion.
> The memory model is pretty permissive and allows fusion of memory accesses regardless.
> @nhaehnle Do you care about non-memory side effects, like exceptions? Do these follow the same weak semantics as the memory model?
I'm not entirely sure what you mean by the question. There isn't supposed to be any interaction between exceptions and what's being described here. There aren't any relevant constraints expressed on the dynamic instances of non-convergent operations in the first place, and for convergent operations I'd think of them as happening in two steps: there's a cross-thread communication, and afterwards each thread individually decides whether it throws an exception in its context.

This can obviously take the exchanged data into account, to the point where you could model an operation as exchanging bits between threads to indicate whether an exception should be thrown in each thread -- so you could have an operation that throws an exception based on a value in another thread, as long as that other thread executes the same dynamic instance. Similarly, you could have UB in thread A based on an argument value in thread B as long as A and B execute the same dynamic instance.

I'm going to add an informational note to the end of this section that dynamic instances of non-convergent instructions don't matter.


================
Comment at: llvm/docs/ConvergentOperations.rst:212-214
+*Convergence tokens* are values of ``token`` type, i.e. they cannot be used in
+``phi`` or ``select`` instructions. A convergence token value represents the
+dynamic instance of the instruction that produced it.
----------------
simoll wrote:
> This is actually super important and should probably go into the formal semantics: the token value represents the dynamic instance of the producing instruction.
> If the token represents the dynamic instance **exactly** then this would also limit the freedom `llvm.experimental.convergence.anchor()` has. For example, this would rule out thread partitioning if it were so because then no token-producing instruction could return different token values per dynamic invocation.
The logical split between the two sections is that this section has the basic definitions, while the "Formal Rules" section has the rules about how the convergence control intrinsics place additional constraints on how dynamic instances can be formed.

> If the token represents the dynamic instance exactly then this would also limit the freedom llvm.experimental.convergence.anchor() has. For example, this would rule out thread partitioning if it were so because then no token-producing instruction could return different token values per dynamic invocation.

I'm not sure I understand the argument. What exactly do you mean by dynamic invocation here?

Each time a thread executes the same anchor call site, it will receive a different token value, corresponding to a different dynamic instance. That may or may not be the same dynamic instance as received by other threads. So even if control flow is entirely uniform, an implementation would be free to produce a different thread partitioning each time the anchor is executed. That is on purpose: if you want more predictable thread partitionings, use a combination of `entry` and `loop` intrinsics as required.
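
A sketch of that `entry` + `loop` combination (illustrative only; names and the loop bound are invented): the `entry` token pins the set of threads at function entry, and the `loop` heart keeps the grouping consistent across iterations instead of leaving it implementation-defined per execution:

```llvm
declare token @llvm.experimental.convergence.entry() convergent
declare token @llvm.experimental.convergence.loop() convergent
declare void @convergent.operation() convergent

define void @predictable(i32 %n) convergent {
entry:
  %e = call token @llvm.experimental.convergence.entry()
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  ; threads that entered the loop together stay grouped together
  %l = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %e) ]
  call void @convergent.operation() [ "convergencectrl"(token %l) ]
  %i.next = add i32 %i, 1
  %cc = icmp slt i32 %i.next, %n
  br i1 %cc, label %loop, label %exit

exit:
  ret void
}
```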


================
Comment at: llvm/docs/ConvergentOperations.rst:290-293
+The expectation is that for program "main" functions, such as kernel entry
+functions, whose caller is not visible to LLVM, the implementation returns a
+convergence token that represents uniform control flow, i.e. that is guaranteed
+to refer to all threads within a (target- or environment-dependent) group.
----------------
simoll wrote:
> Not sure whether the expectation of uniformity makes sense here: there could be a caller with a non-uniform convergence token in a different module. This may only become apparent when everything is linked together.
> 
> Would this be a property of the calling convention of the kernel function (ie if it's a GPU kernel we know that the entry token is all-uniform).
The intention is that the IR-based rules still apply regardless of whether the caller is in the same module or not. I'm not sure if this needs to be spelled out more clearly.

And yes, for other cases we should be able to think of it as a property of the calling convention.


================
Comment at: llvm/docs/ConvergentOperations.rst:339-343
+1. Let U be a static controlled convergent operation other than
+   :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+   whose convergence token is produced by an instruction D. Two threads
+   executing U execute the same dynamic instance of U if and only if they
+   obtained the token value from the same dynamic instance of D.
----------------
simoll wrote:
> Should suffice to say that the two threads will execute the same instance if they see the same token value.
> Above you stated that the token value represents the dynamic instance of the defining instruction.
No, this is explicitly not sufficient. You can have:
```
  %tok = call token @llvm.experimental.convergence.anchor()
  br i1 %cc, label %then, label %next

then:
  call void @convergent_op() [ "convergencectrl"(token %tok) ]
  br label %next

next:
```


================
Comment at: llvm/docs/ConvergentOperations.rst:361-365
+   (Informational note: If the function is executed for some reason outside of
+   the scope of LLVM IR, e.g. because it is a kernel entry function, then this
+   rule does not apply. On the other hand, if a thread executes the function
+   due to a call from IR, then the thread cannot "spontaneously converge" with
+   threads that execute the function for some other reason.)
----------------
t-tye wrote:
> See comments above. Would it be possible to unify this with the definition of ``llvm.experimental.convergence.anchor``? That also needs defining here.
> 
> Seems this rule could be left as is without the "If the function is executed for some reason outside of the scope of LLVM IR, e.g. because it is a kernel entry function, then this rule does not apply. On the other hand," part. And a new rule needs to be added to specify what the dynamic instance is for when F is not invoked by a ``call``, ``invoke``, or ``callbr`` instruction. That rule would reference the language semantics that defines how threads are partitioned into dynamic instances. For OpenCL that is based on the subgroup language definition, etc.
I think this comment may have moved to a confusing location relative to the document.

`entry` and `anchor` are inherently different.

I'm going to add a note about looking at language specs etc.


================
Comment at: llvm/docs/ConvergentOperations.rst:387-388
+
+4. If a convergence region contains a use of a convergence token, then it must
+   also contain its definition.
+
----------------
simoll wrote:
> Isn't 4. implied by the fact that this is SSA and the convergence region consists of all blocks that are dominated by the definition?
No, the rule excludes code such as:
```
  %a = call token @llvm.experimental.convergence.anchor()
  %b = call token @llvm.experimental.convergence.anchor()
  call void @convergent_op() [ "convergencectrl"(token %a) ]
  call void @convergent_op() [ "convergencectrl"(token %b) ]
```
The convergence region of `%b` contains a use of `%a` but not its definition.
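
A rearranged variant that does satisfy rule 4 (my own sketch, not from the patch) -- all uses of `%a` precede the definition of `%b`, so the convergence region of `%b` no longer contains a stray use of `%a`:

```llvm
  %a = call token @llvm.experimental.convergence.anchor()
  call void @convergent_op() [ "convergencectrl"(token %a) ]
  %b = call token @llvm.experimental.convergence.anchor()
  call void @convergent_op() [ "convergencectrl"(token %b) ]
```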

I'm going to add a note about nesting.


================
Comment at: llvm/docs/ConvergentOperations.rst:404
+barrier sense nor in the control barrier sense of synchronizing the execution
+of threads.
+
----------------
t-tye wrote:
> efriedma wrote:
> > It's a bit of an exaggeration to say it has no effect on the memory model.  Consider the thread group reduction example: there's implicitly some bit of "memory" used to communicate.  (For the definition of readnone, "memory" is anything used to store/communicate state.)  Whether that bit of memory is the same for two instructions depends on whether they correspond to the same dynamic instance.
> > 
> > Of course, if you don't use any attributes, we'll conservatively assume that the memory accessed by an intrinsic depends on the current thread ID or something like that, so this is really only interesting if you're using readonly/readnone/etc.
> It does seem that traditionally the cross lane operations are not considered as using "memory" (in the sense of the language memory model) to do their communication. It is true that an implementation may use memory/storage to do this, but that is outside the memory behavior being defined by the language memory model.
> 
> One could argue that execution barriers are also communication and so may use storage/memory in their implementation, yet languages seem to choose to not include that in the memory model. Although those languages may allow memory model semantics to be optionally specified in addition to the execution barrier semantics.
> 
> What is attractive about this formalism is it is clearly defining semantics for both cross thread execution communication, distinct from cross thread language memory model communication. The SIMD/SIMT languages [often informally] appear to have this distinction and this allows LLVM IR to model that set of semantics accurately.
I agree with @t-tye's explanation here. The choice here reflects the choice made e.g. in the Vulkan memory model: the only "convergent" operation (not the term used in Vulkan...) which interacts with the memory model is OpControlBarrier, so it's good to be able to treat these two kinds of communication orthogonally.


================
Comment at: llvm/docs/ConvergentOperations.rst:446
+  while (counter > 0) {
+    %tok = call tok @llvm.experimental.convergence.anchor()
+    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
----------------
t-tye wrote:
> arsenm wrote:
> > This and a lot of the later examples use "call tok" instead of the proper "call token"
> This seems to be the motivation for why llvm.experimental.convergence.anchor is wanted rather than a token flowing into the enclosing function.
> 
> Or could this transformation also be done if it used a token obtained from llvm.experimental.convergence.entry outside the loop? Why would this example not use llvm.experimental.convergence.loop since each loop iteration could involve a different dynamic instance? Or is that the point, this is explicitly saying all the threads that entered the loop must participate, and transformation cannot change this. But wouldn't using llvm.experimental.convergence.loop also enforce that in this case?
> 
> It still feels like llvm.experimental.convergence.anchor is materializing the set of threads out of thin air rather than as a clear "chain of custody" from the function entry (transitively passed via call sites). If one did do that could there be clear transformations to determine when this transformation is legal?
> It still feels like llvm.experimental.convergence.anchor is materializing the set of threads out of thin air rather than as a clear "chain of custody" from the function entry (transitively passed via call sites).

Yes, that is the point of `llvm.experimental.convergence.anchor`.

And yes, if there was clear "chain of custody" as you call it from outside of the loop, then this unrolling with remainder would be incorrect.


================
Comment at: llvm/docs/ConvergentOperations.rst:470
+are threads whose initial counter value is not a multiple of 2. That is allowed
+because the anchor intrinsic has implementation-defined convergence behavior
+and the loop unrolling transform is considered to be part of the
----------------
t-tye wrote:
> This confuses me. Shouldn't these intrinsics have well defined semantics so that source languages can map their semantics on to them? How is that possible if the intrinsics do not have well defined meaning? Their implementation would still be target/implementation defined.
I hope this has been answered in the context of your other comments?


================
Comment at: llvm/docs/ConvergentOperations.rst:507
+:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+intrinsic outside of the loop header uses a token defined outside of the loop
+can generally not be unrolled.
----------------
t-tye wrote:
> t-tye wrote:
> > header,
> loop,
Is that still grammatically correct? The parse of the sentence is

> Loops in which ((a loop intrinsic outside of the loop header) uses a token defined outside of the loop)

That is, "a loop intrinsic outside of the loop header" is the subject of the sentence in the outer parentheses.


================
Comment at: llvm/docs/ConvergentOperations.rst:522-524
+Assuming that ``%tok`` is only used inside the conditional block, the anchor can
+be sunk. Again, the rationale is that the anchor has implementation-defined
+behavior, and the sinking is part of the implementation.
----------------
sameerds wrote:
> t-tye wrote:
> > This also confuses me. If anchor is supposed to denote the current set of threads in the current dynamic instance, then it seems undefined IR to use it in the conditional when all those threads cannot be performing the dynamic operation instance. I feel I am missing a fundamental aspect of the formal model.
> +1
> 
> To me, the whole point of this new concept is to capture control dependency so that we don't have to go look at branch conditions again. But allowing such a transformation reintroduces the need to go check the control dependency to understand which threads are really executing this instance.
I mean, `anchor` is implementation-defined, so you can't make a totally solid statement anyway. You could only make solid *relative* statements if the token produced by the anchor was also used by some other convergent operations, and if those are outside of the if-statement, the sinking wouldn't be allowed anymore anyway...
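
A sketch of the sinking transform under discussion (reconstructed for illustration, not quoted verbatim from the patch):

```llvm
; before: anchor above the branch, %tok used only in %then
  %tok = call token @llvm.experimental.convergence.anchor()
  br i1 %cc, label %then, label %next

then:
  call void @convergent.operation() [ "convergencectrl"(token %tok) ]
  br label %next

; after: the anchor has been sunk into the conditional block
  br i1 %cc, label %then, label %next

then:
  %tok = call token @llvm.experimental.convergence.anchor()
  call void @convergent.operation() [ "convergencectrl"(token %tok) ]
  br label %next
```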


================
Comment at: llvm/docs/ConvergentOperations.rst:547-551
+  }
+
+The behavior is unchanged, since each of the static convergent operations only
+ever communicates with threads that have the same ``condition`` value.
+By contrast, hoisting the convergent operations themselves is forbidden.
----------------
sameerds wrote:
> t-tye wrote:
> > So the convergent token is the set of threads, but any intervening conditional control flow may change which threads a nested convergent operation may be required to communicate with?
> > 
> > My understanding was that the tokens were intended to be explicit in denoting the involved threads to avoid needing to crawl the LLVM IR to determine the control dependence. And were intended to be explicit in preventing control dependence changes. But these examples seem to contradict that understanding.
> > 
> > So when a convergent token is used in a dynamic instance of a static convergent operation, what set of threads is it mandating have to participate? Those defined by the dynamic instance of the static token definition that control dependence permits to execute?
> This is also the transform that CUDA (and potentially HIP) will disallow. Hoisting or sinking a conditional changes the set of threads executing each leg of the branch. In CUDA, the two programs have completely different meanings depending on whether the anchor is outside the branch or inside each leg. There seems to be an opportunity here to relate the notion of an anchor to language builtins that return the mask of currently executing threads.
CUDA is very different here: the builtins that take an explicit threadmask don't have an implicit dependence on control flow, so they shouldn't be modeled as convergent operations. They have other downsides, which is why we prefer to go down this path of convergent operations.


================
Comment at: llvm/docs/ConvergentOperations.rst:575-578
+behavior could end up being different. If the anchor is inside the loop, then
+the grouping of threads during the execution of the anchor -- i.e., the sets of
+threads executing the same dynamic instance of it -- can change in an arbitrary,
+implementation-defined way in each iteration.
----------------
t-tye wrote:
> I think this is the part that I am struggling with. It feels like llvm.experimental.convergence.anchor is allowed to partition the threads in an arbitrary way. So how does that square with the language mandating how the threads must be partitioned?
Should be answered elsewhere.


================
Comment at: llvm/docs/ConvergentOperations.rst:604-605
+
+The rationale is that the anchor intrinsic has implementation-defined behavior,
+and the sinking transform is considered to be part of the implementation.
+
----------------
t-tye wrote:
> This seems to contradict the pixel example at the beginning. Or is this transformation allowed if it can be proven that pure.convergent.operation does not rely on the result from the threads that would not execute the condition to true? How could that be done?
The pixel example would use `entry` instead of `anchor`. I'm going to add that example.
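
My guess at the shape of that example (sketched here, not quoted from the patch; names invented): with `entry`, the convergent operation is tied to the full set of threads present at function entry, so sinking it into the conditional block is forbidden:

```llvm
declare token @llvm.experimental.convergence.entry() convergent
declare void @convergent.operation() convergent

define void @pixel(i1 %helper) convergent {
entry:
  %e = call token @llvm.experimental.convergence.entry()
  ; tied to the threads from function entry; sinking this
  ; into %then would change which threads communicate
  call void @convergent.operation() [ "convergencectrl"(token %e) ]
  br i1 %helper, label %then, label %next

then:
  br label %next

next:
  ret void
}
```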


================
Comment at: llvm/docs/ConvergentOperations.rst:614-615
+
+Note that the entry intrinsic behaves differently. Sinking the convergent
+operations is forbidden in the following snippet:
+
----------------
t-tye wrote:
> Again still not clear how llvm.experimental.convergence.anchor can be allowed to be implementation defined. Or is this saying that when the set of threads is defined by the language, llvm.experimental.convergence.entry must be used.
> 
> Maybe the graphics languages are looser in their execution model to allow arbitrary implementation of some aspects and that is what llvm.experimental.convergence.anchor is modeling? But it cannot be used for compute languages that have [debatably] stronger rules?
Should be answered elsewhere.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603


