[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Tony Tye via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sun Aug 9 13:48:48 PDT 2020


t-tye added a comment.

Sorry, just lots of questions :-)



================
Comment at: llvm/docs/ConvergentOperations.rst:29
+memory model, where the set of threads which participate in communication is
+implicitly defined or at least affected by control flow.
+
----------------
What would be an example where control flow affects the set of threads without implicitly defining it?


================
Comment at: llvm/docs/ConvergentOperations.rst:182
+* In an implementation that reconverges at post-dominators, threads reconverge
+  at ``mid`` in the first version, so that all threads (within a subgroup/wave)
+  that execute the control barrier do so together. In the second version,
----------------
Should wave be used here? Above, the concept of SIMD is used, so would SIMD instruction be a better term to use?


================
Comment at: llvm/docs/ConvergentOperations.rst:183-185
+  that execute the control barrier do so together. In the second version,
+  threads that reach the control barrier via different paths synchronize
+  separately.
----------------
Clarify why the second version is different? Perhaps:

In the second version, threads reconverge at `end`, causing threads that reach the control barrier via different paths to synchronize separately.


================
Comment at: llvm/docs/ConvergentOperations.rst:191
+
+
+Dynamic Instances and Convergence Tokens
----------------
.. _dynamic_instances_and_convergence_tokens:


================
Comment at: llvm/docs/ConvergentOperations.rst:199
+
+1. Different executions of the same static instruction by a single thread
+   give rise to different dynamic instances of that instruction.
----------------
The notion of *static instruction* has not been defined. Above it simply uses the term *LLVM IR instruction*. Suggest either using that term here, or defining *static instruction* above.


================
Comment at: llvm/docs/ConvergentOperations.rst:205-206
+
+3. Executions of the same static instruction by different threads may occur in
+   the same dynamic instance.
+
----------------
This is an important concept to understand. Does more need to be said about the "may" part?


================
Comment at: llvm/docs/ConvergentOperations.rst:223-224
+operations that have a ``convergencectrl`` operand bundle are considered
+*controlled* convergent operations. Other convergent operations are
+*uncontrolled*.
+
----------------
What other convergent operations exist that are not defined in this document? It seems it would be good to enumerate them or provide a reference to where more about them can be found.


================
Comment at: llvm/docs/ConvergentOperations.rst:244
+
+This intrinsic is a marker that acts as an "anchor" producing an initial
+convergence token. The set of threads executing the same dynamic instance of
----------------
```
*anchor*
```


================
Comment at: llvm/docs/ConvergentOperations.rst:245-246
+This intrinsic is a marker that acts as an "anchor" producing an initial
+convergence token. The set of threads executing the same dynamic instance of
+this intrinsic is implementation-defined.
+
----------------
See questions below. I had been assuming that the set of threads would be well defined by the source language and not be an implementation defined concept. I was thinking this is present to model source language semantics, not different target implementation approaches. I feel I am missing something.


================
Comment at: llvm/docs/ConvergentOperations.rst:247-251
+
+The expectation is that all threads within a group that "happen to be active at
+the same time" will execute the same dynamic instance, so that programs can
+detect the maximal set of threads that can communicate efficiently within
+some local region of the program.
----------------
So where should it be defined what the set of threads should be? It seems it is not a target-dependent concept, as the target must implement the semantics of the programming language. So should each clang_lang define the initial set of threads at the construct that denotes the beginning of execution of the various constructs? For example, an OpenCL kernel, a CUDA __device__ function, or a C/C++ `main` function.

Presumably later text will define how the set of threads is passed between a call site and a called function?

"happen to be active at the same time" does not seem the right sentiment. The programing language semantics will dictate what the set is. For example, OpenCL may define the set to be the work-items that are members of the same subgroup as defined by the OpenCL language. It is not all the work-items that start executing the dispatch grid as a whole which may reasonably also be considered to "happen to be active at the same time". So may be this needs to admit that the language prescribes the set? Then a reference to the language specific page that defines this in a "ClangRef" document?


================
Comment at: llvm/docs/ConvergentOperations.rst:263
+
+This intrinsic defines the "heart" of a loop, i.e. the place where an imaginary
+loop counter is incremented for the purpose of determining convergence
----------------
```
*heart*
```


================
Comment at: llvm/docs/ConvergentOperations.rst:283-285
+
+This intrinsic returns the convergence token that was used in the
+``convergencectrl`` operand bundle when the current function was called.
----------------
Would another way to achieve this be that the LLVM IR function itself have a convergencectrl bundle? This reflects that a function is "passed in" the set of threads.
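For comparison, my reading of the mechanism as proposed in the patch is roughly the following; the suggestion above would instead attach the token to the function itself, which is a hypothetical alternative, not something the patch defines.

```llvm
; As proposed: the callee recovers the caller's token via an intrinsic
; call in its entry block.
define void @callee() convergent {
entry:
  %tok = call token @llvm.experimental.convergence.entry()
  call void @convergent.operation() [ "convergencectrl"(token %tok) ]
  ret void
}

; The caller passes the set of threads implicitly through the bundle on
; the call site:
;   call void @callee() [ "convergencectrl"(token %outer) ]
```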


================
Comment at: llvm/docs/ConvergentOperations.rst:293
+convergence token that represents uniform control flow, i.e. that is guaranteed
+to refer to all threads within a (target- or environment-dependent) group.
+
----------------
As mentioned above, is this dependent on the language semantics?


================
Comment at: llvm/docs/ConvergentOperations.rst:295-296
+
+Behavior is undefined if this intrinsic appears in a function that isn't
+``convergent``.
+
----------------
arsenm wrote:
> Should this just be a verifier error? Why make it undefined?
The formal model needs to state the legality. It would in addition be good to have the verifier enforce the requirement.


================
Comment at: llvm/docs/ConvergentOperations.rst:298-299
+
+Behavior is undefined if this intrinsic appears inside of another convergence
+region or outside of a function's entry block.
+
----------------
arsenm wrote:
> arsenm wrote:
> > This should also just be a verifier check?
> Is it legal for this to be called multiple times in the same function?
I would assume it can be called multiple times in the same function provided it is not in another convergence region. If the token were an operand (or result of?) of the LLVM IR function then seems this would become simpler as would simply reference that value.

What is "another convergence region"? These tokens are deliberately not lexical scopes so they can describe unstructured control flow. So what is a "region" in this sense?

Is it that only one token should be used per dynamic "region" instance? That ``llvm.experimental.convergence.loop`` is effectively partitioning the parent token into the loop iteration instances, and it is not meaningful to use the parent token inside one of those loop instances? Basically, within one post-dominator region only one token should be used? Maybe this is all explained in the formal section.

In any case, the term "another convergence region" needs defining. Since it cannot be outside the function's entry block, how can it be in another region anyway?

When is ``llvm.experimental.convergence.entry`` as opposed to ``llvm.experimental.convergence.anchor`` used? Seems they are both conceptually doing the same thing. When would ``llvm.experimental.convergence.anchor`` be used, since the start of the program is typically also a function and ``llvm.experimental.convergence.entry`` could simply be capturing that "outside LLVM" token?
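A side-by-side sketch of how I currently read the two intrinsics (my reading, which these questions are trying to confirm, not a statement of the patch's semantics):

```llvm
; entry: inherits the set of threads from the caller -- a "chain of
; custody" through the call site.
define void @f() convergent {
entry:
  %inherited = call token @llvm.experimental.convergence.entry()
  call void @convergent.operation() [ "convergencectrl"(token %inherited) ]
  ret void
}

; anchor: materializes an implementation-defined set of threads with no
; relation to the caller's set.
define void @g() {
entry:
  %fresh = call token @llvm.experimental.convergence.anchor()
  call void @convergent.operation() [ "convergencectrl"(token %fresh) ]
  ret void
}
```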


================
Comment at: llvm/docs/ConvergentOperations.rst:338
+Rules on the execution of dynamic instances:
+
+1. Let U be a static controlled convergent operation other than
----------------
Should the rules start by defining what "static controlled convergent operation" means, as that term is used in the following rule? There is a definition above for "controlled convergent operation". The "static" part seems undefined, as mentioned above.

An "uncontrolled convergent operation" also needs to be defined as it is used in the last rule.


================
Comment at: llvm/docs/ConvergentOperations.rst:347-351
+   execute the same dynamic instance of U if and only if
+   (1) they obtained the ``convergencectrl`` token operand value from the same
+   dynamic instance of the defining instruction, and
+   (2) there is an *n* such that both threads execute U for the *n*'th time
+   with that same token operand value.
----------------
```
if and only if:

  1. They obtained the ``convergencectrl`` token operand value from the same dynamic instance of the defining instruction, and
  2. There is an *n* such that both threads execute U for the *n*'th time with that same token operand value.
```


================
Comment at: llvm/docs/ConvergentOperations.rst:361-365
+   (Informational note: If the function is executed for some reason outside of
+   the scope of LLVM IR, e.g. because it is a kernel entry function, then this
+   rule does not apply. On the other hand, if a thread executes the function
+   due to a call from IR, then the thread cannot "spontaneously converge" with
+   threads that execute the function for some other reason.)
----------------
See comments above. Would it be possible to unify this with the definition of ``llvm.experimental.convergence.anchor``? That also needs defining here.

It seems this rule could be left as is without the "If the function is executed for some reason outside of the scope of LLVM IR, e.g. because it is a kernel entry function, then this rule does not apply. On the other hand," part. And a new rule needs to be added to specify what the dynamic instance is when F is not invoked by a ``call``, ``invoke``, or ``callbr`` instruction. That rule would reference the language semantics that define how threads are partitioned into dynamic instances. For OpenCL that is based on the subgroup language definition, etc.


================
Comment at: llvm/docs/ConvergentOperations.rst:377-380
+2. Every cycle in the CFG that contains two or more static uses of a
+   convergence token by
+   :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+   must also contain the definition of the token.
----------------
Is this intended to say:

```
2. Every cycle in the CFG that contains two or more static uses of a convergence token T by :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>` must also contain the definition of T.
```

Or could T be different for each use?

Suggest a similar change to the previous rule to make it clearer.


================
Comment at: llvm/docs/ConvergentOperations.rst:382-383
+
+3. The *convergence region* corresponding to a convergence token T is the
+   region in which T is live (i.e., the subset of the dominance region of the
+   definition of T from which a use of T can be reached without leaving the
----------------
"the minimal region in which T is live and used"

Should this be clarified to say that it is the minimal live region (in the same way that phi nodes can be minimally created)? Another interpretation of "live" allows the value to be live outside the dominance region.

The "(i.e. ...)" is not really a "namely". It is actually part of the definition of what "the region in which T is live" means unless the above change (or similar) is made.

Does "dominance region" need defining? T may be in many nested dominance regions, I assume here it means the minimal one? The set of blocks that are dominated by the immediate dominator of the block containing T?

Then what is the subset of the dominance region being defined by "convergence region"? How can the use of T happen outside the dominance region? Wouldn't that imply a phi? But above it was stated tokens cannot be used in a phi. Does the subset respect the blocks that the use must pass through to reach the block containing the use?

Or is the definition only blocks that are dominated by the block containing the definition of T that also use T or are on a path from the definition to a use of T? Again, how can there be blocks on a path between the definition of T and a use of T that are not dominated by the block containing the definition of T, given that phi nodes are not allowed to specify a token?

Maybe more explanation is needed?


================
Comment at: llvm/docs/ConvergentOperations.rst:390-393
+The freedom of targets to define the target-specific rule about uncontrolled
+convergent operations is limited by the following rule: A transform is correct
+for uncontrolled convergence operations if it does not make such operations
+control-dependent on additional values.
----------------
Is this the legacy definition of convergence that is now deprecated? Would it be good to clarify that? Perhaps the legacy rules should be in a separate section so they do not get muddled with the new rules, and can be deleted once the deprecated support is removed.


================
Comment at: llvm/docs/ConvergentOperations.rst:404
+barrier sense nor in the control barrier sense of synchronizing the execution
+of threads.
+
----------------
efriedma wrote:
> It's a bit of an exaggeration to say it has no effect on the memory model.  Consider the thread group reduction example: there's implicitly some bit of "memory" used to communicate.  (For the definition of readnone, "memory" is anything used to store/communicate state.)  Whether that bit of memory is the same for two instructions depends on whether they correspond to the same dynamic instance.
> 
> Of course, if you don't use any attributes, we'll conservatively assume that the memory accessed by an intrinsic depends on the current thread ID or something like that, so this is really only interesting if you're using readonly/readnone/etc.
It does seem that traditionally the cross lane operations are not considered as using "memory" (in the sense of the language memory model) to do their communication. It is true that an implementation may use memory/storage to do this, but that is outside the memory behavior being defined by the language memory model.

One could argue that execution barriers are also communication and so may use storage/memory in their implementation, yet languages seem to choose not to include that in the memory model, although those languages may allow memory model semantics to be optionally specified in addition to the execution barrier semantics.

What is attractive about this formalism is that it clearly defines semantics for cross-thread execution communication as distinct from cross-thread language memory model communication. The SIMD/SIMT languages [often informally] appear to make this distinction, and this allows LLVM IR to model that set of semantics accurately.


================
Comment at: llvm/docs/ConvergentOperations.rst:407
+Threads that execute the same dynamic instance do not necessarily do so at the
+same time.
+
----------------
efriedma wrote:
> You don't really define "same time" anywhere. That's probably outside the scope of this document anyway, but not sure referring to it here adds anything.
I think there is value in mentioning this, but it should be an "informational note".

The insight in having this present is that it is the responsibility of the implementation to implement the "as if" semantics. This is comparable to the way the memory model presents an "as if" set of rules that the physical hardware may not in fact be literally following.

The point being that these rules can be implemented on systems that do not have physical SIMD/SIMT hardware. In such systems the dynamic instruction instances may not be executed at the same time, and other means are used to ensure the communication happens correctly (perhaps staging buffers). This is even true on SIMD/SIMT hardware if the set of threads is larger than the SIMD/SIMT instruction size, as happens, for example, when the subgroup size requires multiple waves/warps and scratchpad memory is used.


================
Comment at: llvm/docs/ConvergentOperations.rst:446
+  while (counter > 0) {
+    %tok = call tok @llvm.experimental.convergence.anchor()
+    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
----------------
arsenm wrote:
> This and a lot of the later examples use "call tok" instead of the proper "call token"
This seems to be the motivation for why llvm.experimental.convergence.anchor is wanted rather than a token flowing into the enclosing function.

Or could this transformation also be done if it used a token obtained from llvm.experimental.convergence.entry outside the loop? Why would this example not use llvm.experimental.convergence.loop, since each loop iteration could involve a different dynamic instance? Or is that the point: this is explicitly saying all the threads that entered the loop must participate, and the transformation cannot change this? But wouldn't using llvm.experimental.convergence.loop also enforce that in this case?

It still feels like llvm.experimental.convergence.anchor is materializing the set of threads out of thin air rather than as a clear "chain of custody" from the function entry (transitively passed via call sites). If one did that, could there be clear rules to determine when this transformation is legal?
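The alternative asked about would look roughly like this, written in the same mixed C-like pseudocode and IR style the document's examples use (a sketch based on my reading of the loop intrinsic, not text from the patch):

```llvm
  %outer = call token @llvm.experimental.convergence.anchor()
  while (counter > 0) {
    ; each iteration is a distinct dynamic instance derived from %outer
    %iter = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
    call void @convergent.operation() [ "convergencectrl"(token %iter) ]
    counter--;
  }
```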


================
Comment at: llvm/docs/ConvergentOperations.rst:470
+are threads whose initial counter value is not a multiple of 2. That is allowed
+because the anchor intrinsic has implementation-defined convergence behavior
+and the loop unrolling transform is considered to be part of the
----------------
This confuses me. Shouldn't these intrinsics have well-defined semantics so that source languages can map their semantics onto them? How is that possible if the intrinsics do not have a well-defined meaning? Their implementation would still be target/implementation-defined.


================
Comment at: llvm/docs/ConvergentOperations.rst:507
+:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+intrinsic outside of the loop header uses a token defined outside of the loop
+can generally not be unrolled.
----------------
header,


================
Comment at: llvm/docs/ConvergentOperations.rst:507
+:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+intrinsic outside of the loop header uses a token defined outside of the loop
+can generally not be unrolled.
----------------
t-tye wrote:
> header,
loop,


================
Comment at: llvm/docs/ConvergentOperations.rst:522-524
+Assuming that ``%tok`` is only used inside the conditional block, the anchor can
+be sunk. Again, the rationale is that the anchor has implementation-defined
+behavior, and the sinking is part of the implementation.
----------------
This also confuses me. If anchor is supposed to denote the current set of threads in the current dynamic instance, then it seems like undefined IR to use it in the conditional when all those threads cannot be performing the dynamic operation instance. I feel I am missing a fundamental aspect of the formal model.


================
Comment at: llvm/docs/ConvergentOperations.rst:531
+  if (condition) {
+    %tok = @call tok llvm.experimental.convergence.anchor()
+    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
----------------
arsenm wrote:
> %tok defined in both branches looks like broken SSA to me
Which would mean a phi, which is not allowed.

But again this is changing what set of threads %tok is denoting, so I feel I am not understanding what a convergent token is fundamentally denoting. My thinking had been that the convergent tokens were a way that the high-level language mapping to LLVM IR can communicate the language-mandated convergence rules. But these examples seem to dispel that notion and make it a target-dependent concept unrelated to the source language.
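What I assume was intended is a separate anchor, and hence a separate token, in each branch, which keeps SSA form valid without a phi of token type (my guess at the intended example, in the document's mixed pseudocode/IR style):

```llvm
  if (condition) {
    %tok1 = call token @llvm.experimental.convergence.anchor()
    call void @convergent.operation() [ "convergencectrl"(token %tok1) ]
  } else {
    %tok2 = call token @llvm.experimental.convergence.anchor()
    call void @convergent.operation() [ "convergencectrl"(token %tok2) ]
  }
```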


================
Comment at: llvm/docs/ConvergentOperations.rst:547-551
+  }
+
+The behavior is unchanged, since each of the static convergent operations only
+ever communicates with threads that have the same ``condition`` value.
+By contrast, hoisting the convergent operations themselves is forbidden.
----------------
So the convergent token is the set of threads, but any intervening conditional control flow may change which threads a nested convergent operation may be required to communicate with?

My understanding was that the tokens were intended to be explicit in denoting the involved threads to avoid needing to crawl the LLVM IR to determine the control dependence. And were intended to be explicit in preventing control dependence changes. But these examples seem to contradict that understanding.

So when a convergent token is used in a dynamic instance of a static convergent operation, what set of threads is it mandating have to participate? Those defined by the dynamic instance of the static token definition that control dependence permits to execute?


================
Comment at: llvm/docs/ConvergentOperations.rst:575-578
+behavior could end up being different. If the anchor is inside the loop, then
+the grouping of threads during the execution of the anchor -- i.e., the sets of
+threads executing the same dynamic instance of it -- can change in an arbitrary,
+implementation-defined way in each iteration.
----------------
I think this is the part that I am struggling with. It feels like llvm.experimental.convergence.anchor is allowed to partition the threads in an arbitrary way. So how does that square with the language mandating how the threads must be partitioned?


================
Comment at: llvm/docs/ConvergentOperations.rst:604-605
+
+The rationale is that the anchor intrinsic has implementation-defined behavior,
+and the sinking transform is considered to be part of the implementation.
+
----------------
This seems to contradict the pixel example at the beginning. Or is this transformation allowed if it can be proven that pure.convergent.operation does not rely on the result from the threads for which the condition is not true? How could that be done?


================
Comment at: llvm/docs/ConvergentOperations.rst:614-615
+
+Note that the entry intrinsic behaves differently. Sinking the convergent
+operations is forbidden in the following snippet:
+
----------------
Again, it is still not clear how llvm.experimental.convergence.anchor can be allowed to be implementation-defined. Or is this saying that when the set of threads is defined by the language, llvm.experimental.convergence.entry must be used?

Maybe the graphics languages are looser in their execution model, allowing arbitrary implementation of some aspects, and that is what llvm.experimental.convergence.anchor is modeling? But it cannot be used for compute languages that have [debatably] stronger rules?


================
Comment at: llvm/docs/LangRef.rst:1485
+    involving control flow are forbidden. For a detailed description, see the
+    `Convergent Operations <ConvergentOperations.html>`_ document.
 
----------------
Use the Sphinx document reference:


```
:doc:`ConvergentOperations`
```


================
Comment at: llvm/docs/LangRef.rst:1488
     The optimizer may remove the ``convergent`` attribute on functions when it
-    can prove that the function does not execute any convergent operations.
+    can prove that the function does not execute uncontrolled convergent
+    operations or ``llvm.experimental.convergent.entry``.
----------------
Add a reference to the definition of the term?

```
(see :ref:`dynamic_instances_and_convergence_tokens`)
```


================
Comment at: llvm/docs/LangRef.rst:2293
+When present, the operand bundle must contain exactly one value of token type.
+See the `Convergent Operations <ConvergentOperations.html>`_ document for
+details.
----------------

```
:doc:`ConvergentOperations`
```


================
Comment at: llvm/docs/LangRef.rst:16107
+operations, which all start with the ``llvm.experimental.convergence.``
+prefix, are described in the `Convergent Operations <ConvergentOperations.html>`_
+document.
----------------

```
:doc:`ConvergentOperations`
```


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603


