[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Nicolai Hähnle via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Aug 10 08:52:48 PDT 2020


nhaehnle marked 6 inline comments as done.
nhaehnle added inline comments.


================
Comment at: llvm/docs/ConvergentOperations.rst:29
+memory model, where the set of threads which participate in communication is
+implicitly defined or at least affected by control flow.
+
----------------
t-tye wrote:
> What would be an example where control flow affects without implicitly defining the set of threads?
Control flow alone is not enough to define the set of threads, because the initial set of threads is always defined in an environment-specific way, e.g. by how a kernel launch groups threads into waves and workgroups. I'm going to remove the "implicitly defined" part in the hope that this avoids confusion.


================
Comment at: llvm/docs/ConvergentOperations.rst:182
+* In an implementation that reconverges at post-dominators, threads reconverge
+  at ``mid`` in the first version, so that all threads (within a subgroup/wave)
+  that execute the control barrier do so together. In the second version,
----------------
t-tye wrote:
> Should wave be used here? Above the concept of SIMD is used so would SIMD instruction be a better term to use?
The term "subgroup" is used in the example code, which strongly hints at GLSL / SPIR-V / Vulkan terminology.


================
Comment at: llvm/docs/ConvergentOperations.rst:183-185
+  that execute the control barrier do so together. In the second version,
+  threads that reach the control barrier via different paths synchronize
+  separately.
----------------
t-tye wrote:
> Clarify why the second version is different? Perhaps:
> 
> In the second version, threads reconverge at `end`, causing threads that reach the control barrier via different paths to synchronize separately.
Going to add: "the first (and only) post-dominator is ``end``, so threads do not reconverge before then"


================
Comment at: llvm/docs/ConvergentOperations.rst:199
+
+1. Different executions of the same static instruction by a single thread
+   give rise to different dynamic instances of that instruction.
----------------
t-tye wrote:
> The notion of *static instruction* has not been defined. Above it simply uses the term *LLVM IR instruction*. Suggest either using that term here, or defining *static instruction* above.
I'm going to try to rephrase this a bit more explicitly.


================
Comment at: llvm/docs/ConvergentOperations.rst:205-206
+
+3. Executions of the same static instruction by different threads may occur in
+   the same dynamic instance.
+
----------------
t-tye wrote:
> This is an important concept to understand. Does more need to be said about the "may" part?
In a sense, that's what the rest of the document is about, so... hopefully not here? :)


================
Comment at: llvm/docs/ConvergentOperations.rst:223-224
+operations that have a ``convergencectrl`` operand bundle are considered
+*controlled* convergent operations. Other convergent operations are
+*uncontrolled*.
+
----------------
t-tye wrote:
> What other convergent operations exist that are not defined in this document? Seems would be good to enumerate them or provide a reference on where to find more about them.
Well, they'd be deprecated, so I really don't want to talk too much about them... I'm going to rearrange this to hopefully make that clearer.


================
Comment at: llvm/docs/ConvergentOperations.rst:247-251
+
+The expectation is that all threads within a group that "happen to be active at
+the same time" will execute the same dynamic instance, so that programs can
+detect the maximal set of threads that can communicate efficiently within
+some local region of the program.
----------------
t-tye wrote:
> So where should it be defined what the set of threads should be? It seems it is not a target-dependent concept, as the target must implement the semantics of the programming language. So should each clang_lang define the initial set of threads at the construct that denotes the beginning of execution of the various constructs? For example, an OpenCL kernel, a CUDA __device__ function, or a C/C++ `main` function.
> 
> Presumably later text will define how the set of threads is passed between a call site and a called function?
> 
> "happen to be active at the same time" does not seem the right sentiment. The programing language semantics will dictate what the set is. For example, OpenCL may define the set to be the work-items that are members of the same subgroup as defined by the OpenCL language. It is not all the work-items that start executing the dispatch grid as a whole which may reasonably also be considered to "happen to be active at the same time". So may be this needs to admit that the language prescribes the set? Then a reference to the language specific page that defines this in a "ClangRef" document?
For what you have in mind, you want to be looking at the `entry` intrinsic instead of the `anchor` intrinsic.

The `entry` intrinsic is used to form a relation with the group of converged threads at function entry, which for the kernel entry point would be the entire wave/workgroup/subgroup. For a called function, it would be the set of threads indicated by the `convergencectrl` operand bundle at the call site.

The `anchor` is there so that we can explicitly express: we don't care which threads go together; all we care about is that the operations referring to the same anchor are executed by the same set of threads (subject to control flow and all the other rules).
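
To make the contrast concrete, here is a minimal sketch (the function and operation names are placeholders of mine, not part of the patch):

```
declare token @llvm.experimental.convergence.entry() convergent
declare token @llvm.experimental.convergence.anchor() convergent
declare void @convergent.op() convergent

; entry: relate the operation to the threads that entered the function
; together -- for a kernel entry point, the wave/workgroup/subgroup set up
; by the launch; for a called function, the threads named by the
; convergencectrl bundle at the call site.
define void @kernel() convergent {
  %tok = call token @llvm.experimental.convergence.entry()
  call void @convergent.op() [ "convergencectrl"(token %tok) ]
  ret void
}

; anchor: we don't care which threads go together, only that operations
; referring to the same anchor execute with the same set of threads.
define void @helper() {
  %tok = call token @llvm.experimental.convergence.anchor()
  call void @convergent.op() [ "convergencectrl"(token %tok) ]
  ret void
}
```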


================
Comment at: llvm/docs/ConvergentOperations.rst:283-285
+
+This intrinsic returns the convergence token that was used in the
+``convergencectrl`` operand bundle when the current function was called.
----------------
t-tye wrote:
> Would another way to achieve this be that the LLVM IR function itself have a convergencectrl bundle? This reflects that a function is "passed in" the set of threads.
I'm not sure what exactly the point is here. That's just how operand bundles work in LLVM: you add them to a call site, so the caller of the function that calls `entry` has to put the operand bundle there.

So the operand bundle is not part of the function type, although there is the correlation that if a function is `convergent`, you have to call it with a `convergencectrl` bundle or else there's undefined behavior as stated in the very next paragraph. At least for our purposes, having it not be part of the function type is perfectly fine.
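
Roughly the shape I mean, as a sketch (function names are hypothetical): the bundle lives at the call site, and the callee recovers the token through `entry`; nothing about it appears in the callee's function type.

```
declare token @llvm.experimental.convergence.entry() convergent
declare void @convergent.op() convergent

; Convergent callee: picks up the caller's set of threads via the entry
; intrinsic; its function type says nothing about convergence tokens.
define void @callee() convergent {
  %inherited = call token @llvm.experimental.convergence.entry()
  call void @convergent.op() [ "convergencectrl"(token %inherited) ]
  ret void
}

; The caller supplies the set of threads through the operand bundle at the
; call site; calling a convergent function without the bundle is UB.
define void @caller() convergent {
  %tok = call token @llvm.experimental.convergence.entry()
  call void @callee() [ "convergencectrl"(token %tok) ]
  ret void
}
```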


================
Comment at: llvm/docs/ConvergentOperations.rst:293
+convergence token that represents uniform control flow, i.e. that is guaranteed
+to refer to all threads within a (target- or environment-dependent) group.
+
----------------
t-tye wrote:
> As mentioned above, is this dependent on the language semantics?
Kind of, though I would expect all this dependence to be captured by the target triple or some other environment factors, e.g. the calling convention used by a kernel entry point. That's why this particular document doesn't need to say anything here, at least formally.


================
Comment at: llvm/docs/ConvergentOperations.rst:299
+Behavior is undefined if this intrinsic appears inside of another convergence
+region or outside of a function's entry block.
+
----------------
efriedma wrote:
> Could we get away without the "outside of a function's entry block" restriction?  It seems sort of inconvenient that transforming a select to an if-then-else requires scanning the entire basic block.  I guess we have to do that scan anyway, though, given the way alloca is defined, so maybe not a big deal.
@arsenm:
> Is it legal for this to be called multiple times in the same function?

Yes, subject to the constraints listed here.

@t-tye:
> What is "another convergence region"? These tokens are deliberately not lexical scopes so they can describe unstructured control flow. So what is a "region" in this sense? [...]

This is defined later in the document. I'm going to add a proper link.

@efriedma
> Could we get away without the "outside of a function's entry block" restriction? It seems sort of inconvenient that transforming a select to an if-then-else requires scanning the entire basic block. I guess we have to do that scan anyway, though, given the way alloca is defined, so maybe not a big deal.

Right, the "only in the entry block" rule came about specifically by analogy with `alloca`s. In an early version, I only had "must not appear in a cycle", because that's all you need for the definition of convergence rules to work out.

However, function inlining then becomes more complicated because the entire inline function would have to be scanned for `entry` intrinsics. With the restriction to the entry block, we can just piggyback on the existing handling of `alloca`s. The same should apply for select-to-if/else. So it's largely a pragmatic choice.
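
As a sketch of what the restriction buys us (placeholder names): the `entry` call sits in the entry block next to any static `alloca`s, so the single entry-block scan that inlining already does for `alloca`s also finds it.

```
declare token @llvm.experimental.convergence.entry() convergent
declare void @convergent.op() convergent

define void @f() convergent {
entry:
  ; Both of these are confined to the entry block, so one scan of this
  ; block covers them during inlining.
  %buf = alloca i32
  %tok = call token @llvm.experimental.convergence.entry()
  br label %body

body:
  call void @convergent.op() [ "convergencectrl"(token %tok) ]
  ret void
}
```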


================
Comment at: llvm/docs/ConvergentOperations.rst:347-351
+   execute the same dynamic instance of U if and only if
+   (1) they obtained the ``convergencectrl`` token operand value from the same
+   dynamic instance of the defining instruction, and
+   (2) there is an *n* such that both threads execute U for the *n*'th time
+   with that same token operand value.
----------------
t-tye wrote:
> ```
> if:
> 
>   1. They obtained the ``convergencectrl`` token operand value from the same dynamic instance of the defining instruction, and
>   2. There is an *n* such that both threads execute U for the *n*'th time with that same token operand value.
> ```
> Say you have a loop with a non-uniform trip count; does this mean the threads are allowed to communicate for the iterations that both threads execute?

Yes -- allowed to, and must. (I.e., this prevents unrolling with remainder, as is written later.)
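
For concreteness, a sketch of that situation (loop body and trip count are placeholders): the heart pairs up the n'th iterations of the threads that share the anchor, which is why unrolling with a remainder loop would break it.

```
declare token @llvm.experimental.convergence.anchor() convergent
declare token @llvm.experimental.convergence.loop() convergent
declare void @convergent.op() convergent

define void @f(i32 %n) {
entry:
  %anchor = call token @llvm.experimental.convergence.anchor()
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  ; Two threads communicate in @convergent.op on iteration n iff they both
  ; execute this heart for the n'th time with the same %anchor value.
  %heart = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
  call void @convergent.op() [ "convergencectrl"(token %heart) ]
  %i.next = add i32 %i, 1
  %cc = icmp ult i32 %i.next, %n   ; %n may differ between threads
  br i1 %cc, label %loop, label %end

end:
  ret void
}
```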


================
Comment at: llvm/docs/ConvergentOperations.rst:365
+   due to a call from IR, then the thread cannot "spontaneously converge" with
+   threads that execute the function for some other reason.)
+
----------------
efriedma wrote:
> I'd prefer to define the "call stack" abstraction in a way that doesn't assume the whole world is LLVM IR.
@efriedma
> I'd prefer to define the "call stack" abstraction in a way that doesn't assume the whole world is LLVM IR.

It's already **explicitly** written in a way that doesn't assume the whole world is LLVM IR. The rule **only** makes a statement about what happens when the function is called from LLVM IR, and leaves open what happens if the function is called through some other mechanism. I don't see what else we can do here.


================
Comment at: llvm/docs/ConvergentOperations.rst:368
+4. Target-specific rules determine whether two threads execute the same
+   dynamic instance of an uncontrolled convergent operation.
+
----------------
efriedma wrote:
> "uncontolled divergent operation", meaning a convergent operation without a token?  Can you just say that's outside the scope of this document earlier, where you say it's deprecated?
Yes, going to make essentially that change.


================
Comment at: llvm/docs/ConvergentOperations.rst:377-380
+2. Every cycle in the CFG that contains two or more static uses of a
+   convergence token by
+   :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
+   must also contain the definition of the token.
----------------
t-tye wrote:
> Is this intended to say:
> 
> ```
> 2. Every cycle in the CFG that contains two or more static uses of a convergence token T by :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>` must also contain the definition of T.
> ```
> 
> Or could T be different for each use?
> 
> Suggest a similar change to the previous rule to make it clearer.
Seems reasonable, will do.


================
Comment at: llvm/docs/ConvergentOperations.rst:382-383
+
+3. The *convergence region* corresponding to a convergence token T is the
+   region in which T is live (i.e., the subset of the dominance region of the
+   definition of T from which a use of T can be reached without leaving the
----------------
t-tye wrote:
> "the minimal region in which T is live and used"
> 
> Should this be clarified that it is the minimal live region (in the same way that phi nodes can be minimally created). Another interpretation of "live" allows the value to be live outside the dominance region.
> 
> The "(i.e. ...)" is not really a "namely". It is actually part of the definition of what "the region in which T is live" means unless the above change (or similar) is made.
> 
> Does "dominance region" need defining? T may be in many nested dominance regions, I assume here it means the minimal one? The set of blocks that are dominated by the immediate dominator of the block containing T?
> 
> Then what is the subset of the dominance region being defined by "convergence region"? How can the use of T happen outside the dominance region? Wouldn't that imply a phi? But above it was stated tokens cannot be used in a phi. Does the subset respect the blocks that the use must pass through to reach the block containing the use?
> 
> Or is the definition only blocks that are dominated by the block containing the definition of T that also use T or are on a path from the definition to the use of T? Again, how can there be blocks on a path between the definition of T and and a use of T that are not dominated by the block containing the definition of T given that phi nodes are not allowed to specify a token?
> 
> Maybe more explanation is needed?
I'm going to try to rephrase that.


================
Comment at: llvm/docs/ConvergentOperations.rst:390-393
+The freedom of targets to define the target-specific rule about uncontrolled
+convergent operations is limited by the following rule: A transform is correct
+for uncontrolled convergence operations if it does not make such operations
+control-dependent on additional values.
----------------
t-tye wrote:
> Is this the legacy definition of convergence that is now deprecated? Would it be good to clarify that? Perhaps the legacy rules should be in a separate section so they do not get muddled with the new rules, and can be deleted once the deprecated support is removed.
Since two people suggested this, I'm going to move it.


================
Comment at: llvm/docs/ConvergentOperations.rst:407
+Threads that execute the same dynamic instance do not necessarily do so at the
+same time.
+
----------------
t-tye wrote:
> efriedma wrote:
> > You don't really define "same time" anywhere. That's probably outside the scope of this document anyway, but not sure referring to it here adds anything.
> I think there is value in mentioning this, but it should be an "informational note".
> 
> The insight of having this present is that it is the responsibility of the implementation to implement the "as if" semantics. This is comparable to the way the memory model presents an "as if" set of rules that the physical hardware may not in fact literally follow.
> 
> The point being that these rules can be implemented on systems that do not have physical SIMD/SIMT hardware. In such systems the dynamic instruction instances may not be executed at the same time, and other means are used to ensure the communication happens correctly (perhaps staging buffers). This is even true on SIMD/SIMT hardware if the set of threads is larger than the SIMD/SIMT instruction size, as it is, for example, when the subgroup size requires multiple waves/warps and scratchpad memory is used.
Right, that was exactly the intention here: make it plain as day to people that the requirement is only "as if" semantics, not literal lock-step execution. I'm going to prefix this with "Informational note".


================
Comment at: llvm/docs/ConvergentOperations.rst:531
+  if (condition) {
+    %tok = @call tok llvm.experimental.convergence.anchor()
+    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
----------------
t-tye wrote:
> arsenm wrote:
> > %tok defined in both branches looks like broken SSA to me
> Which would mean a phi which is not allowed.
> 
> But again this is changing what set of threads %tok is denoting, so I feel I am not understanding what a convergent token is fundamentally denoting. My thinking had been that the convergent tokens were a way that the high-level language mapping to LLVM IR can communicate the language-mandated convergence rules. But these examples seem to dispel that notion and make it a target-dependent concept unrelated to the source language.
You're overanalyzing this. It's just a weird mash-up of C-like if-statements with LLVM IR-like notation, and I didn't think about the ways it could be interpreted. I'm going to rename the variables to disambiguate.
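
Something along these lines, keeping the C-like/IR pseudo-code style of the example but with distinct token names and the call syntax straightened out (the new names are only illustrative):

```
  if (condition) {
    %tok1 = call token @llvm.experimental.convergence.anchor()
    call void @convergent.operation() [ "convergencectrl"(token %tok1) ]
  } else {
    %tok2 = call token @llvm.experimental.convergence.anchor()
    call void @convergent.operation() [ "convergencectrl"(token %tok2) ]
  }
```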


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603


