[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Wed Sep 23 09:27:01 PDT 2020

nhaehnle added inline comments.

================
Comment at: llvm/docs/ConvergentOperations.rst:247-251
+
+The expectation is that all threads within a group that "happen to be active at
+the same time" will execute the same dynamic instance, so that programs can
+detect the maximal set of threads that can communicate efficiently within
+some local region of the program.
----------------
sameerds wrote:
> sameerds wrote:
> > nhaehnle wrote:
> > > dsanders wrote:
> > > > nhaehnle wrote:
> > > > > t-tye wrote:
> > > > > > So where should it be defined what the set of threads should be? It seems it is not a target dependent concept as the target must implement the semantics of the programming language. So should each clang_lang define the initial set of threads at the construct that denotes the beginning of execution of of the various constructs? For example, an OpenCL kernel, a CUDA __device__ function, or a C/C++ `main` function.
> > > > > > 
> > > > > > Presumably later text will define how the set of threads is passed between a call site and a called function?
> > > > > > 
> > > > > > "happen to be active at the same time" does not seem the right sentiment. The programing language semantics will dictate what the set is. For example, OpenCL may define the set to be the work-items that are members of the same subgroup as defined by the OpenCL language. It is not all the work-items that start executing the dispatch grid as a whole which may reasonably also be considered to "happen to be active at the same time". So may be this needs to admit that the language prescribes the set? Then a reference to the language specific page that defines this in a "ClangRef" document?
> > > > > For what you have in mind, you want to be looking at the `entry` intrinsic instead of the `anchor` intrinsic.
> > > > > 
> > > > > The `entry` intrinsic is used to form a relation with the group of converged threads at function entry, which for the kernel entry point would be the entire wave/workgroup/subgroup. For a called function, it would be the set of threads indicated by the `convergencectrl` operand bundle at the call site.
> > > > > 
> > > > > The `anchor` is there for us to explicitly be able to express: we don't care which threads go together; all we care about is that the operations that refer to the same anchor are executed with the same set of threads (subject to control flow and all the other rules).
> > > > I feel like there's something I'm missing here. This sounds like:
> > > > ```
> > > > if (condition1) {
> > > >   %token = anchor()
> > > >   if (condition2) {
> > > >      ...
> > > >   }
> > > >   sum() convergencectrl(%token)
> > > > }
> > > > ```
> > > > can be rewritten to:
> > > > ```
> > > > if (condition1) {
> > > >   if (condition2) {
> > > >     %token = anchor()
> > > >      ...
> > > >     sum() convergencectrl(%token)
> > > >   }
> > > > }
> > > > ```
> > > > which made sense at first given statements like `we don't care which threads go together`, but we also have no way of saying that we did care which threads go together unless we also say that it must be the same as the threads from function entry. I'd originally expected that this would be allowed:
> > > > ```
> > > > if (condition1) {
> > > >   %token = entry()
> > > >   if (condition2) {
> > > >      ...
> > > >   }
> > > >   sum() convergencectrl(%token)
> > > > }
> > > > ```
> > > > and would prevent sinking into or hoisting out of either if-statement but your reply here seems to indicate that's not allowed. How do convergence tokens prevent hoisting/sinking for this case?
> > > > 
> > > > Having read a bit further and thought about it a bit more, I suspect what I'm missing is that anchor() is as immobile as it's name would suggest. However I haven't seen anything say it's immobile and things like `we don't care which threads go together` and `the code does not care about the exact set of threads with which it is executed` give me the impression that it can sink/hoist as long as the consumers of the token do too. My main thought that undermines my original reading is that if it can move then there'd be nothing stopping me deleting it either as I could always invent a `if(false) { ... }` to sink it all into.
> > > That transform is allowed (assuming that sinking the user of the result of the `sum()` is also possible). Though either way, an implementation is free to isolate individual threads, i.e. in your example, the result of `sum` could just be replaced by the value you're summing over so that each thread just gets its own value. This may seem useless at first, but it is the point of the anchor :)
> > > 
> > > If you want the set of threads to have some fixed relation to something external (like a compute workgroup or full Vulkan subgroup), you need to use `entry` instead of `anchor`.
> > > 
> > > `anchor` is still useful, as long as you have multiple things anchored to it. It will then ensure that they are relatively consistent to each other.
> > If I understand this right, then even `entry` does not capture anything specific ... it is merely a place holder for the `anchor` at the callsite of a function. This matters, for example, when the call is inside a loop and the frontend is trying to specify something in terms of the threads that together enter the loop. The `entry` at the start of a kernel is almost the same as an `anchor`, except the definition of threads that see the same dynamic instance is coming from the  language above rather than the implementation below.
> > 
> > The end result is that none of these intrinsics can be used to dictate how the implementation must preserve threadgroups. They can only be used to "lift" the concurrent execution that already exists in the target to a form that can  constrain transformations in the compiler.
> > 
> > Is that correct?
> Just realized that this is not true: "The entry at the start of a kernel is almost the same as an anchor", but the rest still seems to hold.
> The end result is that none of these intrinsics can be used to dictate how the implementation must preserve threadgroups. They can only be used to "lift" the concurrent execution that already exists in the target to a form that can constrain transformations in the compiler.

Probably? I'm not sure I agree with the exact wording. In a compute kernel, the `entry` intrinsic preserves the set of threads (workgroup/threadgroup/block) that are launched together, where "together" is parameterized by the scope you care about (dispatch/workgroup/subgroup/wave/whatever you call it). `loop` intrinsics controlled by the resulting token value in turn preserve that set of threads modulo divergent exits from the loop. And so on.

So I'd state it as: the intrinsics cannot enforce any grouping that wasn't there before, they can **only** enforce preservation of groupings.

I hope that's what you meant, just with different words? :)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603