[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Nicolai Hähnle via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Sep 15 06:08:39 PDT 2020


nhaehnle added inline comments.


================
Comment at: llvm/docs/ConvergentOperations.rst:27-29
+A convergent operation involves inter-thread communication outside of the
+memory model, where the set of threads which participate in communication is
+implicitly affected by control flow.
----------------
dsanders wrote:
> This is rather nit-picky, but there are some convergent operations where inter-thread communication isn't happening, depending on how you model it. For example, a population count could be modelled as threads communicating (sum of 0 or 1 responses), but it could also be modelled as threads optionally communicating (count of responses received), or as an external thread-manager broadcasting its count to the threads. Either way, communication is still happening, but the second and third models stretch the definition a bit.
> 
> I don't think it's worth bogging down the main text for that nitpick but it might be worth clarifying in a footnote or something that receiving/sending any data from, to, or about another thread counts as communication. Also, declining to communicate counts as communication if it affects the outcome.
That's a fair point. The way I'm thinking about this is that there may be communication with a `void` payload, but ultimately this can be bikeshed to death.


================
Comment at: llvm/docs/ConvergentOperations.rst:140-143
+forbidden. (A program transform can still sink the call if it can prove somehow,
+e.g. by leaning on target-specific callbacks that can analyze the program with
+additional knowledge, that ``%condition`` is always uniform across the threads
+referenced by the *convergence token* ``%entry``.)
----------------
dsanders wrote:
> I think this is a little misleading. IIUC, and assuming that the sets of communicating threads are quads as mentioned above, `%condition` doesn't need to be uniform across all the threads referenced by `%entry`. The only use is inside the `then:` block, so I would expect that communicating threads for which `%condition` is uniformly false don't need to be considered, as their result will not be used by any thread that enters `then:`. Since you're trying to leave methods out, it's probably best left at `... with additional knowledge, that it doesn't change the result`.
> 
> The reason I bring this up is that I think it's worth thinking about how a generic transform, or an IR-level/gMIR-level/MIR-level target transform would perform this transform if it did understand convergence. To be clear, I'm not talking about the property it proves or the method by which it proves it. I mean: How would such a transform know what to prove and when to try?
> 
> For MIR and intrinsics, the answer seems obvious. The backend simply knows more about the convergence of the instructions/intrinsics than `convergencectrl` declares and can use that information instead. Once it recognizes an instruction/intrinsic as one it knows more about, it can try to prove whatever property it needs. However, outside of those special cases there doesn't seem to be a way to know what to prove or when to try, even for a target-specific pass. To use the above example, if `@textureSample` were a non-intrinsic function with the same properties you describe, I don't think it would be possible to know any better than what `convergencectrl` declares, preventing the analysis the sinking transform would depend on. It's arguably out of scope for this doc, but do you foresee convergence tokens and the `convergent` attribute becoming finer-grained in the future to support earlier or more target-independent transforms on convergent operations? Do you have any thoughts on how that would be done?
I can clean up the text.

As for the question of how generic transforms could do better in the future: the way I see it, this would involve divergence analysis. If `%condition` is uniform (in a suitably defined sense), then sinking the `@textureSample` is okay since it doesn't change the relevant set of threads. The downside is that divergence analysis tends to be relatively expensive. It's worth exploring whether it can be computed incrementally and preserved.

This particular example is an interesting one since it shows that scopes matter: on typical hardware, you really only need uniformity of `%condition` at the `quad` scope. I think that's worth exploring at some point, but it's definitely something to leave for later. I don't think there's anything in this proposal that would inherently prevent it.
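
For concreteness, the shape of the example under discussion is roughly the following. This is only a sketch: the intrinsic and operand-bundle names are from the proposal, while `@textureSample`, `%condition`, and the block structure are illustrative:

```
%entry = call token @llvm.experimental.convergence.entry()
%data  = call T @textureSample(...) [ "convergencectrl"(token %entry) ]
br i1 %condition, label %then, label %end

then:
  ; Sinking @textureSample to here changes the set of communicating
  ; threads, so it is forbidden by default. It becomes legal only if
  ; %condition can be proven uniform (at the relevant scope, e.g. quad)
  ; across the threads referenced by %entry.
  ... use %data ...
  br label %end
```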


================
Comment at: llvm/docs/ConvergentOperations.rst:247-251
+
+The expectation is that all threads within a group that "happen to be active at
+the same time" will execute the same dynamic instance, so that programs can
+detect the maximal set of threads that can communicate efficiently within
+some local region of the program.
----------------
dsanders wrote:
> nhaehnle wrote:
> > t-tye wrote:
> > > So where should it be defined what the set of threads is? It seems it is not a target-dependent concept, as the target must implement the semantics of the programming language. So should each clang_lang define the initial set of threads at the construct that denotes the beginning of execution of the various constructs? For example, an OpenCL kernel, a CUDA __device__ function, or a C/C++ `main` function.
> > > 
> > > Presumably later text will define how the set of threads is passed between a call site and a called function?
> > > 
> > > "happen to be active at the same time" does not seem the right sentiment. The programming language semantics will dictate what the set is. For example, OpenCL may define the set to be the work-items that are members of the same subgroup as defined by the OpenCL language. It is not all the work-items that start executing the dispatch grid as a whole, which may reasonably also be considered to "happen to be active at the same time". So maybe this needs to admit that the language prescribes the set? Then a reference to the language-specific page that defines this in a "ClangRef" document?
> > For what you have in mind, you want to be looking at the `entry` intrinsic instead of the `anchor` intrinsic.
> > 
> > The `entry` intrinsic is used to form a relation with the group of converged threads at function entry, which for the kernel entry point would be the entire wave/workgroup/subgroup. For a called function, it would be the set of threads indicated by the `convergencectrl` operand bundle at the call site.
> > 
> > The `anchor` is there for us to explicitly be able to express: we don't care which threads go together; all we care about is that the operations that refer to the same anchor are executed with the same set of threads (subject to control flow and all the other rules).
> I feel like there's something I'm missing here. This sounds like:
> ```
> if (condition1) {
>   %token = anchor()
>   if (condition2) {
>     ...
>   }
>   sum() convergencectrl(%token)
> }
> ```
> can be rewritten to:
> ```
> if (condition1) {
>   if (condition2) {
>     %token = anchor()
>     ...
>     sum() convergencectrl(%token)
>   }
> }
> ```
> which made sense at first given statements like `we don't care which threads go together`, but then we have no way of saying that we do care which threads go together unless we also say that it must be the same as the threads from function entry. I'd originally expected that this would be allowed:
> ```
> if (condition1) {
>   %token = entry()
>   if (condition2) {
>     ...
>   }
>   sum() convergencectrl(%token)
> }
> ```
> and would prevent sinking into or hoisting out of either if-statement but your reply here seems to indicate that's not allowed. How do convergence tokens prevent hoisting/sinking for this case?
> 
> Having read a bit further and thought about it a bit more, I suspect what I'm missing is that anchor() is as immobile as its name would suggest. However, I haven't seen anything say it's immobile, and things like `we don't care which threads go together` and `the code does not care about the exact set of threads with which it is executed` give me the impression that it can sink/hoist as long as the consumers of the token do too. My main thought that undermines my original reading is that if it can move, then there'd be nothing stopping me from deleting it either, as I could always invent an `if(false) { ... }` to sink it all into.
That transform is allowed (assuming that sinking the user of the result of the `sum()` is also possible). Either way, an implementation is free to isolate individual threads; i.e., in your example, the result of `sum` could just be replaced by the value you're summing over, so that each thread just gets its own value. This may seem useless at first, but it is the point of the anchor :)

If you want the set of threads to have some fixed relation to something external (like a compute workgroup or full Vulkan subgroup), you need to use `entry` instead of `anchor`.

`anchor` is still useful as long as you have multiple things anchored to it. It then ensures that they are consistent relative to each other.
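
A minimal sketch of the contrast (intrinsic and bundle names as in the proposal; `@sum` and the values are illustrative):

```
; entry: ties the communicating threads to those that entered the
; function together (e.g. the kernel's wave/workgroup/subgroup).
%te = call token @llvm.experimental.convergence.entry()
%r0 = call i32 @sum(i32 %x) [ "convergencectrl"(token %te) ]

; anchor: the set of threads is implementation-defined (possibly a
; single thread), but both sums below are executed by the *same* set,
; so they are consistent relative to each other.
%ta = call token @llvm.experimental.convergence.anchor()
%r1 = call i32 @sum(i32 %y) [ "convergencectrl"(token %ta) ]
%r2 = call i32 @sum(i32 %z) [ "convergencectrl"(token %ta) ]
```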


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603


