[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Daniel Sanders via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Sep 8 22:04:34 PDT 2020


dsanders added a comment.

I've only read up to `Formal Rules`, so later sections might change things, but I figure it's potentially useful to see a reader's thoughts mid-read. I'm pretty sure I've misunderstood the `anchor` intrinsic based on what I've read of the doc and comments so far.



================
Comment at: llvm/docs/ConvergentOperations.rst:27-29
+A convergent operation involves inter-thread communication outside of the
+memory model, where the set of threads which participate in communication is
+implicitly affected by control flow.
----------------
This is rather nit-picky, but there are some convergent operations where inter-thread communication isn't happening, depending on how you model it. For example, a population count could be modelled as threads communicating (a sum of 0-or-1 responses), which fits the definition here, but it could also be modelled as threads optionally communicating (a count of responses received), or as an external thread-manager broadcasting its count to the threads. Communication is still happening in all three models, but the second and third stretch the definition a bit.

I don't think it's worth bogging down the main text for that nitpick, but it might be worth clarifying in a footnote or similar that receiving or sending any data from, to, or about another thread counts as communication, and that declining to communicate also counts as communication if it affects the outcome.
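To make the nitpick concrete, here's a rough sketch of the three models in the same pseudocode style as the examples below (the function names are illustrative, not real intrinsics):
```
// model 1: every thread sends 0 or 1 and receives the sum - plainly communication
%n = sum_across_threads(1)

// model 2: only some threads send; each thread's result is the count of
// messages it received, so declining to send still affects the outcome
%n = count_of_senders()

// model 3: an external thread-manager broadcasts its count; receiving data
// *about* other threads is still communication
%n = receive_manager_count()
```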


================
Comment at: llvm/docs/ConvergentOperations.rst:140-143
+forbidden. (A program transform can still sink the call if it can prove somehow,
+e.g. by leaning on target-specific callbacks that can analyze the program with
+additional knowledge, that ``%condition`` is always uniform across the threads
+referenced by the *convergence token* ``%entry``.)
----------------
I think this is a little misleading. IIUC, and assuming that the sets of communicating threads are quads as mentioned above, `%condition` doesn't need to be uniform across all the threads referenced by `%entry`. The only use is inside the `then:` block, so I would expect that communicating groups for which `%condition` is uniformly false don't need to be considered: their result will not be used by any thread that enters `then:`. As you're trying to leave methods out, it's probably best left at `... with additional knowledge, that it doesn't change the result`
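For concreteness, the situation I have in mind is roughly this (a simplified sketch reconstructed from the document's example, not the exact IR):
```
%entry = entry()
%v = textureSample(...) convergencectrl(%entry)  // runs with all threads in %entry
if (%condition) {
  use(%v)                                        // only use is under %condition
}
// Sinking textureSample into the if-block can only change the observed result
// for communicating groups where %condition is mixed; groups where it is
// uniformly false never observe %v at all.
```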

The reason I bring this up is that I think it's worth thinking about how a generic transform, or an IR-level/gMIR-level/MIR-level target transform, would perform this change if it did understand convergence. To be clear, I'm not talking about the property it proves or the method by which it proves it. I mean: how would such a transform know what to prove, and when to try?

For MIR and intrinsics, the answer seems obvious. The backend simply knows more about the instructions'/intrinsics' convergence than `convergencectrl` declares and can use that information instead. Once it recognizes an instruction/intrinsic as one it knows more about, it can try to prove whatever property it needs. However, outside of those special cases there doesn't seem to be a way to know what to prove or when to try, even for a target-specific pass. To use the above example, if `@textureSample` were a non-intrinsic function with the same properties you describe, I don't think it would be possible to know any better than what `convergencectrl` declares, which prevents the analysis the sinking transform would depend on. It's arguably out of scope for this doc, but do you foresee convergence tokens and the `convergent` attribute becoming finer-grained in future to support earlier or more target-independent transforms on convergent operations? Do you have any thoughts on how that would be done?


================
Comment at: llvm/docs/ConvergentOperations.rst:211-213
+Again, hoisting is allowed if it can be proven that ``%cc`` is always uniform
+among the relevant set of threads: in that case, the ``@subgroupAdd`` already
+communicates among the full set of threads in the original program.
----------------
Should we also mention that it's valid when `%cc` is non-uniform, so long as the same effect is achieved by other means? In this particular example, additional communication is fine so long as we ensure unintended threads contribute 0 to the sums (e.g. by masking `%delta` with `%cc` first). In other words, it's not the actual communication we need to keep consistent but the effects (and side effects) of that communication.
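A sketch of the masking idea, in the same pseudocode style as the other examples (names illustrative):
```
// original: only threads with %cc true participate in the sum
if (%cc) {
  %sum = subgroupAdd(%delta) convergencectrl(%token)
}

// hoisted: all threads participate, but unintended threads contribute 0,
// so every thread that previously entered the if-block observes the same sum
%masked = %cc ? %delta : 0
%sum = subgroupAdd(%masked) convergencectrl(%token)
```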


================
Comment at: llvm/docs/ConvergentOperations.rst:247-251
+
+The expectation is that all threads within a group that "happen to be active at
+the same time" will execute the same dynamic instance, so that programs can
+detect the maximal set of threads that can communicate efficiently within
+some local region of the program.
----------------
nhaehnle wrote:
> t-tye wrote:
> > So where should it be defined what the set of threads should be? It seems it is not a target-dependent concept, as the target must implement the semantics of the programming language. So should each clang_lang define the initial set of threads at the construct that denotes the beginning of execution of the various constructs? For example, an OpenCL kernel, a CUDA __device__ function, or a C/C++ `main` function.
> > 
> > Presumably later text will define how the set of threads is passed between a call site and a called function?
> > 
> > "happen to be active at the same time" does not seem the right sentiment. The programming language semantics will dictate what the set is. For example, OpenCL may define the set to be the work-items that are members of the same subgroup as defined by the OpenCL language. It is not all the work-items that start executing the dispatch grid as a whole, which may reasonably also be considered to "happen to be active at the same time". So maybe this needs to admit that the language prescribes the set? Then a reference to the language-specific page that defines this in a "ClangRef" document?
> For what you have in mind, you want to be looking at the `entry` intrinsic instead of the `anchor` intrinsic.
> 
> The `entry` intrinsic is used to form a relation with the group of converged threads at function entry, which for the kernel entry point would be the entire wave/workgroup/subgroup. For a called function, it would be the set of threads indicated by the `convergencectrl` operand bundle at the call site.
> 
> The `anchor` is there for us to explicitly be able to express: we don't care which threads go together; all we care about is that the operations that refer to the same anchor are executed with the same set of threads (subject to control flow and all the other rules).
I feel like there's something I'm missing here. This sounds like:
```
if (condition1) {
  %token = anchor()
  if (condition2) {
    ...
  }
  sum() convergencectrl(%token)
}
```
can be rewritten to:
```
if (condition1) {
  if (condition2) {
    %token = anchor()
    ...
    sum() convergencectrl(%token)
  }
}
```
which made sense at first given statements like `we don't care which threads go together`, but then we have no way of saying that we *do* care which threads go together, short of saying the set must be the same as the threads from function entry. I'd originally expected that this would be allowed:
```
if (condition1) {
  %token = entry()
  if (condition2) {
    ...
  }
  sum() convergencectrl(%token)
}
```
and would prevent sinking into or hoisting out of either if-statement, but your reply here seems to indicate that's not allowed. How do convergence tokens prevent hoisting/sinking in this case?

Having read a bit further and thought about it a bit more, I suspect what I'm missing is that `anchor()` is as immobile as its name would suggest. However, I haven't seen anything say it's immobile, and statements like `we don't care which threads go together` and `the code does not care about the exact set of threads with which it is executed` give me the impression that it can sink/hoist as long as the consumers of the token do too. The main thought that undermines my original reading: if it can move, then there'd be nothing stopping me from deleting it either, as I could always invent an `if (false) { ... }` to sink it all into.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603


