[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Justin Lebar via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Oct 28 23:34:02 PDT 2020


jlebar added a subscriber: wash.
jlebar added a comment.

Hi.  :)  A few people pinged me asking for my feedback here, since I touched the convergent attr way back in the day, for CUDA.

I'm going to try to give feedback anyway, with the caveat that there's a huge amount of discussion here and I haven't been able to read the whole thread's worth of context.  It's a lot.  Apologies in advance if I bring up things that have already been discussed.

I strongly agree that convergent as-is has problems.  Fixing them is clearly complicated, and it seems like a lot of work has gone into this proposal.

I have been out of it for too long to feel comfortable signing off on whether this proposal fixes the problems with convergent.  The proposal seems reasonable to me, but as we saw with e.g. undef/poison, these things can be extremely subtle.

I'm also not comfortable speaking to whether this representation will be ergonomic in the relevant LLVM passes.

What I'm more comfortable speaking to is:

- Is the proposal clear to me?

I think the proposal is clear, modulo a few comments from me (few relative to the length of the patch, anyway).  This kind of writing is really tricky; the fact that I could mostly follow it is a credit to the authors.  I thought the extensive examples were really helpful.

- Is it clear how to modify clang's CUDA frontend to use this new form?

It's not perfectly clear to me how to do this.  Is it as simple as saying: loops always have a `convergence.loop()` intrinsic at the top, functions always have `convergence.entry()` at the top, and that's it (something like the sketch below)?  If you & co. aren't planning to do this work (I know the CUDA frontend shares a lot of code with the HIP frontend), I'd want to be sure that the people who *are* going to do this work (@tra?) are clear on what needs to be done and think it's possible.
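
For concreteness, here's roughly what I'd imagine clang emitting -- my own sketch using the intrinsic and bundle names from this patch (`@kernel`, `@convergent_op`, and `@keep_going` are made-up names), so take the details with a grain of salt:

```
declare token @llvm.experimental.convergence.entry()
declare token @llvm.experimental.convergence.loop()
declare void @convergent_op() convergent
declare i1 @keep_going()

define void @kernel() convergent {
entry:
  ; every convergent function starts with the entry intrinsic
  %entry_tok = call token @llvm.experimental.convergence.entry()
  br label %loop

loop:
  ; every loop header starts with the loop intrinsic
  %loop_tok = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %entry_tok) ]
  call void @convergent_op() [ "convergencectrl"(token %loop_tok) ]
  %cond = call i1 @keep_going()
  br i1 %cond, label %loop, label %exit

exit:
  ret void
}
```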

- Will this paint us into a corner wrt CUDA, and specifically sm70+?

/me summons @wash, who is probably a better person to speak to this than me.

My understanding is that the semantics of convergent on pre-sm70 GPUs are pretty similar to what is described in these examples.  But starting with sm70, each sync operation takes an argument specifying which threads in the warp participate in the instruction.

I admit I do not fully understand what the purpose of this is.  At one point in time I thought it was to let humans write (or compilers generate) code like this, where the identity of the convergent instruction does not matter.

  // Warning, does not seem to work on sm75
  if (cond)
    __syncwarp(FULL_MASK);
  else
    __syncwarp(FULL_MASK);

but my testcase, https://gist.github.com/50d1b5fedc926c879a64436229c1cc05, dies with an illegal-instruction error (715) when I make `cond` have different values within the warp.  So, guess not?

Anyway, clearly I don't fully understand the sm70+ convergence semantics.  I'd ideally like someone from nvidia (hi, @wash) to speak to whether we can represent their convergent instruction semantics using this proposal.  Then we should also double-check that clang can in fact generate the relevant LLVM IR.

Hope this helps.



================
Comment at: llvm/docs/ConvergentOperations.rst:27-29
+A convergent operation involves inter-thread communication outside of the
+memory model, where the set of threads which participate in communication is
+implicitly affected by control flow.
----------------
nhaehnle wrote:
> dsanders wrote:
> > This is rather nit-picky, but there are some convergent operations where inter-thread communication isn't happening, depending on how you model it. For example, a population count could be modelled as threads communicating (sum of 0 or 1 responses), which fits the definition here, but it could also be modelled as threads optionally communicating (count of responses received), or as an external thread-manager broadcasting its count to the threads. Either way, communication is still happening, but the second and third models are stretching the definition a bit.
> > 
> > I don't think it's worth bogging down the main text for that nitpick but it might be worth clarifying in a footnote or something that receiving/sending any data from, to, or about another thread counts as communication. Also, declining to communicate counts as communication if it affects the outcome.
> That's a fair point. The way I'm thinking about this is that there may be communication with a `void` payload, but ultimately this can be bikeshed to death.
CUDA `__syncthreads()` is the prototypical convergent function (at least, it was -- maybe under this definition it's not?), but syncthreads does not exchange any information.  It's just a barrier.

Assuming you still consider syncthreads to be convergent, my concern is someone would read this and (quite reasonably) think that we are incorrectly modeling it as convergent.

> The way I'm thinking about this is that there may be communication with a void payload,

If you count "communicate nil" as communication, then perhaps what we're describing is not, in fact, "communication", but rather "communication or synchronization"?  Perhaps:

> A convergent operation involves inter-thread communication or synchronization that occurs outside of the memory model, where the set of threads which participate in the inter-thread operation is implicitly affected by control flow.
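
FWIW, under that wording I'd picture `__syncthreads()` looking like this -- my own sketch, assuming it keeps lowering to NVPTX's `@llvm.nvvm.barrier0`:

```
declare token @llvm.experimental.convergence.entry()
declare void @llvm.nvvm.barrier0()   ; convergent, but exchanges no data

define void @kernel() convergent {
entry:
  %tok = call token @llvm.experimental.convergence.entry()
  ; pure synchronization: "communication with a void payload", if you like
  call void @llvm.nvvm.barrier0() [ "convergencectrl"(token %tok) ]
  ret void
}
```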


================
Comment at: llvm/docs/ConvergentOperations.rst:88
+    ...
+    color = textureSample(texture, coordinates);
+    if (condition) {
----------------
Up to you, but I think this example would be more evocative if we wrote out the definition of textureSample.  I am imagining that it involves something like a `__shfl`, but that's because I already understand GPUs.  Your audience is bigger than that.
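
Maybe something in this direction (all names here are made up by me -- `@lane_shuffle` is a stand-in for whatever `__shfl`-like cross-lane primitive the target actually uses, not a real intrinsic):

```
; Hypothetical: @lane_shuffle stands in for a __shfl-like cross-lane
; read; it is not a real intrinsic.
declare float @lane_shuffle(float, i32) convergent

define float @textureSample(float %coord) convergent {
  ; Read a neighboring lane's coordinate to estimate the screen-space
  ; derivative; the derivative picks the mipmap level.  This is why it
  ; matters exactly which threads participate.
  %neighbor = call float @lane_shuffle(float %coord, i32 1)
  %dx = fsub float %neighbor, %coord
  ; ... select a mip level from %dx and do the actual fetch ...
  ret float %dx
}
```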


================
Comment at: llvm/docs/ConvergentOperations.rst:220
+Consider an example of how jump threading removes structure in a way that can
+make semantics non-obvious:
+
----------------
Nit: Clarify that this example isn't using the proposed convergence intrinsics?  Perhaps

> Consider an example of how jump threading removes structure in a way that can make semantics non-obvious without the convergence intrinsics described in this document.


================
Comment at: llvm/docs/ConvergentOperations.rst:249
+  entry:
+      br i1 %cond1, label %then1, label %then2
+
----------------
Nit: Add ellipsis above this line, or remove it in the equivalent spot in the original code?


================
Comment at: llvm/docs/ConvergentOperations.rst:312
+actually reaches the call site. This set of threads doesn't change after
+jump-threading, so the answer to the question posed above remains the same.
+
----------------
This paragraph really clarifies for me what's going on.  +1


================
Comment at: llvm/docs/ConvergentOperations.rst:347
+
+  define i32 @reserveSpaceInBuffer() {    ; NOTE: _not_ a convergent function!
+  entry:
----------------
...wait, there are such things as convergent functions?  This is the first I'm hearing about it in the doc!  So far it seemed there were only convergent *calls*.  What's a convergent function?  :)


================
Comment at: llvm/docs/ConvergentOperations.rst:498
+Behavior is undefined if the containing function was called from IR without
+a ``convergencectrl`` bundle.
+
----------------
Do you plan to check this in the verifier (insofar as possible -- I understand that it's not possible to check this for cross-TU calls)?


================
Comment at: llvm/docs/ConvergentOperations.rst:506
+Behavior is undefined if this intrinsic appears in a function that isn't
+``convergent``.
+
----------------
This one is a local property -- could we say that this makes the program ill-formed, instead of UB?


================
Comment at: llvm/docs/ConvergentOperations.rst:510
+:ref:`convergence region <convergence_region>` or outside of a function's entry
+block.
+
----------------
Again, could we say this makes the program ill-formed?  (At least the entry-block check, I'm not sure what a convergence region is, yet.)


================
Comment at: llvm/docs/ConvergentOperations.rst:594
+
+1. Let U be a controlled convergent operation other than
+   :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
----------------
Have we formally defined what a "controlled" convergent operation is?  Do you mean a `call` to a `convergent` function with a `"convergencectrl"` operand bundle?  (Say that?)
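
That is, if I've understood correctly, the distinction is just:

```
; controlled: the call carries a convergencectrl bundle
call void @f() [ "convergencectrl"(token %tok) ]

; uncontrolled: a convergent call with no bundle
call void @f()
```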


================
Comment at: llvm/docs/ConvergentOperations.rst:954
+all convergent operations refer back to an anchor inside the loop.
+For example (in pseudo-code):
+
----------------
In this section I would have found it helpful if we'd differentiated upfront between the three kinds of unrolling:

 - Partial unrolling of a loop with no known trip multiple (so, there's a "tail" that collects the remaining elements)
 - Partial unrolling by a trip multiple (so there's no "tail")
 - Full unrolling, which eliminates the loop

I think you're saying that only the first kind of unrolling is tricky.
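
E.g., for unrolling by a factor of 2, in the doc's pseudo-code style:

```
// 1. Unknown trip multiple: needs a remainder "tail".
while (counter >= 2) { body(); body(); counter -= 2; }
if (counter > 0) { body(); }   // only some threads execute the tail

// 2. Trip count known to be a multiple of 2: no tail.
while (counter > 0) { body(); body(); counter -= 2; }

// 3. Full unrolling: the loop disappears entirely.
```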


================
Comment at: llvm/docs/ConvergentOperations.rst:981-982
+
+This is likely to change the behavior of the convergent operation if there
+are threads whose initial counter value is not a multiple of 2. That is allowed
+because the anchor intrinsic has implementation-defined convergence behavior
----------------
It would help me if we could elaborate, with half a sentence, on what the behavior change might be.


================
Comment at: llvm/docs/ConvergentOperations.rst:987-988
+
+If the loop contains uncontrolled convergent operations, this unrolling is
+forbidden.
+
----------------
Do you mean that this *kind of* unrolling is forbidden?

But if you're going to forbid *all* unrolling of loops with uncontrolled convergent ops...that's going to make CUDA code a lot slower.  Unless you're also going to fix clang, in which case, no objections, but maybe you want to say "will be forbidden once we've updated front-ends"?


================
Comment at: llvm/docs/ConvergentOperations.rst:998
+  while (counter > 0) {
+    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
+    ; (B)
----------------
One thing I don't get from this example is what I should do as a *frontend* to LLVM.  That is, when should I use this form, and when should I put a new anchor inside the loop instead?

It seems to me that in (say) CUDA, the compiler can ~never insert an anchor, because inserting an anchor is tantamount to allowing arbitrary divergence right before the anchor.  That is, I have to behave as though the compiler could transform

```
anchor();
convergent_fn();
```

into, effectively

```
if (threadIdx.x % 2 == 0) {
  anchor();
  convergent_fn();
} else {
  anchor();
  convergent_fn();
}
```

Something like this?

OK, so I always have to use the convergence.loop() form.  But then this is saying I can never unroll.

ITYM that with convergence.loop(), I can never *partially unroll with a "tail"*, which makes a lot of sense?  But it would help me if we were explicit about that; see the sketch below for what I think is going on.
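
Spelling out my understanding (my own sketch, in the doc's pseudo-code style):

```
%outer = call token @llvm.experimental.convergence.entry()
while (counter >= 2) {
  %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
  convergent_op() [ "convergencectrl"(token %inner) ]
  convergent_op() [ "convergencectrl"(token %inner) ]
  counter -= 2;
}
if (counter > 0) {
  // Tail: threads with an odd trip count execute their last original
  // iteration here, so corresponding iterations no longer line up
  // across threads -- IIUC that's what makes this transform invalid.
  %tail = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
  convergent_op() [ "convergencectrl"(token %tail) ]
}
```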


================
Comment at: llvm/docs/ConvergentOperations.rst:1032
+  %outer = call token @llvm.experimental.convergence.anchor()
+  while (counter > 0) {
+    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
----------------
`counter > 1`?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603


