[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Nicolai Hähnle via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Oct 30 02:42:50 PDT 2020


nhaehnle added a comment.

In D85603#2361168 <https://reviews.llvm.org/D85603#2361168>, @jlebar wrote:

> I'm going to try to give feedback, but with the caveat that there's a huge amount of discussion here, and with my apologies that I can't read the whole thread's worth of context.  It's a lot.  Sorry that I'm probably bringing up things that have already been discussed.

Thanks, and don't worry. A lot of the old comments no longer make sense because the document has changed, and unfortunately Phabricator shows them in nonsensical places.

[snip]

> - Is it clear how to modify clang's CUDA frontend to use this new form?
>
> It's not perfectly clear to me how to do this.  Is it as simple as saying, loops always have a `convergent.loop()` intrinsic at the top, functions always have `convergent.entry()` at the top, and that's it?  If you &co aren't planning to do this work (I know the CUDA frontend shares a lot of code with the HIP frontend), I'd want to be sure that the people who *are* going to do this work (@tra?) are clear on what needs to be done and think it's possible.

There are two kinds of answers to this. One is that you can only //really// know how the frontend should be modified once you've established what the high-level language semantics ought to be. Part of why I'm doing this work is to enable us to experiment with this kind of question and verify our understanding of what this should look like (with the caveat that I'm coming at it from the graphics side).

The other kind of answer is that for most but not all constructs there's a natural approach that boils down pretty much to what you wrote. Of course it generally breaks down in the face of `goto`, for example. I have a follow-on patch, D85609 <https://reviews.llvm.org/D85609>, which adds a pass that does this kind of insertion on top of LLVM IR. I'd appreciate your review on that if you find the time -- I think what it tries to do is fairly natural, but it is a bit more work to dig through. A reasonable first step for someone working on the CUDA frontend would be to insert that pass early in the pass pipeline. Longer term, it may be necessary to insert the intrinsics directly during IR generation, but that at least partially depends on the high-level language semantics question.
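To make that concrete, here is a minimal IR sketch of the inserted form (the kernel shape and `@convergent_op` are hypothetical placeholders, not code from D85609):

  define void @kernel(i1 %cond) convergent {
  entry:
    %entry_tok = call token @llvm.experimental.convergence.entry()
    br label %loop

  loop:
    ; each loop gets a loop intrinsic at the top, tied to the token of the
    ; enclosing loop or, as here, of the function entry
    %loop_tok = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %entry_tok) ]
    ; convergent operations inside the loop refer to the loop's token
    call void @convergent_op() [ "convergencectrl"(token %loop_tok) ]
    br i1 %cond, label %loop, label %exit

  exit:
    ret void
  }

  declare token @llvm.experimental.convergence.entry()
  declare token @llvm.experimental.convergence.loop()
  declare void @convergent_op() convergent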

> - Will this paint us into a corner wrt CUDA, and specifically sm70+?
>
> /me summons @wash, who is probably a better person to speak to this than me.
>
> My understanding is that the semantics of <sm70 convergent are pretty similar to what is described in these examples.  But starting in sm70+, each sync operation takes an arg specifying which threads in the warp participate in the instruction.
>
> I admit I do not fully understand what the purpose of this is.  At one point in time I thought it was to let humans write (or compilers generate) code like this, where the identity of the convergent instruction does not matter.
>
>   // Warning, does not seem to work on sm75
>   if (cond)
>     __syncwarp(FULL_MASK);
>   else
>     __syncwarp(FULL_MASK);
>
> but my testcase, https://gist.github.com/50d1b5fedc926c879a64436229c1cc05, dies with an illegal-instruction error (715) when I make `cond` have different values within the warp.  So, guess not?
>
> Anyway, clearly I don't fully understand the sm70+ convergence semantics.  I'd ideally like someone from nvidia (hi, @wash) to speak to whether we can represent their convergent instruction semantics using this proposal.  Then we should also double-check that clang can in fact generate the relevant LLVM IR.

I have trouble answering this as well, due to the lack of a proper specification from Nvidia, and I'm not set up to run this kind of experiment myself.

From a theory point of view, because those newer versions of the sync operations take that explicit arg, we shouldn't consider them to be convergent according to what's being defined here. Only the `__activemask()` builtin probably still needs to be considered convergent (also in light of https://bugs.llvm.org/show_bug.cgi?id=47210).
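To illustrate the distinction in IR terms (a rough sketch; the callee names are hypothetical stand-ins, not the actual NVVM intrinsic spellings):

  define void @example() {
    %tok = call token @llvm.experimental.convergence.anchor()

    ; The mask operand names the participating threads explicitly, so under
    ; the definition here the operation wouldn't need to be convergent or
    ; carry a token.
    call void @sync_with_mask(i32 -1)

    ; The result depends implicitly on which threads arrived here together,
    ; so this stays convergent and is tied to a convergence token.
    %mask = call i32 @activemask() [ "convergencectrl"(token %tok) ]
    ret void
  }

  declare void @sync_with_mask(i32)
  declare i32 @activemask() convergent
  declare token @llvm.experimental.convergence.anchor()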

The result of your experiment seems to contradict the theory. Having worked on this part of our compiler for a while now, I think it's entirely possible that what you're seeing is simply a bug somewhere in the compiler stack, but of course I can't say for certain. If it isn't a bug, then to me this means something subtle is missing from the way the new sync operations are described. Either way, some clarification would be good.



================
Comment at: llvm/docs/ConvergentOperations.rst:27-29
+A convergent operation involves inter-thread communication outside of the
+memory model, where the set of threads which participate in communication is
+implicitly affected by control flow.
----------------
jlebar wrote:
> nhaehnle wrote:
> > dsanders wrote:
> > > This is rather nit-picky, but there are some convergent operations where inter-thread communication isn't happening, depending on how you model it. For example, a population count could be modelled as threads communicating (sum of 0 or 1 responses), which fits the definition here, but it could also be modelled as threads optionally communicating (count of responses received), or as an external thread-manager broadcasting its count to the threads. Either way, communication is still happening, but the second and third models are stretching the definition a bit.
> > > 
> > > I don't think it's worth bogging down the main text for that nitpick but it might be worth clarifying in a footnote or something that receiving/sending any data from, to, or about another thread counts as communication. Also, declining to communicate counts as communication if it affects the outcome.
> > That's a fair point. The way I'm thinking about this is that there may be communication with a `void` payload, but ultimately this can be bikeshed to death.
> CUDA `__syncthreads()` is the prototypical convergent function (at least, it was -- maybe under this definition it's not?), but syncthreads does not exchange any information.  It's just a barrier.
> 
> Assuming you still consider syncthreads to be convergent, my concern is someone would read this and (quite reasonably) think that we are incorrectly modeling it as convergent.
> 
> > The way I'm thinking about this is that there may be communication with a void payload,
> 
> If you count "communicate nil" as communication, then perhaps the operation is not in fact communication but rather is "communication or synchronization"?  Perhaps:
> 
> > A convergent operation involves inter-thread communication or synchronization that occurs outside of the memory model, where the set of threads which participate in the inter-thread operation is implicitly affected by control flow.
Your suggestion looks good to me, going to apply it.


================
Comment at: llvm/docs/ConvergentOperations.rst:88
+    ...
+    color = textureSample(texture, coordinates);
+    if (condition) {
----------------
jlebar wrote:
> Up to you, but I think this example would be more evocative if we wrote out the definition of textureSample.  I am imagining that it involves something like a `__shfl`, but that's because I already understand GPUs.  Your audience is bigger than that.
`textureSample` is actually a built-in function of graphics languages. I'm going to add a clause to try to clarify that. I assume all GPUs have dedicated circuitry for it. I specifically wanted to mention `textureSample` in the document at least once because it (and some close analogs) are often forgotten in discussions of convergent even by graphics people like myself.

Obviously the document should also be accessible to folks from the GPU compute world, which is why I tried to give a succinct explanation of the relevant facts about `textureSample` in the paragraph above.

Later in the document there are also examples using shuffles, though with the Khronos-y spelling of `subgroupShuffle` instead of the CUDA-y `__shfl`. The choice of spelling is partly because that's just the world I personally work in most of the time, but also partly because I'd prefer to use terms from common industry standards. I understand that CUDA is a bit of a de facto "standard", so if you think it's necessary to convert at least one example to CUDA spelling, we can do that -- just not this one in particular, because it's specifically meant to be a graphics shader example.


================
Comment at: llvm/docs/ConvergentOperations.rst:220
+Consider an example of how jump threading removes structure in a way that can
+make semantics non-obvious:
+
----------------
jlebar wrote:
> Nit: Clarify that this example isn't using the proposed convergence intrinsics?  Perhaps
> 
> > Consider an example of how jump threading removes structure in a way that can make semantics non-obvious without the convergence intrinsics described in this document.
Thanks, going to make this change.


================
Comment at: llvm/docs/ConvergentOperations.rst:249
+  entry:
+      br i1 %cond1, label %then1, label %then2
+
----------------
jlebar wrote:
> Nit: Add ellipsis above this line, or remove it in the equivalent spot in the original code?
Added ellipsis.


================
Comment at: llvm/docs/ConvergentOperations.rst:347
+
+  define i32 @reserveSpaceInBuffer() {    ; NOTE: _not_ a convergent function!
+  entry:
----------------
jlebar wrote:
> ...wait, there are such things as convergent functions?  This is the first I'm hearing about it in the doc!  So far it seemed there were only convergent *calls*.  What's a convergent function?  :)
Uhh... technically true. How about adding something like the following somewhere:


> In LLVM IR, function calls are the only instructions that can involve convergent
> operations. A call itself (i.e., the act of jumping to the callee, setting up a
> stack frame, etc.) is not a convergent operation. However, if the callee uses
> the ``llvm.experimental.convergence.entry`` intrinsic, then we think of the
> entire execution of the callee as a convergent operation from the perspective of
> the calling function. Such callees must be marked with the ``convergent``
> attribute, and for brevity we say that they are "convergent functions". If the
> callee isn't known at the call site (i.e., an indirect function call), then the
> ``call`` instruction itself must have the ``convergent`` attribute.
> 
> The only reason why a function F would need to use the
> ``llvm.experimental.convergence.entry`` intrinsic is if F in turn uses some
> other convergent operation, i.e., a call to a convergent function. Chains of
> such calls are expected to eventually end with the use of a (target-specific)
> intrinsic that is ``convergent``.
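To illustrate, a minimal IR sketch of such a chain (`@helper` and `@target_convergent_intrinsic` are hypothetical placeholders):

  ; A "convergent function": it uses the entry intrinsic and carries the
  ; convergent attribute.
  define i32 @helper() convergent {
    %tok = call token @llvm.experimental.convergence.entry()
    %v = call i32 @target_convergent_intrinsic() [ "convergencectrl"(token %tok) ]
    ret i32 %v
  }

  ; From the caller's perspective, the entire execution of @helper is a
  ; single convergent operation, controlled by the token in the bundle.
  define i32 @caller() convergent {
    %tok = call token @llvm.experimental.convergence.entry()
    %v = call i32 @helper() [ "convergencectrl"(token %tok) ]
    ret i32 %v
  }

  declare token @llvm.experimental.convergence.entry()
  declare i32 @target_convergent_intrinsic() convergent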


================
Comment at: llvm/docs/ConvergentOperations.rst:498
+Behavior is undefined if the containing function was called from IR without
+a ``convergencectrl`` bundle.
+
----------------
jlebar wrote:
> Do you plan to check this in the verifier (insofar as possible, I understand that it's not possible to check this for cross-TU calls).
Do we typically check "mere UB" in the verifier? Thinking about it a little, doing this seems risky for IR linking: it would mean that linking two well-formed modules together can produce an ill-formed one. If that's something that already exists and we're okay with it, then I'd be happy to add such checks, but I wouldn't want to be the one to introduce them...


================
Comment at: llvm/docs/ConvergentOperations.rst:506
+Behavior is undefined if this intrinsic appears in a function that isn't
+``convergent``.
+
----------------
jlebar wrote:
> This one is a local property -- could we say that this makes the program ill-formed, instead of UB?
Yes, that's a good idea.


================
Comment at: llvm/docs/ConvergentOperations.rst:510
+:ref:`convergence region <convergence_region>` or outside of a function's entry
+block.
+
----------------
jlebar wrote:
> Again, could we say this makes the program ill-formed?  (At least the entry-block check, I'm not sure what a convergence region is, yet.)
The entry-block check should be straightforward.


================
Comment at: llvm/docs/ConvergentOperations.rst:594
+
+1. Let U be a controlled convergent operation other than
+   :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
----------------
jlebar wrote:
> Have we formally defined what a "controlled" convergent operation is?  Do you mean a `call` to a `convergent` function with a `"convergencectrl"` operand bundle?  (Say that?)
Yes, the section "Dynamic Instances and Convergence Tokens" already says this:

> The convergence control intrinsics described in this document and convergent
> operations that have a ``convergencectrl`` operand bundle are considered
> *controlled* convergent operations.

I'm going to add an anchor there since the doc is pretty long :)
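For the avoidance of doubt, the distinction in IR looks like this (a sketch; `@op` stands for some convergent callee, and the two calls are shown together only for contrast):

  %tok = call token @llvm.experimental.convergence.anchor()

  ; controlled: the convergent call carries a convergencectrl operand bundle
  call void @op() [ "convergencectrl"(token %tok) ]

  ; uncontrolled: the same convergent call without the bundle
  call void @op()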


================
Comment at: llvm/docs/ConvergentOperations.rst:954
+all convergent operations refer back to an anchor inside the loop.
+For example (in pseudo-code):
+
----------------
jlebar wrote:
> In this section I would have found it helpful if we'd differentiated upfront between the three kinds of unrolling:
> 
>  - Partial unrolling of a loop with no known trip multiple (so, there's a "tail" that collects the remaining elements)
>  - Partial unrolling by a trip multiple (so there's no "tail")
>  - Full unrolling, which eliminates the loop
> 
> I think you're saying that only the first kind of unrolling is tricky.
Yes, that's correct, and I'm going to add essentially your three bullets at the top.
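For reference, here are the three shapes in pseudo-code, for a loop whose body is a single call `body()` (my sketch, not text from the document):

  // (1) partial unrolling with a "tail": the remainder iterations run in a
  //     separate loop, so different threads may execute different copies
  while (counter >= 2) { body(); body(); counter -= 2; }
  while (counter > 0)  { body(); counter -= 1; }

  // (2) partial unrolling by a known trip multiple (here, 2): no tail
  while (counter > 0)  { body(); body(); counter -= 2; }

  // (3) full unrolling of a known trip count (here, 3): the loop disappears
  body(); body(); body();

Only the first shape introduces new control flow that can separate threads, which is why it's the tricky one for convergent operations.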


================
Comment at: llvm/docs/ConvergentOperations.rst:981-982
+
+This is likely to change the behavior of the convergent operation if there
+are threads whose initial counter value is not a multiple of 2. That is allowed
+because the anchor intrinsic has implementation-defined convergence behavior
----------------
jlebar wrote:
> It would help me if we could we elaborate with half a sentence what the behavior change might be.
I gave it a try. It ended up being a full sentence though ;)


================
Comment at: llvm/docs/ConvergentOperations.rst:987-988
+
+If the loop contains uncontrolled convergent operations, this unrolling is
+forbidden.
+
----------------
jlebar wrote:
> Do you mean that this *kind of* unrolling is forbidden?
> 
> But if you're going to forbid *all* unrolling of loops with uncontrolled convergent ops...that's going to make CUDA code a lot slower.  Unless you're also going to fix clang, in which case, no objections, but maybe you want to say "will be forbidden once we've updated front-ends"?
Yes, this kind of unrolling. It is already forbidden for uncontrolled convergent operations today. If you want to dig a little deeper, I would appreciate it if you could also add your review to D85605. That's a follow-up change for (1) correctness of loop unrolling with regard to the rules for the `loop` intrinsic, and (2) relaxing, where possible, some of the constraints that exist today when all convergent ops are controlled (by an anchor in the loop).


================
Comment at: llvm/docs/ConvergentOperations.rst:998
+  while (counter > 0) {
+    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
+    ; (B)
----------------
jlebar wrote:
> One thing I don't get from this example is what I should do as a *frontend* to LLVM.  That is, when should I do this form, and when should I put a new anchor inside a loop?
> 
> It seems to me that in (say) CUDA, the compiler can ~never insert an anchor, because inserting an anchor is tantamount to allowing arbitrary divergence right before the anchor.  That is, I have to behave as though the compiler could transform
> 
> ```
> anchor()
> foo();
> ```
> 
> into, effectively
> 
> ```
> if (threadIdx.x % 2 == 0) {
>   anchor()
>   convergent_fn();
> } else {
>   anchor();
>   convergent_fn();
> }
> ```
> 
> Something like this?
> 
> OK, so I always have to use the convergence.loop() form.  But then this is saying I can never unroll.
> 
> ITYM that with convergence.loop(), I can never *partially unroll with a "tail"*, which makes a lot of sense?  But would help me if we were explicit about that.
> ITYM that with convergence.loop(), I can never *partially unroll with a "tail"*, which makes a lot of sense?

Yes, that's correct. Hopefully clearer with the addition at the top of the section.

> It seems to me that in (say) CUDA, the compiler can ~never insert an anchor, because inserting an anchor is tantamount to allowing arbitrary divergence right before the anchor.

Right. The anchor essentially allows you to achieve the same thing as `__activemask` in CUDA, but in a more structured way that doesn't run into problems when you have two sides of an if/else both executing a sync operation with the same thread mask.
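Roughly, in the style of the document's pseudo-code examples (a sketch; `@activemask` and `@sync_op` are placeholder spellings):

  if (condition) {
    %t1 = call token @llvm.experimental.convergence.anchor()
    %mask1 = call i32 @activemask() [ "convergencectrl"(token %t1) ]
    call void @sync_op(i32 %mask1) [ "convergencectrl"(token %t1) ]
  } else {
    %t2 = call token @llvm.experimental.convergence.anchor()
    %mask2 = call i32 @activemask() [ "convergencectrl"(token %t2) ]
    call void @sync_op(i32 %mask2) [ "convergencectrl"(token %t2) ]
  }

Because the operations on each side are tied to distinct anchor tokens, a pass can't merge the two sides into a single call the way it can with the bare `__activemask` pattern.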


================
Comment at: llvm/docs/ConvergentOperations.rst:1032
+  %outer = call token @llvm.experimental.convergence.anchor()
+  while (counter > 0) {
+    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
----------------
jlebar wrote:
> `counter > 1`?
Thanks, changing to `counter >= 2` because that's what I had in a similar example above.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603


