[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Nicolai Hähnle via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 13 00:01:08 PDT 2020


nhaehnle added inline comments.


================
Comment at: llvm/docs/ConvergentOperations.rst:52-55
+The definitions in this document leave many details open, such as how groups of
+threads are formed in the first place. It focuses on the questions that are
+relevant for deciding the correctness of generic program transforms and
+convergence-related analyses such as divergence analysis.
----------------
sameerds wrote:
> nhaehnle wrote:
> > sameerds wrote:
> > > I think I "get" it now, and it might be related to how this paragraph produces an expectation that is actually not intended. The entire time so far, I have been reading this document expecting a formal framework that completely captures convergence; something so complete, that one can point at any place in the program and merely look at the convergence intrinsics to decide whether a transform is valid. But that is not the case. This document becomes a lot more clear if the intrinsics being introduced are only meant to augment control flow but not replace it in the context of convergence. These intrinsics are only meant to be introduced by the frontend to remove ambiguity about convergence. In particular:
> > > 
> > >   # In the jump-threading example, the frontend inserts the convergence intrinsics to resolve the ambiguity in favour of maximal convergence.
> > >   # In the loop-unroll example, the frontend disallows unrolling by inserting the anchor outside of the loop and using it inside.
> > >   # In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case.
> > > 
> > > This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?
> > > 
> > > 3.    In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case.
> > >
> > > This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?
> > 
> > What ambiguities do you have in mind?
> > 
> > If you have a fully acyclic function, then the way you can think about it is: we determine "the" set of threads that execute the function at the entry. At every point in the function, the communication set is then the subset of threads that get to that point. It's easy to evaluate this if you just topologically sort the blocks and then evaluate them in that order.
> Your explanation intuitively makes sense, but it is not clear how to reconcile it with jump threading. That's one of the "ambiguities" I had in mind when dealing with acyclic control flow. It's almost like the text needs a paragraph explaining that "structured acyclic control flow" already contains sufficient information about convergence, but general acyclic control flow needs special attention in specific cases, starting with jump threading.
I hesitate to write anything like that, because then you get into the problem of defining what "structured" means -- there are multiple definitions in the literature.

My argument would be that purely acyclic control flow -- whether structured or not -- contains sufficient information about convergence to define semantics consistently, without assistance, and avoiding spooky action at a distance.

That you still need some assistance to make actual *guarantees* is really down to composability. For example, you can have a fully acyclic function called from inside a cycle, and then the question is what happens at inlining. One can explore an alternative scheme where you don't have to insert anything into the acyclic function in this case and it is instead the job of the inlining transform to fix things up, and I have done some exploring in this direction. There are at least two downsides:

1. The burden on generic program transforms becomes larger.

2. There is no longer any way for the programmer to express the distinction between functions (or sub-sections of code) that care about the set of threads with which they're executed vs. those that don't (like the `@reserveSpaceInBuffer` example I added). That closes the door on certain performance optimizations **and** becomes problematic if you want to start thinking about independent forward progress.
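To make the composability point concrete, here is a sketch (not taken from the patch; `@convergent.op` and the function names are hypothetical) of a fully acyclic function that anchors its own communication set, called from inside a cycle. Because the anchor travels with the body at inlining, the inliner needs no special fix-up:

```llvm
; Hypothetical acyclic function: the anchor picks an implementation-defined
; set of threads, independent of how the caller reached the call site.
define void @acyclic() {
  %t = call token @llvm.experimental.convergence.anchor()
  call void @convergent.op() [ "convergencectrl"(token %t) ]
  ret void
}

define void @caller(i32 %n) {
entry:
  br label %loop
loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  ; After inlining, the anchor and its use land inside the cycle as-is;
  ; without the anchor, inlining would have to invent a convergence token.
  call void @acyclic()
  %i.next = add i32 %i, 1
  %c = icmp slt i32 %i.next, %n
  br i1 %c, label %loop, label %exit
exit:
  ret void
}
```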


================
Comment at: llvm/docs/ConvergentOperations.rst:280
+We generally take the stance that reconvergence in acyclic control flow must
+be maximal. The compiler frontend could augment the original code as follows:
+
----------------
sameerds wrote:
> nhaehnle wrote:
> > sameerds wrote:
> > > It was the optimizer that introduced the ambiguity ... should the optimizer be responsible for adding the necessary intrinsics that preserve the original convergence? 
> > No. The jump-threaded code could also come out of C(++) code with `goto`s, so this doesn't really work.
> But what about the flip side? If the frontend is sure that only structured control flow is present in the input program, can it skip inserting the convergence intrinsics? Or should it still insert those intrinsics just in case optimizations changed the graph? If yes, is this something that LLVM must prescribe for every frontend as part of this document?
It needs to insert the control intrinsics **if** it wants to have any guarantees. There aren't a lot of useful guarantees we can make today without this, so that's fine.

I don't want to say that frontends absolutely must insert the control intrinsics just yet; that's why uncontrolled convergent operations are allowed but deprecated. Frontends for languages with convergent operations that don't change will remain in the world of "things tend to work as expected a lot of the time, but stuff can break in surprising ways at the least convenient moment" that they are already in today. If they run the ConvergenceControlHeuristic pass just after IR generation, the times where things break will likely be somewhat reduced, but probably not eliminated entirely. It's difficult to make a definitive claim because there is obviously also the question of which guarantees the high-level language is supposed to give to the developer. For an HLL that just doesn't want to give any guarantees, not inserting control intrinsics is fine from the POV of language spec correctness, although you're likely to run into corner cases where the language behavior clashes with developers' intuitive expectations.
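For illustration, here is a sketch of the same convergent call with and without explicit control (`@subgroup.ballot` is a hypothetical convergent operation, not part of the patch):

```llvm
; Uncontrolled (allowed but deprecated): transforms can give few guarantees
; about which threads communicate.
define i32 @uncontrolled(i1 %cc) convergent {
  %b = call i32 @subgroup.ballot(i1 %cc)
  ret i32 %b
}

; Controlled: the entry intrinsic ties the call to the set of threads
; that entered the function together.
define i32 @controlled(i1 %cc) convergent {
  %t = call token @llvm.experimental.convergence.entry()
  %b = call i32 @subgroup.ballot(i1 %cc) [ "convergencectrl"(token %t) ]
  ret i32 %b
}
```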


================
Comment at: llvm/docs/ConvergentOperations.rst:611-612
+        dynamic instance of the defining instruction, and
+     2. There is an *n* such that both threads execute U for the *n*'th time
+        with that same token operand value.
+
----------------
sameerds wrote:
> nhaehnle wrote:
> > sameerds wrote:
> > > The older comments about this seem to have floated away. At risk of repeating the discussion, what is *n* capturing? Is it meant to relate copies of the call U created by unrolling the loop, for example?
> > It's really just a loop iteration counter. Every time a thread executes the `loop` intrinsic, it executes a new dynamic instance of it. You could think of this dynamic instance being labeled by the iteration, and then whether a thread executes the same dynamic instance as another thread depends in part on whether they have the same loop iteration label.
> > 
> > Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the `loop` intrinsic. This means that if you have a natural loop where the `loop` intrinsic is not called in the header but in some other block that is conditional, the loop iterations will be counted in a way that seems funny (but this can actually be put to a potentially good use as I noted elsewhere).
> > 
> > Unrolling will actually not duplicate the `loop` intrinsic, but only keep the copy that corresponds to the first unrolled iteration.
> > Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the loop intrinsic.
> 
> This seems to be a defining characteristic for the heart of the loop. Must the heart be a place that is always reached on every iteration?
> 
> > Unrolling will actually not duplicate the `loop` intrinsic, but only keep the copy that corresponds to the first unrolled iteration.
> 
> This is a bit of a surprise. My working assumption was that the call to the intrinsic is just like any other LLVM instruction, and it will be copied. Then the document needs to specify that the copy should be eliminated.
> 
> 
> >    Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the loop intrinsic.
>
> This seems to be a defining characteristic for the heart of the loop. Must the heart be a place that is always reached on every iteration?

Well... what even **is** a loop iteration? :)

For the purpose of convergence, the loop heart defines what the iterations are, so it is reached on every iteration *by definition*. (But there may well be cycles in the CFG that don't contain a loop intrinsic, and that's fine.)

More likely your real question is whether in a natural loop, the loop intrinsic must be reached once per execution of the loop header (or traversal of a back edge) -- the answer is no.

Part of the rationale here (and also an unfortunately inherent source of potential confusion) is that for defining convergence, and more generally for implementing whole-program vectorization of the style we effectively do in AMDGPU, leaning only on natural loops doesn't work, at least in part because of the possibility of irreducible control flow. This is why all the actual algorithms I'm building on this rely on the Havlak-inspired CycleInfo of D83094, and all the rules in this document are expressed in terms of cycles (in the sense of circular walks in the CFG) instead of natural loops.

> My working assumption was that the call to the intrinsic is just like any other LLVM instruction, and it will be copied. Then the document needs to specify that the copy should be eliminated.

I would have liked to have that property but couldn't make it work without imposing static rules that would be much harder to understand and follow. The point about unrolling is mentioned in the later examples section where I talk through a bunch of example loops and whether they can be unrolled or not.
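To illustrate the "funny" counting: here is a CFG fragment (shapes and names hypothetical, not from the patch) where the heart sits in a conditional block of a natural loop. A thread only "counts" an iteration when it reaches `%heart`, so iteration labels need not match header executions:

```llvm
; Natural loop whose heart is conditional: threads that skip %heart on a
; given trip through the header do not advance their iteration label.
loop.header:
  br i1 %cond, label %heart, label %latch
heart:
  %t = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
  call void @convergent.op() [ "convergencectrl"(token %t) ]
  br label %latch
latch:
  br i1 %again, label %loop.header, label %exit
```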


================
Comment at: llvm/docs/ConvergentOperations.rst:749-754
+  %outer = call token @llvm.experimental.convergence.anchor()
+  while (counter > 0) {
+    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
+    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
+    counter--;
+  }
----------------
sameerds wrote:
> nhaehnle wrote:
> > sameerds wrote:
> > > So unrolling is forbidden because it fails to preserve the set of threads that execute the same dynamic instance of loop() for n=0 and n=1?
> > Not sure what you mean by n=0 and n=1. The issue is that if some threads go through the remainder loop while others execute more iterations, then the set of threads will be partitioned into those that take the remainder loop and those that don't.
> The n that I used is the virtual loop count that is described in the loop intrinsic. The example needs to explain how the rules established in this document prevent the unrolling. The intuitive explanation is in terms of sets of threads, but what is the formal explanation in terms of the static rules for dynamic instances?
The formal explanation **is** ultimately that the set of communicating threads is changed, but I agree that it could be helpful to spell out how that comes about via the rules on dynamic instances, so I'm going to do that.
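One way to see the partition, adapting the snippet above (heavily hedged sketch, not the patch's example): unrolling by 2 with a remainder moves the final `@convergent.operation` of odd-trip-count threads into a place that even-trip-count threads never reach.

```llvm
%outer = call token @llvm.experimental.convergence.anchor()
while (counter >= 2) {
  %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
  call void @convergent.operation() [ "convergencectrl"(token %inner) ]
  call void @convergent.operation() [ "convergencectrl"(token %inner) ]
  counter -= 2;
}
if (counter > 0) {  ; only threads with an odd trip count get here
  call void @convergent.operation() [ "convergencectrl"(token %outer) ]
}
```

In the original loop, a thread's final operation communicated with every thread executing the same iteration number; here it communicates only with the odd-trip-count subset, so the communicating set changes.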


================
Comment at: llvm/docs/ConvergentOperations.rst:759
+if the loop counter is known to be a multiple, then unrolling is allowed,
+though care must be taken to correct the use of the loop intrinsic.
+For example, unrolling by 2:
----------------
sameerds wrote:
> nhaehnle wrote:
> > sameerds wrote:
> > > Correcting the use of the loop intrinsic seems to be a delicate matter. There is a rule which talks about "two or more uses by loop()" inside a loop body, and this particular example seems to side-step exactly that by eliminating one call to loop().
> > Correct.
> > 
> > I did think about whether it was possible to eliminate that static rule, but it gets nasty really quickly, for example if you try to unroll loops with multiple exits. The way it's written, a modification to loop unrolling is required (D85605), but it's ultimately the less painful solution.
> I still don't really understand what the "two or more" rule is for. One outcome of the rule seems to be that for a loop L2 nested inside loop L1, if L1 uses a token defined outside L1, then L2 cannot use the same token. I didn't get very far beyond that.
I'm adding a "rationale" section specifically to explain those static rules about cycles.
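As I understand the restriction (sketch with hypothetical identifiers): for a loop L2 nested inside L1, where L1's heart uses a token `%outer` defined outside L1, L2 must chain through L1's token rather than reuse `%outer` -- otherwise the outer cycle would contain two static uses of `%outer` by `loop()` without containing its definition.

```llvm
%outer = call token @llvm.experimental.convergence.anchor()
; L1 heart:
%l1 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
; L2 heart (nested in L1) chains through %l1; using %outer here would
; violate the "two or more uses" rule for the outer cycle:
%l2 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %l1) ]
```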


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603


