[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Sameer Sahasrabuddhe via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Aug 12 21:59:02 PDT 2020


sameerds added inline comments.


================
Comment at: llvm/docs/ConvergentOperations.rst:52-55
+The definitions in this document leave many details open, such as how groups of
+threads are formed in the first place. It focuses on the questions that are
+relevant for deciding the correctness of generic program transforms and
+convergence-related analyses such as divergence analysis.
----------------
nhaehnle wrote:
> sameerds wrote:
> > I think I "get" it now, and it might be related to how this paragraph produces an expectation that is actually not intended. The entire time so far, I have been reading this document expecting a formal framework that completely captures convergence; something so complete, that one can point at any place in the program and merely look at the convergence intrinsics to decide whether a transform is valid. But that is not the case. This document becomes a lot more clear if the intrinsics being introduced are only meant to augment control flow but not replace it in the context of convergence. These intrinsics are only meant to be introduced by the frontend to remove ambiguity about convergence. In particular:
> > 
> >   # In the jump-threading example, the frontend inserts the convergence intrinsics to resolve the ambiguity in favour of maximal convergence.
> >   # In the loop-unroll example, the frontend disallows unrolling by inserting the anchor outside of the loop and using it inside.
> >   # In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case.
> > 
> > This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?
> > 
> > 3.    In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case.
> >
> > This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?
> 
> What ambiguities do you have in mind?
> 
> If you have a fully acyclic function, then the way you can think about it is: we determine "the" set of threads that execute the function at the entry. At every point in the function, the communication set is then the subset of threads that get to that point. It's easy to evaluate this if you just topologically sort the blocks and then evaluate them in that order.
Your explanation intuitively makes sense, but it is not clear how to reconcile it with jump threading. That's one of the "ambiguities" I had in mind when dealing with acyclic control flow. It's almost like the text needs a paragraph explaining that "structured acyclic control flow" already contains sufficient information about convergence, but that general acyclic control flow needs special attention in specific cases, starting with jump threading.
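To make the ambiguity concrete, here is a small sketch (function and block names are hypothetical) of acyclic IR with a convergent operation in a join block; this is the shape that jump threading would pull apart:

```llvm
; Sketch: a convergent operation in a block with two predecessors.
define void @example(i1 %cond) {
entry:
  br i1 %cond, label %then, label %else

then:
  br label %merge

else:
  br label %merge

merge:
  ; Jump threading could duplicate %merge along each incoming edge,
  ; splitting the threads that communicate here into two separate sets.
  call void @convergent.op()
  ret void
}

declare void @convergent.op() convergent
```

In the original program all threads reaching %merge communicate together; after duplication, the two copies of the call would each communicate only with the threads on their own edge.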


================
Comment at: llvm/docs/ConvergentOperations.rst:280
+We generally take the stance that reconvergence in acyclic control flow must
+be maximal. The compiler frontend could augment the original code as follows:
+
----------------
nhaehnle wrote:
> sameerds wrote:
> > It was the optimizer that introduced the ambiguity ... should the optimizer be responsible for adding the necessary intrinsics that preserve the original convergence? 
> No. The jump-threaded code could also come out of C(++) code with `goto`s, so this doesn't really work.
But what about the flip side? If the frontend is sure that only structured control flow is present in the input program, can it skip inserting the convergence intrinsics? Or should it still insert them just in case optimizations change the graph? If so, is this something that LLVM must prescribe for every frontend as part of this document?


================
Comment at: llvm/docs/ConvergentOperations.rst:611-612
+        dynamic instance of the defining instruction, and
+     2. There is an *n* such that both threads execute U for the *n*'th time
+        with that same token operand value.
+
----------------
nhaehnle wrote:
> sameerds wrote:
> > The older comments about this seem to have floated away. At risk of repeating the discussion, what is *n* capturing? Is it meant to relate copies of the call U created by unrolling the loop, for example?
> It's really just a loop iteration counter. Every time a thread executes the `loop` intrinsic, it executes a new dynamic instance of it. You could think of this dynamic instance being labeled by the iteration, and then whether a thread executes the same dynamic instance as another thread depends in part on whether they have the same loop iteration label.
> 
> Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the `loop` intrinsic. This means that if you have a natural loop where the `loop` intrinsic is not called in the header but in some other block that is conditional, the loop iterations will be counted in a way that seems funny (but this can actually be put to a potentially good use as I noted elsewhere).
> 
> Unrolling will actually not duplicate the `loop` intrinsic, but only keep the copy that corresponds to the first unrolled iteration.
> Note that for the purpose of labeling, threads can never "skip" an iteration! They all start at 0 and increment when they reach the loop intrinsic.

This seems to be a defining characteristic for the heart of the loop. Must the heart be a place that is always reached on every iteration?

> Unrolling will actually not duplicate the `loop` intrinsic, but only keep the copy that corresponds to the first unrolled iteration.

This is a bit of a surprise. My working assumption was that the call to the intrinsic is just like any other LLVM instruction, and that it would be copied along with the rest of the loop body. If the copies must not survive, then the document needs to specify that they should be eliminated.
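For my own understanding, here is a sketch of how I read the iteration-labeling rule, with the `loop` intrinsic placed in the header (names are hypothetical and the intrinsic declarations are abbreviated):

```llvm
define void @counted.loop(i32 %n) {
entry:
  %anchor = call token @llvm.experimental.convergence.anchor()
  br label %header

header:
  %i = phi i32 [ 0, %entry ], [ %i.next, %header ]
  ; A new dynamic instance of the loop intrinsic begins on every trip
  ; through the header, so the iteration label increments exactly once
  ; per iteration for every thread that stays in the loop.
  %heart = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
  call void @convergent.op() [ "convergencectrl"(token %heart) ]
  %i.next = add i32 %i, 1
  %cont = icmp ult i32 %i.next, %n
  br i1 %cont, label %header, label %exit

exit:
  ret void
}

declare token @llvm.experimental.convergence.anchor()
declare token @llvm.experimental.convergence.loop()
declare void @convergent.op() convergent
```

With the intrinsic in the header, two threads execute the same dynamic instance of @convergent.op exactly when they are on the same iteration count relative to the same anchor.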




================
Comment at: llvm/docs/ConvergentOperations.rst:749-754
+  %outer = call token @llvm.experimental.convergence.anchor()
+  while (counter > 0) {
+    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
+    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
+    counter--;
+  }
----------------
nhaehnle wrote:
> sameerds wrote:
> > So unrolling is forbidden because it fails to preserve the set of threads that execute the same dynamic instance of loop() for n=0 and n=1?
> Not sure what you mean by n=0 and n=1. The issue is that if some threads go through the remainder loop while others execute more iterations, then the set of threads will be partitioned into those that take the remainder loop and those that don't.
The n that I used is the virtual loop iteration count described in the definition of the loop intrinsic. The example needs to explain how the rules established in this document prevent the unrolling. The intuitive explanation is in terms of sets of threads, but what is the formal explanation in terms of the static rules for dynamic instances?
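To be concrete, here is a sketch, in the same mixed pseudocode style as the example above, of what the (invalid) unroll-by-2 with a remainder would look like, if I understand correctly:

```llvm
  %outer = call token @llvm.experimental.convergence.anchor()
  while (counter >= 2) {
    %main = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
    call void @convergent.operation() [ "convergencectrl"(token %main) ]
    call void @convergent.operation() [ "convergencectrl"(token %main) ]
    counter -= 2;
  }
  if (counter > 0) {
    ; remainder: reached only by threads whose counter was odd
    %rem = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
    call void @convergent.operation() [ "convergencectrl"(token %rem) ]
  }
```

If threads enter with different counter values, only the odd-counter threads reach %rem, so the dynamic instances of the remainder's convergent operation communicate among a strict subset of the threads that converged on the corresponding iteration of the original loop.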


================
Comment at: llvm/docs/ConvergentOperations.rst:759
+if the loop counter is known to be a multiple, then unrolling is allowed,
+though care must be taken to correct the use of the loop intrinsic.
+For example, unrolling by 2:
----------------
nhaehnle wrote:
> sameerds wrote:
> > Correcting the use of the loop intrinsic seems to be a delicate matter. There is a rule which talks about "two or more uses by loop()" inside a loop body, and this particular example seems to side-step exactly that by eliminating one call to loop().
> Correct.
> 
> I did think about whether it was possible to eliminate that static rule, but it gets nasty really quickly, for example if you try to unroll loops with multiple exits. The way it's written, a modification to loop unrolling is required (D85605), but it's ultimately the less painful solution.
I still don't really understand what the "two or more" rule is for. One outcome of the rule seems to be that for a loop L2 nested inside loop L1, if L1 uses a token defined outside L1, then L2 cannot use the same token. I didn't get very far beyond that.
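For reference, here is the nested-loop pattern I mean, in the same mixed pseudocode style (this is my reading of the rule, so take it with a grain of salt):

```llvm
  %outer = call token @llvm.experimental.convergence.anchor()
  while (...) {        ; loop L1
    %h1 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
    while (...) {      ; loop L2
      ; Invalid under the "two or more uses" rule: the cycle around L1
      ; now contains two static uses of %outer by the loop intrinsic
      ; but not the definition of %outer. L2 would have to use %h1.
      %h2 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
      call void @convergent.operation() [ "convergencectrl"(token %h2) ]
    }
  }
```

That accounts for the nesting restriction, but I don't yet see what else the rule buys.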


================
Comment at: llvm/docs/ConvergentOperations.rst:547-551
+  }
+
+The behavior is unchanged, since each of the static convergent operations only
+ever communicates with threads that have the same ``condition`` value.
+By contrast, hoisting the convergent operations themselves is forbidden.
----------------
nhaehnle wrote:
> sameerds wrote:
> > nhaehnle wrote:
> > > sameerds wrote:
> > > > t-tye wrote:
> > > > > So the convergent token is the set of threads, but any intervening conditional control flow may change which threads a nested convergent operation may be required to communicate with?
> > > > > 
> > > > > My understanding was that the tokens were intended to be explicit in denoting the involved threads to avoid needing to crawl the LLVM IR to determine the control dependence. And were intended to be explicit in preventing control dependence changes. But these examples seem to contradict that understanding.
> > > > > 
> > > > > So when a convergent token is used in a dynamic instance of a static convergent operation, what set of threads is it mandating have to participate? Those defined by the dynamic instance of the static token definition that control dependence permits to execute?
> > > > This is also the transform that CUDA (and potentially HIP) will disallow. Hoisting or sinking a conditional changes the set of threads executing each leg of the branch. In CUDA, the two programs have completely different meanings depending on whether the anchor is outside the branch or inside each leg. There seems to be an opportunity here to relate the notion of an anchor to language builtins that return the mask of currently executing threads.
> > > CUDA is very different here: the builtins that take an explicit threadmask don't have an implicit dependence on control flow, so they shouldn't be modeled as convergent operations. They have other downsides, which is why we prefer to go down this path of convergent operations.
> > Combined with my other comment about the introduction, I think the current formalism is compatible with CUDA. One can say that some convergent functions in CUDA have additional semantics about how different dynamic instances communicate with each other. That communication is outside the scope of this document, where the mask argument is used to relate the dynamic instances. The current framework seems to be sufficient to govern the effect of optimizations on the dynamic instances. For example, it is sufficient that a CUDA ballot is not hoisted/sunk across a condition; the ballot across the two branch legs is managed by the mask, which was created before the branch.
> I don't understand what you're trying to get at here.
> 
> The semantics of modern CUDA builtins are fully captured by saying they're non-convergent, but they have a side effect. That side effect is communication with some set of other threads, but that set isn't affected by control flow, it's fully specified by an explicit argument. Because of this, there is no need to argue about dynamic instances.
> 
> All legal program transforms subject to those constraints are then legal. There is no need to label them as `convergent`. If you can think of a counter-example, I'd be curious to see it.
I am trying to understand whether there are constructs in Clang-supported high-level languages that cannot be addressed by these intrinsics, and if such constructs do exist, whether they gate the adoption of this enhancement in LLVM. But I see your point now. The sync() builtins in CUDA no longer depend on convergence; the decision to hoist or sink them is based entirely on other things, like data dependences (and maybe just that).
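In other words, a mask-based builtin would be modeled as an ordinary side-effecting call. A sketch (the name and signature below are made up for illustration, not the real NVVM intrinsic):

```llvm
; Hypothetical mask-based ballot: no convergent attribute and no
; convergencectrl bundle -- the set of communicating threads comes
; entirely from the explicit %mask argument, so ordinary side-effect
; rules are the only constraint on moving the call.
declare i32 @ballot.sync(i32 %mask, i1 %pred)

define i32 @use(i32 %mask, i1 %p) {
  %r = call i32 @ballot.sync(i32 %mask, i1 %p)
  ret i32 %r
}
```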


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603


