[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Sameer Sahasrabuddhe via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 11 21:49:05 PDT 2020


sameerds added inline comments.


================
Comment at: llvm/docs/ConvergentOperations.rst:52-55
+The definitions in this document leave many details open, such as how groups of
+threads are formed in the first place. It focuses on the questions that are
+relevant for deciding the correctness of generic program transforms and
+convergence-related analyses such as divergence analysis.
----------------
I think I "get" it now, and it might be related to how this paragraph creates an expectation that is not actually intended. The entire time, I have been reading this document expecting a formal framework that completely captures convergence; something so complete that one can point at any place in the program and decide whether a transform is valid merely by looking at the convergence intrinsics. But that is not the case. The document becomes a lot clearer once it is understood that the intrinsics being introduced are only meant to augment control flow in the context of convergence, not replace it. These intrinsics are only meant to be introduced by the frontend to remove ambiguity about convergence. In particular:

  # In the jump-threading example, the frontend inserts the convergence intrinsics to resolve the ambiguity in favour of maximal convergence.
  # In the loop-unroll example, the frontend disallows unrolling by inserting the anchor outside of the loop and using it inside.
  # In general acyclic control flow, control dependence is entirely sufficient to decide convergence, and the intrinsics have no additional effect. That is why it is okay to hoist/sink anchors in that case.

This last claim is a bit too strong to accept immediately. Is there a way to convince ourselves that the convergence intrinsics are really not required here? Perhaps an exhaustive enumeration of ambiguities that can exist?
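To make item #2 concrete, here is a rough sketch of what I have in mind. It is not taken from the patch; it assumes the intrinsic names and the "convergencectrl" bundle introduced here, with the anchor defined outside the loop and used inside it through the loop heart intrinsic:

  ; Rough sketch (not from the patch): the anchor outside the loop is tied
  ; into the loop body via the loop heart intrinsic, which is what prevents
  ; the unroller from freely duplicating the convergent operation.
  define void @no_unroll(i32 %n) {
  entry:
    %outer = call token @llvm.experimental.convergence.anchor()
    br label %loop

  loop:
    %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
    %heart = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
    call void @convergent.op() [ "convergencectrl"(token %heart) ]
    %i.next = add i32 %i, 1
    %cond = icmp ult i32 %i.next, %n
    br i1 %cond, label %loop, label %exit

  exit:
    ret void
  }

  declare token @llvm.experimental.convergence.anchor()
  declare token @llvm.experimental.convergence.loop()
  declare void @convergent.op() convergent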



================
Comment at: llvm/docs/ConvergentOperations.rst:547-551
+  }
+
+The behavior is unchanged, since each of the static convergent operations only
+ever communicates with threads that have the same ``condition`` value.
+By contrast, hoisting the convergent operations themselves is forbidden.
----------------
nhaehnle wrote:
> sameerds wrote:
> > t-tye wrote:
> > > So the convergent token is the set of threads, but any intervening conditional control flow may change which threads a nested convergent operation may be required to communicate with?
> > > 
> > > My understanding was that the tokens were intended to be explicit in denoting the involved threads to avoid needing to crawl the LLVM IR to determine the control dependence. And were intended to be explicit in preventing control dependence changes. But these examples seem to contradict that understanding.
> > > 
> > > So when a convergent token is used in a dynamic instance of a static convergent operation, what set of threads is it mandating have to participate? Those defined by the dynamic instance of the static token definition that control dependence permits to execute?
> > This is also the transform that CUDA (and potentially HIP) will disallow. Hoisting or sinking a conditional changes the set of threads executing each leg of the branch. In CUDA, the two programs have completely different meanings depending on whether the anchor is outside the branch or inside each leg. There seems to be an opportunity here to relate the notion of an anchor to language builtins that return the mask of currently executing threads.
> CUDA is very different here: the builtins that take an explicit threadmask don't have an implicit dependence on control flow, so they shouldn't be modeled as convergent operations. They have other downsides, which is why we prefer to go down this path of convergent operations.
Combined with my other comment about the introduction, I think the current formalism is compatible with CUDA. One can say that some convergent functions in CUDA have additional semantics about how their dynamic instances communicate with each other. That communication is outside the scope of this document; it is the mask argument that relates the dynamic instances to each other. The current framework still seems sufficient to govern the effect of optimizations on those dynamic instances. For example, it is sufficient that a CUDA ballot is not hoisted/sunk across a condition; the ballot across the two branch legs is managed by the mask, which was created before the branch.
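To illustrate the explicit-mask style, here is a rough sketch, not from the patch, where @active.mask and @ballot.sync are hypothetical stand-ins for CUDA's __activemask() and __ballot_sync(): the mask is produced before the branch, and the ballots in the two legs are related through that mask rather than through a convergence token.

  ; Hypothetical stand-ins for CUDA's __activemask() / __ballot_sync();
  ; these are not intrinsics defined by this patch.
  declare i32 @active.mask()
  declare i32 @ballot.sync(i32, i1) convergent

  define i32 @cuda_style_ballot(i1 %condition, i1 %pred) {
  entry:
    ; The mask is captured before the branch, so the ballots in the two
    ; legs below can be related to each other through %mask even though
    ; they execute under different control dependences.
    %mask = call i32 @active.mask()
    br i1 %condition, label %then, label %else

  then:
    %b0 = call i32 @ballot.sync(i32 %mask, i1 %pred)
    br label %merge

  else:
    %b1 = call i32 @ballot.sync(i32 %mask, i1 %pred)
    br label %merge

  merge:
    %b = phi i32 [ %b0, %then ], [ %b1, %else ]
    ret i32 %b
  }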


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603


