[PATCH] D147116: [RFC] Introduce convergence control intrinsics

Sun Jun 18 21:23:16 PDT 2023

sameerds added a comment.

In D147116#4429100 <https://reviews.llvm.org/D147116#4429100>, @efriedma wrote:

> The result of the change I'm suggesting is that for a loop like `while (true) { if (g()) { convergent_op(); break; } }`, convergent_op() actually stays inside the CFG loop.  If all operations lexically inside the loop are also inside the CFG loop, we don't need tokens to figure out which operations are lexically inside the loop.

Yeah, this is how we have always looked at convergence. The new tokens actually try to move away from that picture. Here are a number of different angles to view this from:

1. It's not useful to always think in terms of "all threads". The tokens returned by the new intrinsics help further specify "which set of threads" converges at a given operation.
2. The implicit convergence derived from control dependences is kinda sufficient to work with "all threads". It is an approximation that allows a single-thread view to do safe things around convergent operations. But it is not sufficient to clearly specify the actual relation between the CFG and the convergence of multiple threads.
3. For example, in the same loop or CFG region, etc., one convergent op might be interested in a local convergence captured by the `anchor` intrinsic, while another might be interested in the threads captured by the `loop` intrinsic. Now, that `loop` intrinsic might itself have a token argument returned by an `anchor` intrinsic outside the loop. The subset relationship between all these sets of threads is captured by the constraint on convergence regions.
4. Until code generation, it is sufficient to just record the relationship between sets of convergent threads. The usual transforms only have to follow the simple static rules about loop hearts and convergence regions to ensure correctness.
5. The transformation that you are thinking of is actually performed by the backend, where it will "pull" the convergent ops on the exit edges into the loop, and introduce proper mask manipulation to make sure that the right set of threads in a wave/warp executes it.
6. Until then, we do not actually want to pull that convergent op into the loop body. That will produce unnecessary constraints on transforms working with the loop. The convergent op is most definitely on the exit of the loop. And it's useful to keep it there.
7. The next step in these patches is to introduce an analysis (D85608 <https://reviews.llvm.org/D85608>) that captures "extended cycles" like the one we are discussing here. This will be used by other analyses and transforms that are "convergence aware" to reason about these extended cycles. This does not require the frontend or any other entity to modify the cycle structure, and no new rules are imposed on the LLVM IR. One example is an enhancement to UniformityAnalysis, where it will recognize some cases of "temporal divergence" that are actually uniform because they are on the exit path of this example.
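
To make the "which set of threads" point concrete, here is a toy model of the loop from the quoted example. It is only a sketch, not how LLVM or any backend implements this. The names `ThreadSet`, `groupsWithLoopToken` and `groupWithOuterToken` are hypothetical, and the model assumes each thread's exit iteration is known up front and that the outer token covers every thread that entered the loop. It contrasts a `convergent_op()` on the exit path that uses the token from the `loop` intrinsic with one that uses a token defined outside the loop.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical model: exitIteration[t] is the 0-based loop iteration in which
// thread t sees g() return true and takes the break.
using ThreadSet = std::vector<std::size_t>;

// convergent_op() on the exit path, using the token produced by the loop
// heart: threads that left the loop in the same iteration execute it together.
std::map<unsigned, ThreadSet>
groupsWithLoopToken(const std::vector<unsigned> &exitIteration) {
  std::map<unsigned, ThreadSet> groups;
  for (std::size_t t = 0; t < exitIteration.size(); ++t)
    groups[exitIteration[t]].push_back(t);
  return groups;
}

// convergent_op() on the exit path, using a token defined outside the loop
// (assumed here to cover every thread that entered it): all of those threads
// execute it together, regardless of when each one left the loop.
ThreadSet groupWithOuterToken(const std::vector<unsigned> &exitIteration) {
  ThreadSet all;
  for (std::size_t t = 0; t < exitIteration.size(); ++t)
    all.push_back(t);
  return all;
}
```

For four threads with exit iterations `{0, 2, 0, 1}`, the loop-token version yields the groups `{0, 2}`, `{3}` and `{1}`, each a subset of the single outer-token group `{0, 1, 2, 3}`. In both cases the op itself stays on the exit edge; only the token says which threads it converges with.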

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D147116