[PATCH] D85603: IR: Add convergence control operand bundle and intrinsics

Mon Nov 30 06:17:52 PST 2020

jholewinski added a comment.

Few questions below. Please bear with me as I try to grok the proposal and the long stream of comments...

> I believe that with this proposal, we can model this with the attributes we have by saying that `subgroupShuffle` is `convergent readnone`, while `__shfl_sync` is `inaccessiblememonly`.

`subgroupShuffle` would require `convergentctrl` and `__shfl_sync` would not, correct?

> There is a straightforward way for a frontend to insert these intrinsics while ensuring correctness and not overly constraining optimization.

This feels like it could use a bit of discussion in the documentation, at least spelling out the straight-forward mapping for a simple C-like language with one example built-in that uses implicit thread masks. My understanding of this proposal implies the following rules:

1. Add call to `llvm.experimental.convergence.entry` to the beginning of every `convergent` function (generally assume the program entry-point is `convergent`)
2. Add call to `llvm.experimental.convergence.anchor` to the beginning of every non-`convergent` function
3. Add call to `llvm.experimental.convergence.loop` to the beginning of every natural loop header block
4. For each call to a `convergent` function (intrinsic or other-wise), attach `convergencectrl` bundle pointing to the closest call to `entry`/`anchor`/`loop`, in terms of nesting

Is this correct for general structured code, or am I missing some case?

Things are less clear when you consider odd looping structures, for example:

  entry:
      entry_token = llvm.experimental.convergence.anchor();
      if (cond1) goto head1;
      else goto head2;

  head1:
      head1_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
      cond2 = ...;
      if cond2 goto tail1;
      else goto tail2;

  head2:
      head2_loop_token = llvm.experimental.convergence.loop() [ "convergencectrl"(entry_token) ]
      break_cond = ...
      if break_cond goto exit;
      else goto head2b;

  head2b:
      cond3 = ...;
      if cond3 goto tail2;
      else goto tail1;

  tail1:
      cond4 = subgroupOp(...);      // What does this anchor to?
      if cond4 goto head1;
      else goto head2;

  tail2:
      cond5 = subgroupOp(...);      // What does this anchor to?
      if cond5 goto head2;
      else goto head1;

  exit:
      ...

Ignoring the new intrinsics, the `subgroupOp` calls may have defined semantics between groups of threads that reach the two loop tail blocks together, at least if you guarantee maximal convergence. When you add in the proposed intrinsics, what do the `subgroupOp` calls anchor to? The `llvm.experimental.convergence.loop` calls do not dominate the `subgroupOp` calls, so the `llvm.experimental.convergence.anchor` call is the only choice. But this breaks the first rule static rule of cycles in the proposal: we have a use of a convergence token inside a cycle but the token is not defined inside the loop. Do we just add a call to `llvm.experimental.convergence.anchor` to both `tail1` and `tail2`, using them as the convergence token for the respective `subgroupOp`?

Is there any concern over implementing and validating all of the necessary logic in all CFG-altering optimization passes? This seems like a large change that will require modifications through-out the existing optimization passes in order to achieve the goal of "ensuring correctness and not overly constraining optimization". I'm just curious if there is any plan being formulated for tackling this issue.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85603/new/

https://reviews.llvm.org/D85603