[llvm-dev] RFC: LLVM Coroutine Representation, Round 2

Mon Jul 11 06:47:09 PDT 2016

Hi all:

Thank you very much for the feedback during the first round! Special thanks to
Eli, Sanjoy, Hal and Chandler for their detailed public and private comments.
The model is simpler now and the intrinsics have nice symmetry to them:
coro.begin + coro.end, coro.alloc + coro.free. Compared to the previous RFC,
coro.fork is gone, coro.init and coro.start are collapsed into one coro.begin,
coro.suspend is three-way as opposed to two-way. coro.elide and coro.delete
renamed coro.alloc and coro.free.

All the changes implemented and tested. Full document, (nicely formatted by
github is at:

  https://github.com/GorNishanov/llvm/blob/coro-rfc/docs/Coroutines.rst

Below is a summary of a proposal to add coroutine support to LLVM.
The proposal is motivated primarily by the desire to support C++ Coroutines [1],
however the llvm representation is language neutral and can be used
to support coroutines in other languages as well.

Clang + llvm coroutines allows you to take this code:

  generator<int> range(int from, int to) {
     for(int i = from; i < to; ++n)
        co_yield i;
  }
  int main() {
     int sum = 0;
     for (auto v: range(1,100))
        sum += v;
     return sum;
  }

And translate it down to this:

  define i32 @main() #5 {
  entry:
    ret i32 4950
  }

You can also use coroutines in plain C, by calling the builtins mapping to the
intrinsics described by this proposal, so that your coroutine can look like:

    #include "Inputs/coro.h"
    void print(int v);

    void* f(int n) {
      CORO_BEGIN(malloc);

      while (n-- > 0) {
        print(n+1);
        CORO_SUSPEND();
      }

      CORO_END(free);
    }

    // CHECK-LABEL: @main
    int main() {
      void* coro = f(4);
      CORO_RESUME(coro);
      CORO_RESUME(coro);
      CORO_DESTROY(coro);
    }

https://github.com/GorNishanov/clang/blob/coro-rfc/test/Coroutines/coro.c
https://github.com/GorNishanov/clang/blob/coro-rfc/test/Coroutines/Inputs/coro.h

I prototyped llvm changes to support this proposal and extended clang coroutine
implementation [2] to emit proposed intrinsics to smoke test the proposed
design. See "More Attention Required" in the doc/Coroutines.rst [4]for details
on what additional work is required beyond cleanup and bug fixes.

I would like to continue the discussion on the overall design, representation
and a roadmap to start upstreaming the changes with the intention to continue
the development in tree incrementally.

Roadmap:
--------
1) Get agreement on coroutine representation and overall direction (this mail).
  .. repeat 1) until happy <=== ARE ARE HERE
2) Get agreement on how coroutine transformation passes integrate into the
   optimizer pipeline.
  .. repeat 2) until happy
3) update IR/Intrinsics.td + doc/Coroutines.rst + doc/LangRef.rst
4) trickle in coroutine transformation passes + tests in digestible chunks.
5) get clang changes in (sometime after step 3).
6) fix bugs / remove limitations / add functionality to coroutine passes
 .. repeat 6) until happy

Background
==========
LLVM coroutines are functions that have one or more `suspend points`.
When a suspend point is reached, the execution of a coroutine is suspended and
control is returned back to its caller. A suspended coroutine can be resumed
to continue execution from the last suspend point or it can be destroyed.

In the following example, function `f` (which may or may not be a coroutine
itself) returns a handle to a suspended coroutine (**coroutine handle**) that is
used by `main` to resume the coroutine twice and then destroy it:

.. code-block:: llvm

  define i32 @main() {
  entry:
    %hdl = call i8* @f(i32 4)
    call void @llvm.coro.resume(i8* %hdl)
    call void @llvm.coro.resume(i8* %hdl)
    call void @llvm.coro.destroy(i8* %hdl)
    ret i32 0
  }

In addition to the stack frame which exists when a coroutine is executing,
there is an additional region of storage that contains values that keep the
coroutine state when a coroutine is suspended. This region of storage
is called **coroutine frame**. It is created when a coroutine is called and
destroyed when a coroutine runs to completion or destroyed by a call to
the `coro.destroy`_ intrinsic. Unlike stackful coroutines [3] which maintains a
stack per coroutine, an llvm coroutine frame only contains the values that need
to persist across suspend points (there is a path from a use to the definition
that crosses a suspend point).

Overall idea is that a coroutine is presented to an llvm as an ordinary function
with suspension points marked up with intrinsics. We let the optimizer party
on the coroutine for awhile. Shortly before the coroutine is eligible to be
inlined into its callers, we outline parts of the coroutine that correspond to
the code that needs to get executed after the coroutine is resumed or destroyed.
The coroutine resumption intrinsics get replaced with direct calls to those
outlined functions where possible. Then inliner can inline much leaner and nicer
coroutine or any of the outlined parts as desired.

If we discover that the lifetime of a coroutine is fully enclosed in the
lifetime of its caller, we remove dynamic allocation for coroutine frame and
replace it with a static `alloca` on the caller's frame.

Please see doc/Coroutines.rst for more details:

https://github.com/GorNishanov/llvm/blob/coro-rfc/docs/Coroutines.rst

Concerns:
=========
(This section assumes that you looked at doc/Coroutins.rst)

coro.begin: 4 arguments or 5?
-----------------------------

Background:

For standard allocation functions recognized by LLVM, coroutine frame allocation
code looks like:

  entry:
    %size = call i32 @llvm.coro.size.i32(i8* null)
    %alloc = call i8* @malloc(i32 %size)
    %hdl = call noalias i8* @llvm.coro.begin(i8* %alloc, i32 0, i8* null,
                                                                i8* null)

To enable coroutine heap elision with custom allocation functions, `coro.alloc`
intrinsic is used to exclude the allocation code when not needed.

  entry:
    %elide = call i8* @llvm.coro.alloc()
    %0 = icmp ne i8* %elide, null
    br i1 %0, label %coro.begin, label %coro.alloc

  coro.alloc:
    %frame.size = call i32 @llvm.coro.size()
    %alloc = call i8* @MyAlloc(i32 %frame.size)
    br label %coro.begin

  coro.begin:
    %phi = phi i8* [ %elide, %entry ], [ %alloc, %coro.alloc ]
    %frame = call i8* @llvm.coro.begin(i8* %phi, i32 0, i8* null, i8* null)

When using a static alloca for coroutine frame, coro.begin and coro.alloc
intrinsics are replaced with that very alloca (bitcasted to i8* to match the
type).

To find a coro.alloc matching a particular coro.begin, I hunt through the %phi
until I find definition of coro.alloc. Probably more robust implementation
need to use alias analysis to check whether the %phi and %elide may alias.

A simpler approach would be just to add another parameter to directly point to
`coro.alloc`. With this change, coro.begin block above would look like:

  coro.begin:
    %phi = phi i8* [ %elide, %entry ], [ %alloc, %coro.alloc ]
    %frame = call i8* @llvm.coro.begin(i8* %phi, i8 %elide, i32 0, i8* null,
                                                 ^^^^^^^^^         i8* null)

All feedback and comments are greatly appreciated!!!

Thank you,
Gor

References:
===========

[1] http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0057r4.pdf
[2] http://llvmweekly.org/issue/95 (Coroutine support added to clang)
[3] http://www.boost.org/doc/libs/1_61_0/libs/coroutine/doc/html/index.html
[4] https://github.com/GorNishanov/llvm/blob/coro-rfc/docs/Coroutines.rst