[llvm-dev] Fwd: [RFC] LLVM Coroutines

Wed Jun 8 22:57:26 PDT 2016

Hi all:

Below is a proposal to add experimental coroutine support to LLVM. Though this
proposal is motivated primarily by the desire to support C++ Coroutines [1],
the llvm representation is language neutral and can be used to support
coroutines in other languages as well.

Clang + llvm coroutines allows you to take this code:

  generator<int> range(int from, int to) {
     for(int i = from; i < to; ++n)
        co_yield i;
  }
  int main() {
     int sum = 0;
     for (auto v: range(1,100))
        sum += v;
     return sum;
  }

And translate it down to this:

  define i32 @main() #5 {
  entry:
    ret i32 4950
  }

I prototyped llvm changes to support this proposal and extended clang coroutine
implementation [2] to emit proposed intrinsics to smoke test the proposed
design and to get simple generator and async task style coroutines working.
See "More Attention Required" in the doc/Coroutines.rst for details on what
addition work is required beyond cleanup and bug fixes.

I would like to start the discussion on the overall design, representation and
a roadmap to start upstreaming the changes with the intention to continue the
development in-tree incrementally.

Roadmap:
--------
1) Get agreement on coroutine representation and overall direction (this mail).
  .. repeat 1) until happy
2) Get agreement on how coroutine transformation passes integrate into the
   optimizer pipeline.
  .. repeat 2) until happy
3) update IR/Intrinsics.td + doc/Coroutines.rst + doc/index.rst
4) trickle in coroutine transformation passes + tests in digestible chunks.
5) get clang changes in (sometime after step 3).
6) fix bugs / remove limitations / add functionality to coroutine passes
 .. repeat 6) until happy

Background
==========
Coroutine representation is the result of several offline discussions
with Richard Smith,
Reid Kleckner, and David Majnemer. All of the good came from the
collective effort.
All the yuckiness was sneaked in by me after those discussions.

I'll start with a quick overview.

LLVM coroutines are functions that have one or more `suspend points`.
When a suspend point is reached, the execution of a coroutine is suspended and
control is returned back to its caller. A suspended coroutine can be resumed
to continue execution from the last suspend point or be destroyed.

In the following example, function `f` (which may or may not be a
coroutine itself)
returns a handle to a suspended coroutine (**coroutine handle**) that is used
by `main` to resume the coroutine twice and then destroy it:

.. code-block:: llvm

  define i32 @main() {
  entry:
    %hdl = call i8* @f(i32 4)
    call void @llvm.experimental.coro.resume(i8* %hdl)
    call void @llvm.experimental.coro.resume(i8* %hdl)
    call void @llvm.experimental.coro.destroy(i8* %hdl)
    ret i32 0
  }

In addition to the function stack frame which exists when a coroutine
is executing,
there is an additional region of storage that contains values that keep the
coroutine state when a coroutine is suspended. This region of storage
is called **coroutine frame**. It is created when a coroutine is called. It is
destroyed when a coroutine runs to completion or destroyed by a call to
the `coro.destroy`_ intrinsic. Unlike stackful coroutines [3] which maintains a
stack per coroutine, an llvm coroutine frame only contains the values that need
to persist across suspend points (there is a path from a use to the definition
that crosses a suspend point).

Overall idea is that a coroutine is presented to an llvm as an ordinary function
with suspension points marked up with intrinsics. We let the optimizer party
on the coroutine for awhile. Shortly before the coroutine is eligible to be
inlined into its callers, we outline parts of the coroutine that correspond to
the code that needs to get executed after the coroutine is resumed or destroyed.
The coroutine resumption intrinsics get replaced with direct calls to those
outlined functions where possible. Then inliner can inline much leaner and nicer
coroutine or any of the outlined parts as desired.

If we discover that the lifetime of a coroutine is fully enclosed in
the lifetime
of the caller, we remove dynamic allocation for coroutine frame and replace it
with an `alloca` on the caller's frame.

Please see doc/Coroutines.rst for more details:

doc/Coroutines.rst: http://reviews.llvm.org/D21170
IR/Intrinsics.td:   http://reviews.llvm.org/D21169

Concerns:
=========
(This section assumes that you looked at doc/Coroutins.rst and IR/Intrinsics.td)

Invoke like intrinsics via call + conditional branch:
-----------------------------------------------------

The `llvm.experimental.coro.fork` and `llvm.experimental.coro.suspend`
intrinsics model behavior somewhat similar to the invoke instruction. Namely,
they are used to show a normal continuation branch and an alternative branch.

They are used in the following pattern:

    %where.to = call i1 @llvm.experimental.coro.XXXX()
    br i1 %where.to, label %normal, label %alternative

My concern is that this is somewhat fragile. After optimizations some code can
sneak in between the intrinsic and the branch and cause trouble. I am looking
for suggestions on how to make it more robust. One thought I had was to put some
intrinsics at the beginning of the %normal and %alternative blocks so as to
prevent code motion above them. Something like:

  ...
    %where.to = call i1 @llvm.experimental.coro.XXXX()
    br i1 %where.to, label %normal, label %alternative
  normal:
   <phi>
   call i1 @llvm.experimental.coro.normal();
  ...
  alternative:
   <phi>
   call i1 @llvm.experimental.coro.alt();
  ...

After coro.xxx is lowered, coro.normal and coro.alt are removed.

Looking up outlined coroutine parts
-----------------------------------

Last parameter of the `llvm.experimental.coro.init" intrinsic is a pointer to
a coroutine function. To get to the outlined parts, I get the name of that
function and glue ".resume", ".destroy" or ".cleanup" suffixes to it and then
look them up by name.

Probably a better alternative is for a coroutine f to create a constant
containing an array of pointers to f.resume, f.destroy and other parts of the
coroutine and use the last parameter to refer to that constant.

Other questions
---------------

I have more questions about the best way of plugging in coroutine optimization
passes into the PassManager, but, I will defer those for later discussion.

All the feedback and comments is greatly appreciated!!!

Thank you,
Gor

References:
===========

[1] http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0057r4.pdf
[2] http://llvmweekly.org/issue/95 (Coroutine support added to clang)
[3] http://www.boost.org/doc/libs/1_61_0/libs/coroutine/doc/html/index.html

Review Links:
=============

doc/Coroutines.rst: http://reviews.llvm.org/D21170
IR/Intrinsics.td:   http://reviews.llvm.org/D21169