[llvm-dev] [RFC] LLVM Coroutines

Gor Nishanov via llvm-dev llvm-dev at lists.llvm.org
Wed Jun 15 07:10:37 PDT 2016


Hi Sanjoy, Eli:

I was thinking a little bit more about your comments related to the optimizer
moving code up and down, and I think I can generalize the problem.

Background
==========

A coroutine is transformed into a state machine just before it becomes eligible
to be inlined into its caller, so that the body of the coroutine is simplified
as much as possible before we decide what needs to go into a coroutine frame
and before we separate the coroutine into start, resume and destroy parts in
the CoroSplit pass.

      SCC_Inliner - FuncPass - FuncPass - ... - SCC_CoroSplit
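
For reference, a rough sketch of the shape CoroSplit is expected to produce for
a coroutine @f (names and signatures here are illustrative only):

      declare T @f(...)                       ; start part: runs until the first suspend
      declare void @f.resume(%f.frame* %fp)   ; resume part: continues from the last suspend
      declare void @f.destroy(%f.frame* %fp)  ; destroy part: cleans up and frees the frame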

The Problem
===========

After we form the coroutine frame, we replace 'uses' with loads from the
coroutine frame and insert spills at the 'defs' of those values that must live
on the coroutine frame. Before we do this, the optimizer has no idea that the
'defs' and 'uses' have any relationship with the code that allocates and frees
the coroutine frame, and this unexpressed dependency is the root of the
problem.
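
For example (a minimal sketch; the frame type, field index and function names
are illustrative):

        ; before CoroSplit: %x is live across a suspend point
        %x = call i32 @produce()            ; 'def'
        ...suspend...
        call void @consume(i32 %x)          ; 'use'

        ; after CoroSplit: spill at the 'def', reload at the 'use'
        %x = call i32 @produce()
        %x.addr = getelementptr %f.frame, %f.frame* %frame, i32 0, i32 1
        store i32 %x, i32* %x.addr          ; spill into the coroutine frame
        ...suspend...
        %x.reload = load i32, i32* %x.addr  ; reload from the coroutine frame
        call void @consume(i32 %x.reload)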

IR emitted from the frontend can be thought of as having several parts,
separated by the coro.init, coro.fork and coro.delete intrinsics.


        Allocate Frame Code
    ------ coro.init --------------------- [1]
        Init Code
    ------ coro.fork --------------------- [2]
            |        \
            |         \
            |          Body Code
            |     ---- coro.delete ------- [3]
            |       Deallocate Frame Code
            |     ------ coro.end --------
            |
            |
        Return Block Code

Before CoroSplit runs, we need to make sure that:

1. No instruction from below coro.init can move above it, as the coroutine
   frame is not formed yet.

2. No instruction from the "Init Code" can move into the "Return Block Code",
   since the Init Code may access the coroutine frame, which may already be
   deleted by the time we get to the return block.

3. No instruction from above coro.delete can move below it, since after
   coro.delete the coroutine frame is deallocated.
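
To make this concrete, here is a rough sketch of the skeleton in the style of
the examples quoted below (intrinsic arguments and types are simplified):

        T coro() {
            %mem = malloc(...)          // Allocate Frame Code
            %s = coro.init(%mem)        // ---- [1] nothing below may move above this
            ...                         // Init Code (may touch the coroutine frame)
            if (coro.fork()) {          // ---- [2]
        ReturnBB:
                // Return Block Code: Init Code must not sink into here
                return ...
            }
            Body Of The Coroutine
        DeleteBB:
            coro.delete(...)            // ---- [3] nothing above may move below this
            free(%mem)                  // Deallocate Frame Code
            coro.end(true)
            UNREACHABLE
        }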


The Solution
============

Eli suggested it earlier, and now I am dialing it up to 11.
Code as emitted by the frontend is in the correct order. We will outline parts
of the coroutine in a CoroEarly pass that runs at EP_EarlyAsPossible, to
prevent any code motion across the barriers described earlier.

The parts will probably need to be marked noinline, so that the inliner won't
put them back together before CoroSplit does the transformation. In CoroSplit,
I may inline them back as appropriate.
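
A minimal sketch of the idea (which regions get outlined, and the names of the
outlined parts, are illustrative; the point is only that the code is hidden
behind noinline calls until CoroSplit):

        T coro() {
            %mem = malloc(...)
            %s = coro.init(%mem)
            coro.part.init(%s)              // outlined "Init Code", marked noinline
            if (coro.fork()) {
        ReturnBB:
                return coro.part.ret(...)   // outlined "Return Block Code", marked noinline
            }
            Body Of The Coroutine
        DeleteBB:
            coro.delete(...)
            coro.part.dealloc(%mem)         // outlined "Deallocate Frame Code", marked noinline
            coro.end(true)
            UNREACHABLE
        }

Because the parts are opaque noinline calls, no pass between CoroEarly and
CoroSplit can move code across the barriers [1]-[3].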

I will think more and experiment with this approach.
Any thoughts and suggestions are appreciated.

Gor


On Tue, Jun 14, 2016 at 6:22 PM, Gor Nishanov <gornishanov at gmail.com> wrote:
> Hi Sanjoy,
>
>>> I'm not familiar with fiber-type APIs, but I assume fiber_fork is like
>>> setjmp, in that it can "return twice"?
>
> Yes, user-mode stack switching APIs are somewhat similar to setjmp. Here are
> links to a doc page and implementation, just in case you are curious:
>
> http://www.boost.org/doc/libs/1_59_0/libs/context/doc/html/context/context.html
> http://svn.boost.org/svn/boost/trunk/libs/context/src/asm/jump_x86_64_ms_pe_masm.asm
> http://svn.boost.org/svn/boost/trunk/libs/context/src/asm/jump_x86_64_sysv_elf_gas.S
>
>>> If so, I'd urge you to try to not rely on that kind of behavior. LLVM has
>>> unresolved issues around modeling setjmp like behavior
>
> Absolutely! This was just a thought experiment about how one could implement
> the intrinsics in a library. One of the benefits of compiler-based coroutines
> is that they let you avoid all of that mess. Suspend is just a 'ret void',
> resume is simply a direct (or indirect) call, and the coroutine state is tiny
> (assuming you don't use 64K arrays as local variables :-) )
>
>>>    DeleteBB:
>>>        (*%t) = 15 ; // fine, copy of %t on a fiber stack updated
>>>        free(%mem)
>>>        (*%t) = 20 ; // undef: as the fiber memory is freed already
>>>        coro.end(true)
>>>        UNREACHABLE
>>>
>>> Say there was no (*%t) = 20. There is nothing so far that prevents
>>> sinking "(*%t) = 15" from before the free (where it is legal) to after
>>> the free (where it is not).  %t is an unescaped alloca, and LLVM can
>>> move around loads from and stores to these (as a first approximation)
>>> as it pleases.
>
> Good point. During CoroSplit I should sink CoroutineFrame deallocation
> all the way down to coro.end().
>
>>> So when you say "some_code_creating_T()" are there restrictions on
>>> what kind of code there could be as part of the creation logic (I've
>>> so far assumed that it can be arbitrary code).
>
> You are right. It *is* arbitrary code. In my reply to you earlier I had a
> particular use case in mind, where you should not touch anything in the
> coroutine frame in the return block, but that is a higher-level semantic that
> the frontend or library imbues the coroutine with, not something that LLVM
> enforces.
>
> The case I had in mind was a frontend generating an asynchronous coroutine
> without RAII semantics. Once launched, it will run driven by asynchronous
> completions, and when it runs to the end it will destroy itself.
>
> Such a coroutine can get suspended, get resumed by a different thread, run to
> completion and free the coroutine frame **before** the thread that created
> the coroutine manages to get to its return block. In such a coroutine, reading
> from or writing to the coroutine frame in the return block is undefined
> behavior.
>
> On the other hand, if it is a synchronous generator that always starts
> suspended, then any writes or reads from a coroutine frame in the
> return block are perfectly fine.
>
> Since LLVM does not know what kind of coroutine the user is trying to build,
> it should take a conservative approach. LLVM should not introduce writes or
> reads of the coroutine frame in the return block if they weren't there, but
> if the frontend put them there, LLVM should not replace them with
> `@llvm.trap`s and `undef`s either.
>
> Gor
>
> On Tue, Jun 14, 2016 at 1:25 PM, Sanjoy Das
> <sanjoy at playingwithpointers.com> wrote:
>> Hi Gor,
>>
>> On Sun, Jun 12, 2016 at 6:40 PM, Gor Nishanov <gornishanov at gmail.com> wrote:
>>> Quick answer: folding %val to 10 is OK, but I would prefer it to be 'undef',
>>> or even to catch it in the verifier, as it is a programmer/frontend error.
>>
>> Ok.
>>
>>>>> The thought experiment relevant here is that **could** you implement
>>>>> your semantics with reasonable bodies for the llvm.coro intrinsics?
>>>
>>> I can emulate **control flow** very closely with user mode thread switching
>>> APIs like make_context, jump_context. Memory behavior not so much.
>>>
>>> Library emulation
>>> =================
>>>
>>> Let's build a fiber abstraction out of make_context, jump_context.
>>>
>>> i1 fiber_fork(i8* %mem) - clones this function into a fiber
>>>                           returns 0 - in a fiber
>>>                           returns 1 - in a thread
>>>                           uses %mem to store fiber context + fiber stack
>>
>> I'm not familiar with fiber-type APIs, but I assume fiber_fork is like
>> setjmp, in that it can "return twice"?  If so, I'd urge you to try to
>> not rely on that kind of behavior.  LLVM has unresolved issues around
>> modeling setjmp like behavior, and if possible, it is preferable to
>> make all of the control flow explicit from the very beginning.
>>
>>> i1 fiber_suspend()      - suspends current fiber, switches back to thread
>>>                           stores the fiber context to be able to resume later
>>
>> This looks like a normal function call, except that the "return" may
>> be arbitrarily spaced out from the call?
>>
>>> void fiber_join(i8* %mem i1 %val) - switches to fiber stack identified by %mem,
>>>                           fiber_suspend will return %val when resumed
>>>
>>> void fiber_end() noreturn
>>>                         - switches back to a thread, does not remember fiber
>>>                           context and does not touch %mem.
>>>                           (Thus you can use this after %mem is freed)
>>>
>>>
>>> Detail: fiber_fork(i8* %mem)
>>> ----------------------------
>>>   * Prepares fcontext in %mem utilizing the rest of the memory for stack.
>>>   * Saves current thread context on a current thread stack.
>>>     (We should not store it in %mem, otherwise we won't be
>>>      able to switch back after fiber memory is freed)
>>
>> As I said above, these following two points can be problematic:
>>
>>>   * Sets up **current thread** context, so that when resumed it will proceed
>>>     as if fiber_fork returned true
>>>   * Sets up **fiber** context so that when resumed, the fiber will proceed
>>>     as if fiber_fork returned false.
>>
>>>
>>> Okay, let's map it to our intrinsics:
>>>
>>> coro.fork == fiber_fork
>>> coro.suspend == fiber_suspend
>>> coro.end(true) == fiber_end
>>> coro.resume = fiber_join(%mem, i1 false)
>>> coro.destroy = fiber_join(%mem, i1 true)
>>> coro.size == 1,000,000 (or some other big number)
>>>
>>> Now let's walk through this example in the fiber model:
>>>
>>>    T coro() {
>>>        %t = alloca
>>>        %mem = malloc(...);
>>>        (*%t) = 10
>>>        if (coro.fork()) { // switches to fiber at label Start
>>>    ReturnBB:
>>>          %val = *%t // will be 10, as fiber has its own copy
>>>          return some_code_creating_T(%val);
>>>        }
>>>    Start:
>>>        Body Of The Coroutine
>>>        With Suspends and All The Fun
>>>    DeleteBB:
>>>        (*%t) = 15 ; // fine, copy of %t on a fiber stack updated
>>>        free(%mem)
>>>        (*%t) = 20 ; // undef: as the fiber memory is freed already
>>
>> Say there was no (*%t) = 20. There is nothing so far that prevents
>> sinking "(*%t) = 15" from before the free (where it is legal) to after
>> the free (where it is not).  %t is an unescaped alloca, and LLVM can
>> move around loads from and stores to these (as a first approximation)
>> as it pleases.
>>
>>>        coro.end(true) // switches back to the thread at ReturnBB
>>>        UNREACHABLE // <== never reaches here. So we are honest with optimizer
>>>      }
>>>
>>>
>>> Back to the LLVM based implementation
>>> =====================================
>>>
>>> In your example, there will be a path from the block where %t is used to
>>> where %t is defined that crosses a suspend point. Therefore %t will be part of
>>> the coroutine frame. So before the split, the program will be transformed into:
>>>
>>>    coro.frame = type { int %t }
>>>
>>>    T coro() {
>>>        %t = alloca
>>>        %mem = malloc(...);
>>>        %s = (coro.frame*)coro.init(%mem)
>>>        s->%t = 10
>>>        if (coro.fork()) {
>>>    ReturnBB:
>>>          %val = 10 // const.prop as 10, OK, but really undef
>>>          return some_code_creating_T(%val);
>>>        }
>>>        Body Of The Coroutine
>>>        With Suspends and All The Fun
>>>    DeleteBB:
>>>        free(%mem)
>>>        s->%t = 20 //// BOOM writing to freed memory
>>>        coro.end(true)
>>>        UNREACHABLE
>>>      }
>>>
>>> Accessing anything in the coroutine frame after coro.fork returns "true" is
>>> undefined, as by the time you get to ReturnBB, the coroutine frame could
>>> already be destroyed.
>>
>> So when you say "some_code_creating_T()" are there restrictions on
>> what kind of code there could be as part of the creation logic (I've
>> so far assumed that it can be arbitrary code).  If it can't
>> (semantically) touch anything in the stack frame then why have it as
>> part of @coro() ?
>>
>> -- Sanjoy

