[PATCH] D157070: [C++20] [Coroutines] Introduce `@llvm.coro.opt.blocker` to block the may-be-incorrect optimization for awaiter

Thu Aug 10 19:52:53 PDT 2023

ChuanqiXu added a comment.

In D157070#4577996 <https://reviews.llvm.org/D157070#4577996>, @rjmccall wrote:

> Hmm.  This is quite interesting.  So we've got two things going on that aren't quite kosher from an LLVM perspective:
>
> - The `alloca`s in the coroutine are potentially deallocated whenever there's a suspend.  Normally this isn't a problem because code in the coroutine can't run when that happens, but:
> - The call to `await_suspend` is not really part of the coroutine execution, and this is really sneakily important in a lot of ways.

Yes, exactly. And for the second point, it is stated formally that the coroutine is considered as suspended after `await_ready` returns false: http://eel.is/c++draft/expr.await#5.1. And we already **considered** this when designing coroutines intrinsics: this is the job of `@llvm.coro.save`: https://llvm.org/docs/Coroutines.html#llvm-coro-save-intrinsic. I feel we just had an oversight before.

> One of those ways that I, at least, hadn't realized before: the local variables of the `await_suspend` call *must not be allocated into the coroutine frame*, because otherwise they can be deallocated before they're allowed to be.  This works out today because inlining adds lifetime markers to the `alloca`s of the inlined function, and coroutine frame lowering detects when an `alloca` has lifetime markers that don't cross a suspend and leaves it as an `alloca` in the split function.  But with `await_suspend` we're semantically reliant on that; as long as we can inline `await_suspend` into an unlowered coroutine, then if we have any gaps in that "optimization" at all, it can cause a miscompile.

Yes and I don't feel it is scary. The semantics here are clear. So if a miscomputation of such cases happens, it implies that there is a bug either in CoroFrame or in the inlining pass. We'll know how to handle them.

> The awaiter temporary has to go into the coroutine frame: its formal lifetime lasts until the end of the full-expression in the coroutine that contains the `co_await`, and of course we also have to call `await_resume()` after resuming from suspension.  So necessarily `await_suspend` has to handle the `this` object being destroyed during the execution of the method: e.g. anything it does after enqueuing the coroutine to be asynchronously resumed is unordered (racing) with the destruction of `this`.  If we inline `await_suspend` — which in general we'd really like to do because a lot of its data flow is just local to the coroutine — we have to be careful about how that's optimized, which ends up being the direct bug here.  But we *also* have to be careful about how anything it might store a reference to might be optimized: the awaiter could captured a pointer to some other local variable from the coroutine, and that variable will be also apparently be deallocated during its `alloca` lifetime.

Yes and while I don't think we have any gaps here, I just want to note that there cases we can choose to not put the awaiter to the coroutine frame: in case we inlined `await_ready`, `await_suspend` and `await_resume` and there is no non-static data members  used in `await_resume`.
Such cases are qutie common. e.g., the `std::await_suspend`; Of course, it may be always safe to put the awaiter to the coroutine frame.

> If marking an `alloca` as escaping (as your current patch does) is really sufficient to fix this bug, we should be good.  For example, it implicitly fixes the problem with captured pointers to other `alloca`s just by the nature of transitive escapes: if a pointer to the awaiter can escape, and the awaiter may be storing a pointer to some other variable A, then the pointer to A can also escape.  I'm concerned that marking an `alloca` as escaping might not be sufficient, though.  The optimizer does have to be a lot more conservative about optimizing accesses to an escaped `alloca`, but there are still *some* optimizations it can do because they can't be detected by well-behaved code.  For example, if the optimizer sees a conditional store to an `alloca` followed by a return, it could certainly still make it unconditional (or eliminate it entirely) — the memory is guaranteed to exist, and there's no valid way to observe the store because there are no memory accesses or synchronizations before it goes out of scope.  In the example, what actually blocks the optimization is that the store is followed by an intrinsic call, which the optimizer has to pessimistically assume can observe the store through the escaped pointer.  So I'm worried that just marking the variable as having escaped isn't sufficient to prevent all mis-optimization here, and the analysis of why it probably won't happen seems really brittle.  Maybe it's fine, though.

I think we need to believe in other optimizations for the cases you mentioned. Otherwise, not only the design of LLVM coroutines is potentially broken, I feel the LLVM itself is not stable too. Since the cases you said are not limited to coroutines.  So I feel we had to believe other optimizations and if we meet any other bugs, let's fix them.

> I'm wondering if we might need to keep the `await_suspend` call outlined somehow until after coroutine splitting.

This works. But I feel this is more or less a policy issues. I mean the reason why we put some of the coroutine functionalities in the middle end is that we want to get benefits from the middle end optimizations. And in fact, we got it. We got much better performances than GCC within coroutines. Of course, the other side of the decision is that we meet many more middle end bugs than GCC. And in fact, there are many bugs can be fixed automatically by putting CoroSplit pass in the front of some other certain optimization passes. But this is what we didn't do. We always tried to find the solution with minimal impact.

> What does the coroutine IR actually look like here?  We must be setting up the coroutine frame for the suspension before the call to `await_suspend`, or else it won't be properly ordered before whatever dispatch we do there; but the actual suspension point must come *after* the call to `await_suspend`, or the optimizer's understanding of the control flow will be completely messed up.

The actual IR generated here before inlining is:

  await.suspend:                                    ; preds = %init.ready
    %10 = call token @llvm.coro.save(ptr null)
    call void @llvm.lifetime.start.p0(i64 8, ptr nonnull %ref.tmp11) #3
    %call14 = call ptr @_ZNSt7__n486116coroutine_handleIN6MyTask12promise_typeEE12from_addressEPv(ptr noundef %4)
    %call18 = call ptr @_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE(ptr noundef nonnull align 8 dereferenceable(9) %ref.tmp7, ptr %call14)
    store ptr %call18, ptr %ref.tmp11, align 8
    %call20 = call noundef ptr @_ZNKSt7__n486116coroutine_handleIvE7addressEv(ptr noundef nonnull align 8 dereferenceable(8) %ref.tmp11) #3
    call void @llvm.lifetime.end.p0(i64 8, ptr nonnull %ref.tmp11) #3
    %11 = call ptr @llvm.coro.subfn.addr(ptr %call20, i8 0)
    call fastcc void %11(ptr %call20) #3
    %12 = call i8 @llvm.coro.suspend(token %10, i1 false)
    switch i8 %12, label %coro.ret [
      i8 0, label %await.ready
      i8 1, label %cleanup21
    ]

By the semantics of https://llvm.org/docs/Coroutines.html#llvm-coro-save-intrinsic, the call to `@llvm.coro.save` implies that we've entered the suspension state. And actually, the `@llvm.coro.save` will be lowered into an update of the index of the coroutine frame. So as far as I understood, it looks fine basically.

And for the issue itself, the IR after inlining looks like:

  await.suspend:                                    ; preds = %init.ready
    %10 = call token @llvm.coro.save(ptr null)
    call void @llvm.lifetime.start.p0(i64 8, ptr nonnull %ref.tmp11) #3
    %suspended.i = getelementptr inbounds %struct.Awaiter, ptr %ref.tmp7, i64 0, i32 1
    store i8 1, ptr %suspended.i, align 8, !tbaa !5
    %11 = load ptr, ptr %ref.tmp7, align 8, !tbaa !11
    %call.i = call ptr @_ZN13SomeAwaitable8RegisterENSt7__n486116coroutine_handleIvEE(ptr noundef nonnull align 1 dereferenceable(1) %11, ptr %4) #3
    %tobool.i.not.i = icmp eq ptr %call.i, null
    br i1 %tobool.i.not.i, label %if.then.i, label %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit

  if.then.i:                                        ; preds = %await.suspend
    store i8 0, ptr %suspended.i, align 8, !tbaa !5
    br label %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit

  _ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit: ; preds = %await.suspend, %if.then.i
    %retval.sroa.0.0.i = phi ptr [ %4, %if.then.i ], [ %call.i, %await.suspend ]
    store ptr %retval.sroa.0.0.i, ptr %ref.tmp11, align 8
    %12 = load ptr, ptr %ref.tmp11, align 8, !tbaa !12
    call void @llvm.lifetime.end.p0(i64 8, ptr nonnull %ref.tmp11) #3
    %13 = call ptr @llvm.coro.subfn.addr(ptr %12, i8 0)
    call fastcc void %13(ptr %12) #3
    %14 = call i8 @llvm.coro.suspend(token %10, i1 false)
    switch i8 %14, label %coro.ret [
      i8 0, label %await.ready
      i8 1, label %cleanup21
    ]

  await.ready:                                      ; preds = %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit, %init.ready
    %suspended.i43 = getelementptr inbounds %struct.Awaiter, ptr %ref.tmp7, i64 0, i32 1
    %15 = load i8, ptr %suspended.i43, align 8, !tbaa !5, !range !14, !noundef !15
    %tobool.not.i = icmp eq i8 %15, 0
    br i1 %tobool.not.i, label %if.then.i44, label %_ZN7Awaiter12await_resumeEv.exit

Here the `%11` is the `this` pointer of the awaiter. And the IR is correct so far. The potentially on-the-frame variable `%suspended.i` (awaiter->suspended) may only be accessed if the coroutine is not destroyed. The semantics comes from the programmer. Then the optimizer found that the value of the variable  `%suspended.i (awaiter->suspended)` is consistent with nullness of the return value of `_ZN13SomeAwaitable8RegisterENSt7__n486116coroutine_handleIvEE ` and the awaiter is not escaped. So the optimizer choose to optimize the variable  `%suspended.i (awaiter->suspended)` out.

  await.suspend:                                    ; preds = %init.ready
    %9 = call token @llvm.coro.save(ptr null)
    %call.i = call ptr @_ZN13SomeAwaitable8RegisterENSt7__n486116coroutine_handleIvEE(ptr noundef nonnull align 1 dereferenceable(1) %7, ptr %4) #3
    %tobool.i.not.i = icmp eq ptr %call.i, null
    br i1 %tobool.i.not.i, label %if.then.i, label %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit

  if.then.i:                                        ; preds = %await.suspend
    br label %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit

  _ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit: ; preds = %await.suspend, %if.then.i
    %ref.tmp7.sroa.5.0 = phi i8 [ 0, %if.then.i ], [ 1, %await.suspend ]
    %retval.sroa.0.0.i = phi ptr [ %4, %if.then.i ], [ %call.i, %await.suspend ]
    %10 = call ptr @llvm.coro.subfn.addr(ptr %retval.sroa.0.0.i, i8 0)
    call fastcc void %10(ptr %retval.sroa.0.0.i) #3
    %11 = call i8 @llvm.coro.suspend(token %9, i1 false)
    switch i8 %11, label %coro.ret [
      i8 0, label %await.ready
      i8 1, label %cleanup21
    ]

  await.ready:                                      ; preds = %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit, %init.ready
    %ref.tmp7.sroa.5.1 = phi i8 [ %8, %init.ready ], [ %ref.tmp7.sroa.5.0, %_ZN7Awaiter13await_suspendENSt7__n486116coroutine_handleIvEE.exit ]
    %tobool.not.i = icmp eq i8 %ref.tmp7.sroa.5.1, 0
    br i1 %tobool.not.i, label %if.then.i44, label %_ZN7Awaiter12await_resumeEv.exit

Then after a trivial optimization (SimplifyCFG):

  coro.init:
    %5 = call token @llvm.coro.save(ptr null)
    %call.i = call ptr @_ZN13SomeAwaitable8RegisterENSt7__n486116coroutine_handleIvEE(ptr noundef nonnull align 1 dereferenceable(1) %ref.tmp8, ptr %4) #3
    %tobool.i.not.i = icmp eq ptr %call.i, null
    %ref.tmp7.sroa.5.0 = select i1 %tobool.i.not.i, i8 0, i8 1
    %retval.sroa.0.0.i = select i1 %tobool.i.not.i, ptr %4, ptr %call.i
    %6 = call ptr @llvm.coro.subfn.addr(ptr %retval.sroa.0.0.i, i8 0)
    call fastcc void %6(ptr %retval.sroa.0.0.i) #3
    %7 = call i8 @llvm.coro.suspend(token %5, i1 false)
    switch i8 %7, label %coro.ret [
      i8 0, label %await.ready
      i8 1, label %cleanup21
    ]

  await.ready:                                      ; preds = %coro.init
    %tobool.not.i = icmp eq i8 %ref.tmp7.sroa.5.0, 0
    br i1 %tobool.not.i, label %if.then.i44, label %_ZN7Awaiter12await_resumeEv.exit

  if.then.i44:                                      ; preds = %await.ready
    call void @_Z12DidntSuspendv() #3
    br label %_ZN7Awaiter12await_resumeEv.exit

Now the value `%tobool.not.i ` is used across suspend points so that it belongs to the coroutine frame. Then we're accessing the coroutine frame unconditionally after the call to `@_ZN13SomeAwaitable8RegisterENSt7__n486116coroutine_handleIvEE `. Then boom after the coroutine handle got destroyed in that foreign functions. Here is the complete story.

---

So my personal summary is that we've already handled the case of await_suspend by `@llvm.coro.save`. But we didn't leak the awaiter before the `await_suspend` and the other optimization doesn't know the relationship between the awaiter and the coroutine handle. Then here is the problem. So I feel we can solve the issue simply after we mark the awaiter as escaped.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D157070/new/

https://reviews.llvm.org/D157070