[PATCH] D97915: [Coroutines] Handle overaligned frame allocation

Sun Mar 7 16:13:40 PST 2021

lewissbaker added inline comments.

================
Comment at: clang/lib/CodeGen/CGBuiltin.cpp:4450
+    Value *Extra = Builder.CreateSelect(Cmp, Diff, Zero);
+    return RValue::get(Builder.CreateAdd(SizeCall, Extra));
   }
----------------
rjmccall wrote:
> lewissbaker wrote:
> > ychen wrote:
> > > rjmccall wrote:
> > > > ychen wrote:
> > > > > ychen wrote:
> > > > > > rjmccall wrote:
> > > > > > > Okay, so you're implicitly increasing the coroutine size to allow you to round up to get an aligned frame.  But how do you round back down to get the actual pointer that you need to delete?  This just doesn't work.
> > > > > > > 
> > > > > > > You really ought to just be using the aligned `operator new` instead when the required alignment is too high.  If that means checking the alignment "dynamically" before calling `operator new` / `operator delete`, so be it.  In practice, it will not be dynamic because lowering will replace the `coro.align` call with a constant, at which point the branch will be foldable.
> > > > > > > 
> > > > > > > I don't know what to suggest if the aligned `operator new` isn't reliably available on the target OS.  You could outline a function to pick the best allocator/deallocator, I suppose.
> > > > > > Thanks for the review!
> > > > > > 
> > > > > > > Okay, so you're implicitly increasing the coroutine size to allow you to round up to get an aligned frame. But how do you round back down to get the actual pointer that you need to delete? This just doesn't work.
> > > > > > 
> > > > > > Hmm, you're right that I missed the `delete` part, thanks. The adjusted amount is constant, I could just dial it down in `coro.free`, right?
> > > > > > 
> > > > > > >  You really ought to just be using the aligned operator new instead when the required alignment is too high. If that means checking the alignment "dynamically" before calling operator new / operator delete, so be it. In practice, it will not be dynamic because lowering will replace the coro.align call with a constant, at which point the branch will be foldable.
> > > > > >  
> > > > > > That's my intuition at first. But spec is written in a way suggesting (IMHO) that the aligned version should not be used? What if the user specify their own allocator, then which one they should use?
> > > > > Sorry, I meant the adjusted amount is coro.align - std::max_align_t,  I could subtract it in `coro.free` . I think it should work?
> > > > No, because the adjustment you have to do in `coro.alloc` isn't just an addition, it's an addition plus a mask, which isn't reversible.  Suppose the frame needs to be 32-byte-aligned, but the allocator only promises 8-byte alignment.  The problem is that when you go to free a frame pointer, and you see that it's 32-byte-aligned (which, again, it always will be), the pointer you got from the allocator might be the frame pointer minus any of 8, 16, or 24 — or it might be exactly the same.  The only way to reverse that is to store some sort of cookie, either the amount to subtract or even just the original pointer.
> > > > 
> > > > Now, if you could change the entire coroutine ABI, you could make the frame handle that you pass around be the unadjusted pointer and then just repeat the adjustment every time you enter the coroutine.  But that doesn't work because the ABI relies on things like the promise being at a reliable offset from the frame handle.
> > > > 
> > > > I think the best solution would be to figure out a way to use an aligned allocator, which at worst does this in a more systematic way and at best can actually just satisfy your requirement directly without any overhead.  If you can't do that, adding an offset to the frame would be best; if you can't do that, doing it as a cookie is okay.
> > > > 
> > > > > That's my intuition at first. But spec is written in a way suggesting (IMHO) that the aligned version should not be used? What if the user specify their own allocator, then which one they should use?
> > > > 
> > > > It seems like a spec bug that this doesn't use aligned allocators even when they're available.  If there's an aligned allocator available, I think this should essentially do dynamically what it would normally do statically, i.e.:
> > > > 
> > > > ```
> > > > void *allocation = alignment > __STDCPP_DEFAULT_NEW_ALIGNMENT__ ? operator new(size, align_val_t(alignment)) : operator new(size);
> > > > ```
> > > > 
> > > > This would ODR-use both allocation functions, of course.
> > > > 
> > > > Maybe it's right to do this cookie thing if we can't rely on an aligned allocator, like if the promise class provides only an `operator new(size_t)`.
> > > > No, because the adjustment you have to do in `coro.alloc` isn't just an addition, it's an addition plus a mask, which isn't reversible.  Suppose the frame needs to be 32-byte-aligned, but the allocator only promises 8-byte alignment.  The problem is that when you go to free a frame pointer, and you see that it's 32-byte-aligned (which, again, it always will be), the pointer you got from the allocator might be the frame pointer minus any of 8, 16, or 24 — or it might be exactly the same.  The only way to reverse that is to store some sort of cookie, either the amount to subtract or even just the original pointer.
> > > 
> > > I got myself confused. This makes perfect sense.
> > > 
> > > > 
> > > > Now, if you could change the entire coroutine ABI, you could make the frame handle that you pass around be the unadjusted pointer and then just repeat the adjustment every time you enter the coroutine.  But that doesn't work because the ABI relies on things like the promise being at a reliable offset from the frame handle.
> > > > 
> > > > I think the best solution would be to figure out a way to use an aligned allocator, which at worst does this in a more systematic way and at best can actually just satisfy your requirement directly without any overhead.  If you can't do that, adding an offset to the frame would be best; if you can't do that, doing it as a cookie is okay.
> > > > 
> > > 
> > > This is very helpful. I'll explore the adding offset to the frame option first. If it is not plausible, I'll use the cookie method. Thanks!
> > > 
> > There was a proposal to extend the coroutine specification with support for the align_val_t overloads of operator new() when allocating coroutine frames. 
> > See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2014r0.pdf
> > 
> > Unfortunately this was not adopted at the time as it was proposed late in the C++20 cycle and there was not yet any implementation experience.
> > 
> > So for now, if the compiler determines that the frame needs to be aligned to a value greater than the default alignment of global operator new it will need to overallocate, align the frame within that buffer and store the offset applied somewhere so that the it can reconstruct the address of the pointer returned from operator new() so that it can pass it to operator delete() on coroutine_handle::destroy().
> > 
> > Note that there was also a meeting to discuss ABI for coroutine frames with the intent that the major coroutines implementations would all (eventually) agree on a compatible coroutine ABI.
> > 
> > The results of this meeting was written up in the doc https://docs.google.com/document/d/1t53lAuQNd3rPqN-VByZabwL6PL2Oyl4zdJxm-gQlhfU/edit?usp=sharing
> > 
> > The end result was that we decided to place any padding needed to align the promise before the resume/destroy function pointers rather than place that padding in-between the function-pointers and the promise. The rationale here being that we can then calculate the address of the promise as a constant offset from the frame address (typically at an offset of two pointers into the frame) rather than the offset being variable depending on the promise type's alignment. This should help building of certain tooling / debuggers / walking async stack-traces etc. as we don't need to know the exact promise type to be able to determine the location of the promise.
> > 
> > The compiler should know exactly how many bytes of padding was added at the start of the frame allocation to get to the frame address and so should be able to translate the coroutine frame address back to the allocation address before destruction - however this also may have an interplay with the support for overaligned frames (which may be required due to overaligned local variables/args and not only based on the promise-type), so I'm mentioning it here.
> > Note that there was also a meeting to discuss ABI for coroutine frames with the intent that the major coroutines implementations would all (eventually) agree on a compatible coroutine ABI.
> 
> Interesting!  Did you consider reaching out to the Itanium C++ ABI group, which has prior expertise in the area of standardizing C++ ABIs?
> 
> > The end result was that we decided to place any padding needed to align the promise before the resume/destroy function pointers rather than place that padding in-between the function-pointers and the promise.
> 
> That is an interesting choice.  Is deriving the address of the promise within the frame without knowing what the promise type is actually something that clients need to do?  It's not like coroutines carry any reflective information of the sort that exceptions do.
> 
> Anyway, okay.  So the function pointers are supposed to be "right-justified" so that the promise comes immediately afterwards, and the address point is supposed to point at the first function pointer.  That is not what Clang implements, or has ever implemented, but I don't foresee any serious problems in adjusting the LLVM coroutine frame layout code to honor that.
> 
> > The compiler should know exactly how many bytes of padding was added at the start of the frame allocation to get to the frame address and so should be able to translate the coroutine frame address back to the allocation address before destruction - however this also may have an interplay with the support for overaligned frames (which may be required due to overaligned local variables/args and not only based on the promise-type), so I'm mentioning it here.
> 
> Yes, this is basically only true of the adjustment you're talking about for the frame header with overaligned promises.  Barring miracles, the only reasonable way to allocate one of these overaligned-promise frames is to round down to the next promise-alignment boundary and then allocate that, and that offset is indeed static given only of the promise type's alignment.  But the larger allocation can still exceed allocator alignment, whether from the promise type or just local coroutine state, and that extra offset down to the allocated pointer will be dynamic.  Fortunately, I don't think going from the frame pointer to the allocated pointer needs is an ABI-exposed operation, since the frame can only be destroyed by the coroutine itself.  That means the details of allocation are essentially entirely implementation-private, and that includes how we adjust the frame pointer for deallocation.  Am I missing something?
> 
> > Unfortunately this was not adopted at the time as it was proposed late in the C++20 cycle and there was not yet any implementation experience.
> 
> Hmm.  The role of implementation experience here would have been to point out that you hadn't considered over-alignment in your specification.  It sounds more like you were running out of time to write the specification and just punted on an issue in the interests of getting the proposal into the standard.  Regardless, it seems to me that the obviously correct design is that, if both an aligned and and an unaligned allocation function is available, it's unspecified which one is called, as long as the matching deallocation function is then called later.  Are you suggesting that we *must not* do that?
> Interesting! Did you consider reaching out to the Itanium C++ ABI group, which has prior expertise in the area of standardizing C++ ABIs?

I don't recall if they were contacted, although I believe we did discuss doing so at the time.
Maybe @GorNishanov remembers?

There were LLVM devs (mainly from Google), GCC compiler devs and MS compiler team involved in the discussion.

> Is deriving the address of the promise within the frame without knowing what the promise type is actually something that clients need to do?

This is something that I have found a desired to be able to do when implementing async stack trace walking.

At the moment I ended up having to store basically two pointers to the parent coroutine-frame - one a coroutine_handle<void> so I can resume the parent coroutine and another pointer to an AsyncStackFrame stored within the promise so that I can walk to the parent frame.

I would ideally only like to have to store the coroutine_handle and be able to determine from its address the address of the coroutine_handle pointing to the next coroutine-frame.
e.g. by assuming that the promise_type has the continuation as the first data-member.

At the moment I can't make this assumption because the offset from the coroutine_handle::address() to the promise might be variable depending on the concrete promise_type's alignment.

> That means the details of allocation are essentially entirely implementation-private, and that includes how we adjust the frame pointer for deallocation. Am I missing something?

I agree with your analysis here.

> The role of implementation experience here would have been to point out that you hadn't considered over-alignment in your specification.

Yes, this issue was identified, albeit fairly late, during the standardisation process as a result of implementation experience and raised and discussed.

> It sounds more like you were running out of time to write the specification and just punted on an issue in the interests of getting the proposal into the standard

The proposal was already in the standard at the time the issue was identified.

The problem was more that there were a couple of options for how to specify it and we didn't have any implementation experience of either way to inform which way should be chosen. So, yes, we effectively punted the decision until later. My preference is the more complicated design of "Option 1" described in P2014.

>  it seems to me that the obviously correct design is that, if both an aligned and and an unaligned allocation function is available, it's unspecified which one is called, as long as the matching deallocation function is then called later

Yes, I think this is pretty close to what "Option 1" describes, although I think it describes a slightly more involved overload resolution for deallocation functions than just "call the matching deallocation function".

> Are you suggesting that we *must not* do that?

The current specification in C++20 only says that `operator new(size_t)` overloads are called. So a change that caused it to call `operator new(size_t, align_val_t)` overloads would be a non-conforming extension to C++20, although possibly one that users would be happy to have.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D97915/new/

https://reviews.llvm.org/D97915