[llvm-dev] Guidance requested: Fixing missed optimization for coroutines

Adrian Vogelsgesang via llvm-dev llvm-dev at lists.llvm.org
Fri Jan 22 16:23:13 PST 2021


Small correction: The link is https://godbolt.org/z/Weod78 without the trailing “.”. It seems some regex interpreted the “.” at the end of the sentence as part of the URL.

From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Adrian Vogelsgesang via llvm-dev <llvm-dev at lists.llvm.org>
Date: Saturday, 23. January 2021 at 01:13
To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
Subject: [llvm-dev] Guidance requested: Fixing missed optimization for coroutines
Dear Clang community,

TL;DR: I came across a few missed optimizations for coroutines and would be happy to contribute a patch or an improvement. I need guidance, though, as I am new to coroutines in LLVM.

You can find the input C++ program at https://godbolt.org/z/Weod78.
The coroutines in that snippet can finish both synchronously and asynchronously. In case a coroutine finishes synchronously, I want to avoid the allocation of the coroutine frame.

Looking at the produced assembly, this snippet shows the following missed optimizations:
1. lines 12-14: The call to “constant12() [clone .destroy]” is not devirtualized
2. line 5: The coroutine frame for `constant12` is not elided (in a trivial, simple case without resumption points)
3. lines 15-16: The call to “.LNoopCoro.ResumeDestroy” is not devirtualized; it should be devirtualized and inlined
4. line 5: The coroutine frame for `sum` is not “conditionally” elided (less trivial)

I already did some digging in the coroutine-related optimization passes, and I think I identified the following root causes and potential solutions:


# CoroElide disabled for “own” coroutine frame

CoroElide.cpp, line 271 (see link [1]) explicitly disables the CoroElide pass for a coroutine's own frame. The CoroElide pass currently only modifies CoroIds which were inlined from other coroutines and leaves the function's own CoroId alone. Due to this, the call to “constant12() [clone .destroy]” is not devirtualized, and the coroutine frame cannot be elided. After removing this check and unconditionally applying CoroElide to all CoroIds, issues (1) and (2) from my example are fixed.

My question: Is this check necessary because the CoroElide pass would otherwise be incorrect? Or is it a performance optimization, i.e. CoroElide was not expected to be useful when applied to the function’s own CoroId and was hence disabled in this case?


# “@llvm.coro.subfn.addr” intrinsic not devirtualized if applied to constants

Issue (3), i.e. the call to “.LNoopCoro.ResumeDestroy” not being devirtualized, seems to be due to the usage of the “@llvm.coro.subfn.addr” intrinsic. CoroElide devirtualizes this intrinsic if it is applied to a “coro.begin” intrinsic. But there is no devirtualization for subfn.addr calls on constants. Afaict, the lowering in CoroCleanup happens too late in the pipeline, such that the remaining passes won’t remove the load.
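To make the situation concrete, here is a rough plain-C++ model of what a call through “coro.subfn.addr” amounts to once lowered. It only relies on the fact that a coroutine frame starts with the resume and destroy function pointers and that index 0 selects resume and index 1 selects destroy. All names (`FrameHeader`, `subfn_addr`, `noop_frame`, ...) are made up for illustration and are not LLVM APIs.

```cpp
// Hypothetical model of the start of a coroutine frame: the resume and
// destroy function pointers come first.
struct FrameHeader {
    void (*resume)(const FrameHeader*);
    void (*destroy)(const FrameHeader*);
};

using SubFn = void (*)(const FrameHeader*);

// Counter so that the effect of a call is observable.
static int noop_calls = 0;

// Stand-in for a shared resume/destroy function of a no-op coroutine.
static void noop_resume_destroy(const FrameHeader*) { ++noop_calls; }

// A frame that is a constant global, analogous to the noop coroutine frame.
static const FrameHeader noop_frame = {noop_resume_destroy, noop_resume_destroy};

// Stand-in for coro.subfn.addr(frame, index): a load from the frame header.
// Index 0 selects the resume pointer, index 1 the destroy pointer.
static SubFn subfn_addr(const FrameHeader* frame, int index) {
    return index == 0 ? frame->resume : frame->destroy;
}

// Stand-in for coro.destroy(frame): an indirect call through the loaded pointer.
static void destroy(const FrameHeader* frame) {
    subfn_addr(frame, 1)(frame);
}
```

When `destroy(&noop_frame)` is written with ordinary loads like this, an optimizer can usually see through the load from the constant global and turn the indirect call into a direct one. As long as the access stays hidden behind the intrinsic, that folding does not happen.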

I see multiple ways to fix this issue:
1. In the CoroEarly pass, lower “coro.destroy” and “coro.resume” directly to the corresponding load, instead of lowering to “coro.subfn.addr”. The normal “memory constant folding pass” (mem2reg? not sure which pass does this…) would see the loads and could constant fold them, thereby devirtualizing the call. Downside: CoroElide would now need to do more complicated pattern matching to identify accesses to the resume/destroy function pointers.
2. In the CoroEarly pass, lower “coro.destroy/resume” to memory operations, *except* if they are applied on a “coro.begin”. If they are applied on a “coro.begin”, keep using “coro.subfn.addr”. Benefit: We can still use mem2reg (?) to devirtualize calls on constant coroutine frames. At the same time, CoroElide can keep using “coro.subfn.addr” to devirtualize non-constant coroutine frames.
3. In the CoroCleanup pass, special-case the lowering of “coro.subfn.addr” when applied to constants. In that case, don’t generate loads, but rather produce the corresponding constant. Downside: this is probably too late in the pipeline, so the devirtualized function would not be inlined.

My question: Which of those potential ways would be preferred?


# CoroElide does not support deferring the coroutine frame allocation

Issue (4), i.e. the coroutine frame for the function `sum` being allocated unconditionally, seems to be the most challenging to fix.
I am still kind of lost on how to even approach this…
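As a hand-written sketch of the behavior I would like to end up with (not of how a pass would get there): the frame is only allocated on the path that actually suspends. This is plain C++ without coroutines, and every name in it (`Operand`, `SumFrame`, `sum_deferred`, ...) is hypothetical and not taken from the godbolt snippet.

```cpp
#include <memory>

// Hypothetical operand: either already available or still pending.
struct Operand {
    bool ready;
    int value;
};

// State that would have to survive a suspension.
struct SumFrame {
    int partial;
};

struct SumResult {
    bool done;                        // true if finished synchronously
    int value;                        // valid only if done
    std::unique_ptr<SumFrame> frame;  // allocated only if we had to suspend
};

// Counts heap allocations of the frame so the two paths can be compared.
static int frame_allocations = 0;

static SumResult sum_deferred(Operand a, Operand b) {
    SumResult result;
    result.done = a.ready && b.ready;
    result.value = 0;
    if (result.done) {
        // Synchronous path: everything lives on the stack, no frame needed.
        result.value = a.value + b.value;
        return result;
    }
    // Asynchronous path: only now allocate the frame and spill live state.
    ++frame_allocations;
    result.frame = std::make_unique<SumFrame>();
    result.frame->partial = a.ready ? a.value : 0;
    return result;
}
```

The point is only that the allocation sits after the “do we need to suspend?” check instead of at function entry.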

My questions: Does anyone have an idea how to approach this? Is there maybe already some relevant literature/research on this topic?


Cheers,
Adrian


[1] https://github.com/llvm/llvm-project/blob/607bec0bb9f787acca95f53dabe6a5c227f6b6b2/llvm/lib/Transforms/Coroutines/CoroElide.cpp#L271