[llvm-dev] [RFC] Asynchronous unwind tables attribute

Sat Nov 20 11:12:36 PST 2021

On 2021-11-20, Momchil Velikov wrote:
>On Sat, 20 Nov 2021 at 08:26, Fāng-ruì Sòng <maskray at google.com> wrote:
>> > Asynchronous unwind tables could also restrict code generation to
>> > having only a finite number of frame pointer adjustments (an example
>> > of *not* having a finite number of `SP` adjustments is on AArch64 when
>> > untagging the stack (MTE) in some cases the compiler can modify `SP`
>> > in a loop).
>>
>> The restriction on MTE is new to me as I don't know much about MTE yet.
>
>It has nothing to do with MTE per se, I just noticed it in an MTE test
>(`llvm/test/CodeGen/AArch64/settag.ll:stg_alloca17()`).
>I've got a patch for this, that just uses an extra scratch register (that's
>in
>the epilogue before popping CSRs and we have plenty of registers) and a
>constant
>number (usually one) of SP adjustments by a constant.
>
>> > We could, for example, extend the `uwtable` attribute with an optional
>> > value, e.g.
>> >   -  `uwtable` (default to 2)
>> >   -  `uwtable(1)`, sync unwind tables
>> >   -  `uwtable(2)`, async unwind tables
>> >   -  `uwtable(3)`, async unwind tables, but tracking only a subset of
>> > registers (e.g. CFA and return address)
>> >
>> > Or add a new attribute `async_uwtable`.
>> >
>> > Other suggestions? Comments?
>>
>> I have thought about extending uwtable as well. In spirit the idea
>> looks great to me.
>> The mode removing most callee-saved registers is useful.
>> For example, I think linux-perf just uses pc/sp/fp (as how its ORC
>> unwinder is designed).
>>
>> My slight concern with uwtable(3) is that the amount of unwind
>> information is not monotonic.
>> Since sync->async and the number of registers are two dimensions,
>> perhaps we should use two function attributes?
>
>I reckon this matters when combining (for whatever reasons) multiple
>`uwtable` attributes?
>Indeed, in my first version, I dropped the encoding 3 and then I was able
>to synthesize
>the attribute for an outlined function by simply taking the max of the
>attribute
>in the outlined-from functions - it was just simpler.
>
>How about instead we exchange the meaning of 2 and 3 so we get
>  - 1, sync unwind tables
>  - 2, "minimal" async unwind tables
>  - 3, full async unwind tables
>Then on the principle that we should always emit CFI information that the
>`uwtable` requested
>(as it may be an ABI mandate), possibly optimised, depending on the
>`nounwind` attribute, we would get:
>
>          | nounwind 0           |  nounwind 1
>----------+----------------------+--------------
>uwtable 0 | sync, full           |  no CFI
>----------+----------------------+--------------
>uwtable 1 | sync, full           |  sync, full
>----------+----------------------+--------------
>uwtable 2 | async, full prologue,|
>          | mininal epilogue     |  async, min
>----------+----------------------+--------------
>uwtable 3 | async, full          |  async, full
>
>as a starting point, and then backends may choose any of the entries
>in the following rows of the same column, as a QOI decision.
>
>All that said, I'm not even entirely convinced we need it as a separate
>`uwtable` option.
>The decision to skip some of the CFI instructions can be made during final
>object encoding.
>It probably has to be made during the final encoding, e.g. no point
>including epilogue CFI instructions
>in `.eh_frame`, or an ORC generator would naturally ignore most CFI
>instructions anyway.

I have trouble understanding "min" in the uwtable 2 row.
What does it mean?

As IR attributes, I'd hope the behavior is strictly monotonic in every
dimensions. I.e. If uwtable A provides some information, I'd expect that
uwtable B provides at least the same amount of information if A < B.

We may end up with too many dimensions.
For such rare ones (e.g. the number of registers), I think it is
entirely fine to omit it from IR attributes and make it an
llvm::TargetOptions::* option, like -ffunction-sections. We just lose
fine-grained LTO merge behavior which is probably not needed at all.

>> BTW, are you working on improving the general CFI problems for aarch64?
>Yeah, I'm implementing support for `-fasynchronous-unwind-tables`. A
>slightly outdated
>series of patches start from https://reviews.llvm.org/D112330
>
>The full list I have right now is:
>* [AArch64] Async unwind - Fix MTE codegen emitting frame adjustments in a
>loop
>  - this fixes the issue described above
>* [AArch64] Async unwind - Adjust unwind info in AArch64LoadStoreOptimizer
>  - this fixes some case(s) where load/store optimiser moves an SP inc/dec
>after the matching CFI instruction
>* [CodeGen] Async unwind - add a pass to fix CFI information
>  - this is a pass that inserts `.cfi_remember_state`/`.cfi_restore_state`,
>ideally should work
>    for all targets and replace `CFIInstrInserter`
>* [AArch64] Async unwind - function epilogues
>* [AArch64] Async unwind - function prologues
>  - these are the core functionality
>* [AArch64] Async unwind - Refactor generation of shadow call stack
>prologue/epilogue
>* [AArch64] Async unwind - Always place the first LDP at the end when
>ReverseCSRRestoreSeq is true
>* [AArch64] Async unwind - helper functions to decide on CFI emission
>  - the three above: preparation/refactoring/simplification, `emitEpilogue`
>especially is a big mess
>* [AArch64] Async unwind - do not schedule frame setup/destroy
>* Extend the `uwtable` attribute with unwind table kind
>(I was meaning to update it for a few days now, only always something else
>pops up ...)

Thanks:)

>> Since we are discussing asynchronous unwind tables, may I ask two
>> slightly off-topic things?
>>
>> (1) What's your opinion on ld --no-ld-generated-unwind-info?
>
>I would say, from a design point of view, an unwinder of any kind should
>not analyse and interpret machine
>instructions as it's, in the general case, fragile - that's been my
>experience from developing and maintaining
>an unwinder that analysed prologues/epilogues, over a period of 10+ years,
>each new compiler version required adjustments.
>
>Then, PLT entries are likely to be a special case as they are both tiny and
>extremely unlikely to change between
>different compilers or different compiler versions. In a sense, one can
>treat them as having implicit identical unwind
>table entries (of any unwind table kind) associated with their address
>range, therefore explicit entries in the regular
>unwind tables are superfluous.

Thanks for the comments.

>> (2) How should future stack unwinding strategy evolve?
>Well, that's a good question ... :D

* compact unwind scheme.
* hardware assisted. Piggybacking on security hardening features like shadow call stack. However, this unlikely provides more information about callee-saved registers.
* mainly FP-based. People don't use FP due to performance loss. If `-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer` doesn't hurt performance that much, it may be better than some information in `.eh_frame`. We can use unwind information to fill the gap, e.g. for shrink wrapping.

If we go with the FP route, the expectation from .eh_frame may be lower.
We just need it to fill the gap.
We can probably indicate the intention with a codegen only llvm::TargetOptions option as well.

>~chill
>
>-- 
>Compiler scrub, Arm