[llvm-dev] [RFC] Asynchronous unwind tables attribute
Fāng-ruì Sòng via llvm-dev
llvm-dev at lists.llvm.org
Sat Nov 20 00:26:09 PST 2021
On Wed, Nov 17, 2021 at 3:19 AM Momchil Velikov via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>
> On one hand, we have the `uwtable` attribute in LLVM IR, which tells
> whether to emit CFI directives. On the other hand, we have the `clang
> -cc1` command-line option `-funwind-tables=1|2` and the codegen
> option `VALUE_CODEGENOPT(UnwindTables, 2, 0) ///< Unwind tables (1) or
> asynchronous unwind tables (2)`.
> Thus we lose along the way the information whether we want just some
> unwind tables or asynchronous unwind tables.
Thanks for starting the topic. I am very interested in it and
would like to see CFI improved.
I have looked into -funwind-tables/-fasynchronous-unwind-tables and
made some relatively simple changes:
- default to -fasynchronous-unwind-tables for aarch64/ppc
- fix -f(no-)unwind-tables/-f(no-)asynchronous-unwind-tables and make
-fno-asynchronous-unwind-tables work with instrumentation
- add `-funwind-tables=1|2`
but I haven't done anything at the IR level.
It's good to see someone pick up the heavy lifting so that I
don't need to do it :)
That said, if you need a reviewer or help with some work items, feel
free to offload some to me.
> Asynchronous unwind tables take more space in the runtime image, I'd
> estimate something like 80-90% more, as the difference is adding
> roughly the same number of CFI directives as for prologues, only a bit
> simpler (e.g. `.cfi_offset reg, off` vs. `.cfi_restore reg`). Or even
> more, if you consider tail duplication of epilogue blocks.
> Asynchronous unwind tables could also restrict code generation to
> having only a finite number of frame pointer adjustments (an example
> of *not* having a finite number of `SP` adjustments is on AArch64 when
> untagging the stack (MTE) in some cases the compiler can modify `SP`
> in a loop).
The restriction on MTE is new to me as I don't know much about MTE yet.
>
> Having the CFI precise up to an instruction generally also means one
> cannot bundle together CFI instructions once the prologue is done,
> they need to be interspersed with ordinary instructions, which means
> extra `DW_CFA_advance_loc` commands, further increasing the unwind
> tables size.
>
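To make the `DW_CFA_advance_loc` cost above concrete, here is a rough sketch in C of how many bytes the location advances alone take. It relies on the DWARF encodings (`DW_CFA_advance_loc` packs a 6-bit factored delta into the opcode byte; `DW_CFA_advance_loc1/2/4` carry a 1/2/4-byte operand). The function names `advance_loc_size` and `total_advance_bytes` are hypothetical, and the sketch assumes the offsets are ascending multiples of the code alignment factor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one "advance location" CFI instruction for a
 * given factored delta (byte delta divided by the code alignment factor). */
size_t advance_loc_size(uint32_t factored_delta) {
    if (factored_delta < 0x40)
        return 1; /* DW_CFA_advance_loc: delta lives in the low 6 opcode bits */
    if (factored_delta <= 0xff)
        return 2; /* DW_CFA_advance_loc1: opcode + 1-byte delta */
    if (factored_delta <= 0xffff)
        return 3; /* DW_CFA_advance_loc2: opcode + 2-byte delta */
    return 5;     /* DW_CFA_advance_loc4: opcode + 4-byte delta */
}

/* Total bytes spent on location advances when the CFI state changes at
 * each of the given code offsets (ascending, relative to function start). */
size_t total_advance_bytes(const uint32_t *offsets, size_t n,
                           uint32_t code_align) {
    assert(code_align != 0);
    size_t total = 0;
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t delta = (offsets[i] - prev) / code_align;
        if (delta != 0) /* no advance is needed at an unchanged location */
            total += advance_loc_size(delta);
        prev = offsets[i];
    }
    return total;
}
```

With a code alignment factor of 4, describing two spills in one bundle at offset 8 costs one advance byte, while describing each spill right after its instruction (offsets 4 and 8) costs two. This counts only the advances, not the CFI instructions themselves.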
> That is to say, async unwind tables impose a non-negligible overhead,
> yet for the most common use cases (like C++ exceptions), they are not
> even needed.
>
> We could, for example, extend the `uwtable` attribute with an optional
> value, e.g.
> - `uwtable` (default to 2)
> - `uwtable(1)`, sync unwind tables
> - `uwtable(2)`, async unwind tables
> - `uwtable(3)`, async unwind tables, but tracking only a subset of
> registers (e.g. CFA and return address)
>
> Or add a new attribute `async_uwtable`.
>
> Other suggestions? Comments?
I have thought about extending uwtable as well. In spirit the idea
looks great to me.
A mode that drops most callee-saved registers is useful.
For example, I think linux-perf just uses pc/sp/fp (this is how its ORC
unwinder is designed).
My slight concern with uwtable(3) is that the amount of unwind
information is not monotonic in the attribute value.
Since sync vs. async and the number of tracked registers are two
independent dimensions, perhaps we should use two function attributes?
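To illustrate the monotonicity concern, here is a small C model of the two dimensions. Everything in it is hypothetical (the `uw_kind`/`uw_regs` enums, `uw_request`, `uw_covers`, `uw_from_level`); it is not an existing LLVM API. It also assumes that uwtable(1) and uwtable(2) track all callee-saved registers.

```c
#include <assert.h>
#include <stdbool.h>

/* Dimension 1: where the unwind info must be valid. */
enum uw_kind { UW_KIND_NONE = 0, UW_KIND_SYNC = 1, UW_KIND_ASYNC = 2 };

/* Dimension 2: which registers are tracked. */
enum uw_regs { UW_REGS_CFA_RA = 0, UW_REGS_ALL = 1 };

struct uw_request {
    enum uw_kind kind;
    enum uw_regs regs;
};

/* True if `a` provides at least as much unwind info as `b` in both
 * dimensions. */
bool uw_covers(struct uw_request a, struct uw_request b) {
    return a.kind >= b.kind && a.regs >= b.regs;
}

/* Map the single-integer proposal (1, 2, 3) onto the two dimensions.
 * Assumption: levels 1 and 2 track all callee-saved registers. */
struct uw_request uw_from_level(int level) {
    assert(level >= 1 && level <= 3);
    switch (level) {
    case 1:
        return (struct uw_request){ UW_KIND_SYNC, UW_REGS_ALL };
    case 2:
        return (struct uw_request){ UW_KIND_ASYNC, UW_REGS_ALL };
    default:
        return (struct uw_request){ UW_KIND_ASYNC, UW_REGS_CFA_RA };
    }
}
```

Under this model level 2 covers both 1 and 3, but 3 does not cover 2, and 1 and 3 are incomparable. A larger integer therefore does not imply more unwind information, while each of the two fields is ordered on its own.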
>
> ~chill
BTW, are you working on fixing the general CFI problems for aarch64?
I tried to understand the implementation limitations in September (in
https://reviews.llvm.org/D109253) but then stopped.
If you have patches, I'll be happy to study them :)
I know there are quite a few problems, such as:
(a) .cfi_* directives in the prologue are less precise
% cat a.c
void foo() {
  asm("" ::: "x23", "x24", "x25");
}
% clang --target=aarch64-linux-gnu a.c -S -o -
...
foo:                                    // @foo
        .cfi_startproc
// %bb.0:                               // %entry
        str     x25, [sp, #-32]!        // 8-byte Folded Spill
        stp     x24, x23, [sp, #16]     // 16-byte Folded Spill
        .cfi_def_cfa_offset 32          ////// should be immediately after the pre-indexed str
        .cfi_offset w23, -8
        .cfi_offset w24, -16
        .cfi_offset w25, -32
        //APP
        //NO_APP
(b) .cfi_* directives (for MachineInstr::FrameDestroy) in the epilogue
are generally missing
(c) A basic block following an exit block may have wrong CFI
information (this can be fixed with .cfi_restore)
Most of these problems apply to all non-x86 targets.
---
Since we are discussing asynchronous unwind tables, may I ask two
slightly off-topic things?
(1) What's your opinion on ld --no-ld-generated-unwind-info?
Mine is https://maskray.me/blog/2020-11-15-explain-gnu-linker-options#no-ld-generated-unwind-info
(2) How should the stack unwinding strategy evolve in the future?
A hardware-assisted approach, such as leveraging the shadow call stack?
Making FP more efficient so that user code can use
-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer and drop the
inefficient (in both size and run-time performance) .eh_frame?
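For context, frame-pointer unwinding amounts to following a linked list of frame records. Here is a minimal sketch in C, assuming the AArch64/x86-64 convention where each record holds the caller's frame pointer followed by the return address. `frame_record` and `fp_unwind` are hypothetical names, and the walk runs over caller-supplied records rather than the live stack, so it stays well-defined regardless of compiler flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One frame record: saved caller FP, then the return address. */
struct frame_record {
    const struct frame_record *prev_fp; /* NULL marks the outermost frame */
    uintptr_t return_address;
};

/* Follow the FP chain starting at `fp`, storing up to `max_frames`
 * return addresses into `out`. Returns the number of frames visited. */
size_t fp_unwind(const struct frame_record *fp, uintptr_t *out,
                 size_t max_frames) {
    assert(out != NULL);
    size_t n = 0;
    while (fp != NULL && n < max_frames) {
        out[n++] = fp->return_address;
        fp = fp->prev_fp; /* one load per frame, no table lookup */
    }
    return n;
}
```

Each step is a single pointer load, which is why this is cheap at run time compared with interpreting .eh_frame. The cost is the frame-pointer register and the prologue/epilogue instructions that -fno-omit-frame-pointer keeps in every function.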
Last year I wrote a post,
https://maskray.me/blog/2020-11-08-stack-unwinding, while I was learning
about stack unwinding.
I am going to amend it to include my recent thoughts.