[llvm-dev] [RFC] Asynchronous unwind tables attribute

Sat Nov 20 00:26:09 PST 2021

On Wed, Nov 17, 2021 at 3:19 AM Momchil Velikov via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>
> On one hand, we have the `uwtable` attribute in LLVM IR, which tells
> whether to emit CFI directives. On the other hand, we have the `clang
> -cc1` command-line option `-funwind-tables=1|2` and the codegen
> option `VALUE_CODEGENOPT(UnwindTables, 2, 0) ///< Unwind tables (1) or
> asynchronous unwind tables (2)`.
> Thus we lose along the way the information whether we want just some
> unwind tables or asynchronous unwind tables.

Thanks for starting this topic. I am very interested in it and
would like to see CFI improved.

I have looked into -funwind-tables/-fasynchronous-unwind-tables and
made some relatively simple changes:

  -  default to -fasynchronous-unwind-tables for aarch64/ppc
  -  fix -f(no-)unwind-tables/-f(no-)asynchronous-unwind-tables
  -  make -fno-asynchronous-unwind-tables work with instrumentation
  -  add `-funwind-tables=1|2`

but I haven't done anything at the IR level.
It's good to see that someone is picking up the heavy lifting so that I
don't need to do it :)
That said, if you need a reviewer or help with some work items, feel
free to offload some to me.

> Asynchronous unwind tables take more space in the runtime image, I'd
> estimate something like 80-90% more, as the difference is adding
> roughly the same number of CFI directives as for prologues, only a bit
> simpler (e.g. `.cfi_offset reg, off` vs. `.cfi_restore reg`). Or even
> more, if you consider tail duplication of epilogue blocks.
> Asynchronous unwind tables could also restrict code generation to
> having only a finite number of frame pointer adjustments (an example
> of *not* having a finite number of `SP` adjustments is on AArch64 when
> untagging the stack (MTE) in some cases the compiler can modify `SP`
> in a loop).

The restriction on MTE is new to me as I don't know much about MTE yet.

>
> Having the CFI precise up to an instruction generally also means one
> cannot bundle together CFI instructions once the prologue is done,
> they need to be interspersed with ordinary instructions, which means
> extra `DW_CFA_advance_loc` commands, further increasing the unwind
> tables size.
>
> That is to say, async unwind tables impose a non-negligible overhead,
> yet for the most common use cases (like C++ exceptions), they are not
> even needed.
>
> We could, for example, extend the `uwtable` attribute with an optional
> value, e.g.
>   -  `uwtable` (default to 2)
>   -  `uwtable(1)`, sync unwind tables
>   -  `uwtable(2)`, async unwind tables
>   -  `uwtable(3)`, async unwind tables, but tracking only a subset of
> registers (e.g. CFA and return address)
>
> Or add a new attribute `async_uwtable`.
>
> Other suggestions? Comments?

I have thought about extending uwtable as well, and in spirit the idea
looks great to me.
A mode that drops most callee-saved registers is useful.
For example, I think linux-perf just uses pc/sp/fp (which is how the ORC
unwinder is designed).

My slight concern with uwtable(3) is that the amount of unwind
information is not monotonic in the attribute value: uwtable(3)
describes fewer registers than uwtable(2).
Since sync vs. async and the set of tracked registers are two
independent dimensions, perhaps we should use two function attributes?
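
To illustrate the two-dimension point, here is a minimal C sketch. The
names (`uw_kind`, `uw_regs`, `uw_covers`, `uw_from_level`, `uw_demo`) are
hypothetical and exist only for this illustration; they are not LLVM
attributes or APIs. The mapping in `uw_from_level` follows the
`uwtable(1|2|3)` values proposed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, for illustration only (not LLVM's IR attributes). */
enum uw_kind { UW_NONE = 0, UW_SYNC = 1, UW_ASYNC = 2 };   /* where the tables are valid */
enum uw_regs { UW_REGS_MINIMAL = 0, UW_REGS_ALL = 1 };     /* which registers are tracked */

struct uw_info {
  enum uw_kind kind;
  enum uw_regs regs;
};

/* True if `a` provides at least as much unwind information as `b`
   in both dimensions. */
static bool uw_covers(struct uw_info a, struct uw_info b) {
  return a.kind >= b.kind && a.regs >= b.regs;
}

/* Decode the single-value encoding proposed in the RFC. */
static struct uw_info uw_from_level(int level) {
  switch (level) {
  case 1:  return (struct uw_info){UW_SYNC, UW_REGS_ALL};
  case 2:  return (struct uw_info){UW_ASYNC, UW_REGS_ALL};
  case 3:  return (struct uw_info){UW_ASYNC, UW_REGS_MINIMAL};
  default: return (struct uw_info){UW_NONE, UW_REGS_MINIMAL};
  }
}

static void uw_demo(void) {
  /* 2 covers both 1 and 3 ... */
  assert(uw_covers(uw_from_level(2), uw_from_level(1)));
  assert(uw_covers(uw_from_level(2), uw_from_level(3)));
  /* ... but 3 > 2 numerically without covering 2, */
  assert(!uw_covers(uw_from_level(3), uw_from_level(2)));
  /* and 1 and 3 are not comparable at all. */
  assert(!uw_covers(uw_from_level(3), uw_from_level(1)));
  assert(!uw_covers(uw_from_level(1), uw_from_level(3)));
}
```

With two separate fields, "more information" is a simple per-field
comparison, whereas a single integer cannot express that levels 1 and 3
are incomparable.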

>
> ~chill

BTW, are you working on improving the general CFI problems for aarch64?
I tried to understand the implementation limitations in September (in
https://reviews.llvm.org/D109253) but then stopped.
If you have patches, I'll be happy to study them :)

I know there are quite a few problems, such as:

(a) .cfi_* directives in the prologue are imprecise

% cat a.c
void foo() {
  asm("" ::: "x23", "x24", "x25");
}
% clang --target=aarch64-linux-gnu a.c -S -o -
...
foo:                                    // @foo
        .cfi_startproc
// %bb.0:                               // %entry
        str     x25, [sp, #-32]!                // 8-byte Folded Spill
        stp     x24, x23, [sp, #16]             // 16-byte Folded Spill
        .cfi_def_cfa_offset 32   ////// should be immediately after the pre-indexed str
        .cfi_offset w23, -8
        .cfi_offset w24, -16
        .cfi_offset w25, -32
        //APP
        //NO_APP

(b) .cfi_* directives (for MachineInstr::FrameDestroy) in the epilogue
are generally missing

(c) A basic block following an exit block may have wrong CFI
information (this can be fixed with .cfi_restore)

Most problems apply to all non-x86 targets.
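
Problem (a) can be made concrete with a small model. The following C
sketch is a simplification under stated assumptions: it tracks only the
CFA offset, assumes 4-byte instructions, and uses hypothetical names
(`cfa_row`, `cfa_offset_at`, `count_wrong`); it is not how LLVM or any
real unwinder represents CFI. The `emitted_foo` table mirrors the listing
in (a), where `.cfi_def_cfa_offset 32` takes effect only after the `stp`.

```c
#include <assert.h>
#include <stddef.h>

/* One row of a simplified unwind table:
   from code offset `pc` onward, CFA = SP + cfa_offset. */
struct cfa_row {
  unsigned pc;
  int cfa_offset;
};

/* CFA offset that the table claims at `pc`. Rows must be sorted by pc.
   The initial rule at function entry is CFA = SP + 0. */
static int cfa_offset_at(const struct cfa_row *rows, size_t n, unsigned pc) {
  int off = 0;
  for (size_t i = 0; i < n && rows[i].pc <= pc; i++)
    off = rows[i].cfa_offset;
  return off;
}

/* Count instruction boundaries where the table disagrees with the real
   CFA offset. actual[i] is the true offset just before instruction i. */
static int count_wrong(const struct cfa_row *rows, size_t n,
                       const int *actual, size_t n_insns) {
  int wrong = 0;
  for (size_t i = 0; i < n_insns; i++)
    if (cfa_offset_at(rows, n, (unsigned)(i * 4)) != actual[i])
      wrong++;
  return wrong;
}

/* foo from (a):
     offset 0: str x25, [sp, #-32]!     SP drops by 32
     offset 4: stp x24, x23, [sp, #16]  SP unchanged
     offset 8: function body                                  */
static const int actual_foo[] = {0, 32, 32};

/* As emitted: the CFA change is recorded after both stores. */
static const struct cfa_row emitted_foo[] = {{8, 32}};

/* Instruction-precise: recorded right after the str. */
static const struct cfa_row precise_foo[] = {{4, 32}};
```

In this model, `count_wrong(emitted_foo, 1, actual_foo, 3)` reports one bad
boundary (offset 4) and the precise table reports none. A synchronous
unwinder only looks at call sites, which come after the prologue, so it
never observes the bad row; an asynchronous one (e.g. a sampling
profiler) can stop exactly there.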

---

Since we are discussing asynchronous unwind tables, may I ask two
slightly off-topic things?

(1) What's your opinion on ld --no-ld-generated-unwind-info?
Mine is https://maskray.me/blog/2020-11-15-explain-gnu-linker-options#no-ld-generated-unwind-info

(2) How should stack unwinding strategies evolve in the future?
Should we take a hardware-assisted approach, such as leveraging the
shadow call stack?
Or should we make FP-based unwinding more efficient, so that user code can use
-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer and drop
.eh_frame, which is inefficient in both size and run-time performance?
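
For comparison with .eh_frame, frame-pointer-based unwinding is
essentially a linked-list walk. Below is a minimal C sketch, assuming the
usual two-word frame record (saved caller FP followed by the return
address) as used on AArch64 and on x86-64 when frame pointers are kept.
The names `frame_record`, `walk_frame_chain` and `fp_demo` are
hypothetical, and the demo walks a simulated chain stored in an array
rather than the real stack.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout of the record that the frame pointer points to. */
struct frame_record {
  struct frame_record *caller_fp;  /* saved frame pointer of the caller */
  void *return_address;            /* where this frame returns to */
};

/* Walk the chain starting at `fp`, storing up to `max` return addresses
   in `pcs`. Returns the number of frames visited. */
static size_t walk_frame_chain(const struct frame_record *fp,
                               void **pcs, size_t max) {
  size_t n = 0;
  while (fp != NULL && n < max) {
    pcs[n++] = fp->return_address;
    const struct frame_record *next = fp->caller_fp;
    /* The stack grows down, so a caller's record must live at a higher
       address. Anything else means a corrupt chain: stop. */
    if (next != NULL && next <= fp)
      break;
    fp = next;
  }
  return n;
}

/* Simulated stack: index 0 is the innermost frame. */
static size_t fp_demo(void) {
  static int f, g, h;  /* stand-ins for return addresses */
  struct frame_record stack[3];
  stack[0] = (struct frame_record){&stack[1], &f};
  stack[1] = (struct frame_record){&stack[2], &g};
  stack[2] = (struct frame_record){NULL, &h};

  void *pcs[8];
  size_t n = walk_frame_chain(&stack[0], pcs, 8);
  assert(n == 3 && pcs[0] == &f && pcs[1] == &g && pcs[2] == &h);
  return n;
}
```

No tables are consulted during the walk; the cost is one reserved
register plus the frame-record stores in each prologue.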

Last year I wrote a post,
https://maskray.me/blog/2020-11-08-stack-unwinding, while I was learning
about stack unwinding.
I am going to amend it to include my recent thoughts.