[PATCH] ARM + Thumb Segmented Stacks (v2)

Thu Mar 27 06:56:22 PDT 2014

Hi Alex,

> Thanks for the response Oliver!
> 
> I've responded to your comments inline below, but I wanted to mention
> that I don't think that this patch precludes fully using segmented
> stacks just yet. We have had segmented stack code generation enabled
> in LLVM for as long as I can remember in Rust. We have quite an
> exhaustive test suite that is running on both x86 and ARM, so I
> wouldn't call this addition useless waiting for more implementation.
> 
> I will agree, however, that it's not as 100% rock solid as what I
> imagine gcc's x86 support for segmented stacks is. Many platforms
> still aren't supported (only linux and android are currently
> supported), and the Thumb support is fairly new and hasn't seen the
> same coverage over time as ARM has. I'd love to help improve this
> though!
> 
> > 1) Has the calling convention of __morestack for ARM been
> defined/documented anywhere?
> 
> Not that I know of, sadly. I'm also unaware of any documentation for
> x86/x86-64. Do you know where I should put some small tidbits about
> this documentation?

I'm not sure, but it is probably worth asking on the gcc mailing list to
make sure no-one else is working on this.

Unless someone has a better suggestion, I would suggest putting a comment in
the code to explain the exact calling convention you expect __morestack to
have. This needs to include at least the argument registers, which registers
are preserved and the distance between the call to __morestack and the main
function body.

> > Since rust has now abandoned segmented stacks
> 
> I may have mistakenly led you to believe this earlier, but we have not
> abandoned the segmented stack prologue in rust. We've abandoned the
> architectural part of segmented stacks that is segmented stacks
> (ironic, yes?), but we are still using the prologue to detect stack
> overflow. This is something we plan on changing, but we have no come
> to a good conclusion on what it should change to.
> 
> This means that we currently have a __morestack implementation for ARM
> [1]. This implementation doesn't allocate any more stack space, it
> just jumps to a function which will abort (for various unrelated
> reasons).

Ah, so upstreaming this will get rust one step closer to being able to use
unmodified LLVM.

> > 2) Is there currently an ARM implementation of __morestack? Rust's is
> dead.
> 
> Yes. Above I mentioned the current __morestack implementation, and you
> can find the much older version (which actually did the segmented
> stack bits) in the git history [2]. Just to clarify again, Rust's is
> not dead, it's just not being used for what it was originally
> intended.
> 
> > 3) You have not made any changes to the emission of DWARF info, so it
> seems unlikely that the DWARF stack frame info will accurately describe
> the stack layout. The emission of DWARF stack frame info may have been
> added to LLVM after rust forked it, but we cannot emit incorrect debug
> info.
> 
> I may be missing it, but it appears that x86 doesn't modify any such
> DWARF info either? One of our debug-info tests [3] has a comment
> saying that gdb breaks before the morestack prologue has run, so line
> numbers are used.

As far as I can tell (though I'm not very familiar with x86 assembly), the
x86 version does not change the stack pointer or any callee-saved registers
(at least on the fast-path), so would not need any DWARF call-frame
instructions.

> I'm not super familiar with LLVM's DWARF support for stack frames
> (sorry, I'm just trying my best to upstream the patch!), so could you
> point me in the direction of what needs to be updated?

If you search for CFI_INSTRUCTION in ARMFrameLowering.cpp, you will find all
of the places where we currently emit call frame instructions. There are two
things these instructions do:
* Let the debugger know, at any point in the function, how to calculate the
CFA (Canonical Frame Address). This is defined for ARM as the value of sp at
entry to the current function. This will need doing every time you update
the stack pointer (i.e. once for each push and pop instruction).
* Let the debugger know how to find the value of each register as it would
have been immediately before this function was called. This will need doing
each time you save or restore the value of a callee-saved register.

This should be simple enough to implement, so I'd like it to be done as part
of this patch.

> > 5) GCC can handle variadic functions for x86 and x86_64, and I can
> see no reason why we cannot do this on ARM. We should at least make
> sure that our definition of the interface to __morestack will not
> prevent adding this at a later date.
> 
> LLVM's currently implementation of segmented stacks for x86 does not
> support varargs (so this ARM patch copies the same strategy). The GCC
> docs for split stacks mention varargs functions, and the conclusion is
> that they should basically be compiled differently to reference
> arguments based on an arbitrary pointer rather than the frame pointer.
> 
> This sounds to me like it's the same __morestack interface, though,
> where there's only one argument (the varargs pointer). This would
> involve codegen changes for varargs functions which have split stacks
> enabled to use this new pointer, however.

That's fine, as long as it can be added at a later date.

6) The code emitted could be made more efficient for the fast path. This is
the assembly emitted for a simple function, compiled for ARM with segmented
stacks:

sum:
    push    {r4, r5}
    mrc     p15, #0, r4, c13, c0, #3
    mov     r5, sp
    ldr     r4, [r4, #4]
    cmp     r4, r5
    blo     .LBB0_2

    mov     r4, #0
    mov     r5, #0
    stmdb   sp!, {lr}
    bl      __morestack
    ldm     sp!, {lr}
    pop     {r4, r5}
    bx      lr
.LBB0_2:
    pop     {r4, r5}
    add     r0, r1, r0   // begin normal function body
    bx      lr

The first basic block has a lot of unnecessary copying of values into low
registers. I think the code can be optimised to this:

sum:
    mrc     p15, #0, r12, c13, c0, #3
    ldr     r12, [r12, #4]
    cmp     r12, sp
    blo     .LBB0_2

    push    {r4, r5}
    mov     r4, #0
    mov     r5, #0
    bl      __morestack
    pop     {r4, r5}
    bx      lr
.LBB0_2:
    add     r0, r1, r0   // begin normal function body
    bx      lr

The changes are:
1) Use r12 as a temporary in the first basic block, instead of r4. r12 is
callee-saved, so we do not need to worry about preserving its value.
2) If CompareStackPointer is true, we can use sp directly in the cmp
instruction, saving one mov instruction.
3) For functions where optimisations (1) and (2) can be applied, we do not
need any temporaries in the first basic block. This means we do not need to
push r4 and r5 unless we decide to allocate more stack space, and do not
have to pop them before the function body.
4) I don't think we need to push/pop lr around the call to __morestack.
Looking at rust's implementation of __morestack (the old and current ones),
lr should be preserved by __morestack. Note that this will reduce the
distance between the call to __morestack and the start of the actual
function, so rust's implementation of __morestack will need to be updated to
match.

The above optimisations should also apply to the Thumb version, with the
caveat that this may increase code size slightly (using 32-bit wide
instructions rather than 16-bit). There is one additional optimisation that
can be applied to Thumb:
5) You are currently emitting 32-bit push and pop instructions, where 16-bit
encodings will do. The only push/pop instruction which you are using which
does not have a 16-bit encoding is "ldm.w   sp!, {lr}" (aka push.w {lr}).

Optimisations (4) and (5) should be fairly simple, so I think they should be
done as part of this patch, but the others are a bit more involved, so could
wait until a later patch.

7) I'm slightly confused as to why ARM and Thumb use different methods to
get the stack limit.  It seems to me that this should be decided based on
platform (e.g. linux or android), not instruction set, especially given that
ARM and Thumb code can be freely combined in the same executable.

> [1] -
> https://github.com/mozilla/rust/blob/master/src/rt/arch/arm/morestack.S
> [2] -
> https://github.com/mozilla/rust/blob/86efd9/src/rt/arch/arm/morestack.S
> [3] - https://github.com/mozilla/rust/blob/master/src/test/debug-
> info/function-arg-initialization.rs
[4] -
http://static.rust-lang.org/doc/0.9/complement-usage-faq.html#why-do-gdb-bac
ktraces-end-with-the-error-previous-frame-inner-to-this-frame-corrupt-stack
[5] - http://www.sourceware.org/ml/gdb-patches/2010-11/msg00383.html