[LLVMdev] GSoC '11: Segmented Stacks

Sun Apr 3 13:39:03 PDT 2011

Hi All!

This is the third iteration of my GSoC proposal, which I'm mailing here
for feedback. I've already posted the proposal on Melange.

The proposal is in two parts. The first, which answers the questions on
application template mentioned on Melange is here [1]. I've pasted the
most relevant part here:

'''
Implement segmented stacks inside LLVM. Once this is implemented,
instead of having to allocate a worst-case (large) amount of contiguous
stack space to each thread, we'll be able to get each thread to allocate
stack space in small atomic blocks, as and when more space is required.
'''

The second part, on which I require the most feedback, addressing the
implementation issues, follows:

SEGMENTED STACKS FOR LLVM
=========================

This describes the implementation details for my GSoC proposal, on
implementing segmented stacks for LLVM. I will primarily be working on
the X86 platform in the duration of my GSoC work.

Blocks
------

Stack space will be allocated in fixed blocks. Blocks are linked doubly
to form a list. Blocks are never freed, and are reused when possible.
Each frame is contained completely inside _one_ block. The converse is
not true.

Each block has a header containing, apart from the next and previous
pointers, the value of the stack pointer and the instruction pointer
before this new block was created. The previous stack pointer needs to
be stored so that it can be correctly restored once the block is no
longer a part of the stack. The IP needs to be stored since the "real"
return address on the stack (i.e. the one that will be popped) will be
hijacked to perform cleanup (mentioned later).

Block sizes need to be a constant power of two, so that checking for
available space is cheap.

Prologue & Cleanup
------------------

Since every stack frame must completely fit inside one block, the
current block needs to have enough space to hold the current function's
stack frame. This will be done in the prologue
(X86FrameLowering::emitPrologue). If the function uses a constant amount
of stack space, that value (known at compile-time) will be used.
Otherwise some worst-case upper limit (8 KiB, say) is chosen as the
"size" of the stack-frame.

If the current block does not have enough space, control branches to a
stub of code (a naked function, setup_block) which

 1. Allocates a new block if required (may not always be needed, since
we're keeping never de-allocating old blocks).
 2. Copies everything addressed SP relative to the new block. Adjusts
the stack pointer.
 3. Saves the real return address in the block header. Modifies the
return address on the stack to point to a cleanup routine, destroy_block.
 4. "Returns" to the original function prologue. Since we can't really
use the stack to communicate the return address, one of the
callee-clobbered registers will be used.

The destroy_block stub restores the stack pointer and the instruction
pointer from the current block header. Since the block sizes are
constant and a power of two, it is always possible to tell the header
and the space left in the current block from the current stack pointer.

Link Time Optimizations
-----------------------

Checking if there is enough space on the current block and making the
conditional jump is expensive, and should be avoided whenever possible.
One paper on this topic (Whole-Program Optimization for Time and Space
Efficient Threads) talks about using the call graph to reason about the
(upper-bound on) stack space required by various functions. For
instance, if the functions F, G and H (each requiring f, g and h bytes
of stack space, respectively) is dominated by the function A (which
requires a bytes of stack space) in the call graph, we only need to
allocate (a + MAX(f, g, h)) bytes of stack space in A's prologue and
then eliminate the check in F, G and H.

This can implemented by dividing the PEI pass into an analyze and an
action pass. The analyze pass just calls calculateCallsInformation for
each MachineFunction. After this, if we are compiling with split-stacks
(i.e. the target supports it, and it is enabled), we run a
CallGraphSCCPass (InferGlobalStackSpace, say) which implements the
algorithm mentioned in the paper. The actual insertion of the prologue
and epilogue code is delegated to a third pass. This (third) pass uses
the data collected by InferGlobalStackSpace to insert appropriately
modified prologues (the epilogue remains the same).

Issues
------

Dividing up the stack into a list of smaller blocks violates one
important constraint: stack addresses are no longer continuous between
function invocations. This can affect a number of things, all of which
I've tried to address:

Unwinding
~~~~~~~~~

Unwinding should not be a problem if the .eh_frame sections are emitted
carefully. Correct DWARF info can be emitted as follows:

Since we know the offset of base of the stack frame from the stack
pointer (or we are maintaining a frame pointer), we can always say
whether the concerned frame is the first frame in the block. If not, all
the previous register values can be computed as usual. Otherwise, we can
add an extra indirection, involving looking up the stack pointer saved
in this block's header.

Frame Pointers
~~~~~~~~~~~~~~

Nothing changes if the function call does not have to create a new
block, even if the function maintains a frame pointer. So all the
changes can be localized in setup_block.

However, in case the frame pointer is being maintained for the function,
and a new block is being allocated, the value at %rbp will also have to
be copied to the new block, and %rbp will have to be changed to point to
this new location. This is so that the FP relative indexing works. %rbp
will then be set to its "correct" value by the epilogue itself. We'll
probably need a new routine (setup_block_fp) for this situation. Since
we're resetting the stack pointer in destroy_block, the epilogue is free
to clobber %rsp with the "wrong" value of %rbp.

Stack Space For Block Setup
~~~~~~~~~~~~~~~~~~~~~~~~~~~

setup_block and setup_block_fp will be written to use the minimum amount
of stack space possible, and /that/ stack space will have to be factored
into the check in the function prologue. So every function prologue
checks for the space it needs _plus_ the space required to run
setup_block (or setup_block_fp) once, by a callee.

Red Zone
~~~~~~~~

The red zone can be usable only if there actually _is_ 128 bytes of
stack space left in the current block. One way the red-zone can be
profitably used is to always add 128 bytes to the stack space required
by a function. Assuming setup_block is written carefully (and needs <
128 bytes of stack space), this should also take care of the previous point.

Varargs
~~~~~~~

When we create a new block, and are copying all the data that is
addressed SP relative, we will not copy the arguments for a vararg
function. We will copy the va_list structure, however, and since the
fields point (correctly) to locations on the previous block anyways,
accessing the arguments should work fine.

Interoperability
~~~~~~~~~~~~~~~~

We can't safely make calls from code using segmented-stacks to code
expecting a vanilla contiguous stack. This can be fixed by the linker by
padding calls into non-segmented-stack code from segmented-stack code
with allocation and de-allocation of a 64 k block. For the linker to be
able to do this, the compiler needs to emit a dummy section into object
files compiled with segmented stacks. The linker will then use as a
flag. Of course, no such thing needs to be done when linking bc files,
since the native code has not been generated yet.

This is exactly the approach mentioned in [2]. However, this approach is
suboptimal, and to get full benefits of segmented stacks, all code needs
to be compiled with segmented stacks.

[1] http://pastebin.com/UDuADcMC
[2] http://gcc.gnu.org/wiki/SplitStacks

-- 
Sanjoy Das
http://playingwithpointers.com