[LLVMdev] RFC: GSoC Project

Mon Mar 28 10:58:05 PDT 2011

Hi!

Thanks for the review. What follows is a revised version of the
proposal. I'll post this on Melange after getting some more feedback.

Segmented stacks for LLVM, GSoC '11
-----------------------------------

* Overview: Goal & Rationale
  Support segmented (split) stacks in LLVM.

  The idea is to allocate stack space to a thread incrementally, instead
of allocating a (large) worst-case block of memory. This is especially
important on 32 bit systems (where the virtual address space is
limited). On 64 bit systems, (as Joerg pointed out), it will be useful
in situations where the code uses a continuation passing style (like
Chicken Scheme) or in critical software like kernels where it may not
actually be possible to assign more than a certain amount of memory to
the stack. In a situation where stacks can smaller in size than a single
page, everyone wins.

  This summer, I plan to implement this for the x86 (including x86-64)
platform, since that is the platform I'm most familiar with.

* Overall Design
  The overall design of is described below. More concrete details have
been put under the 'Implementation' section.

** The idea
   The central idea is to represent the stack as a doubly-linked list of
memory blocks instead of a (large) contiguous chunk of memory. Each
stack-frame will be made to fit entirely in one block, but one block may
(and, in general, will) contain more than one stack frame.

*** Block lifetime
    Blocks once allocated are not de-allocated - since they are still a
part of the doubly-linked list (and hence, not /lost/), there will be no
memory leak (and the blocks will be re-used the next time more memory is
required).

** Issues with unwinding
   Dividing up the stack into a list of smaller blocks violates one
important constraint: stack addresses are no longer continuous between
function invocations. This may affect any runtime magic that involves
unwinding the stack, like exception handling and scanning the stack for
GC roots.
	However, assuming the unwinding routine uses the .eh_frame section,
unwinding should work fine; since we'll also emit a correct  .eh_frame.
Edge cases (if they arise) will have to be eliminated on a case by case
basis.

* Implementation
  This section describes the proposed implementation in more detail.

** The block
   The block is a chunk of memory, whose size is a power of two (in
bytes). It has four words reserved in the beginning of the block:

 1. previous - Pointing to the previous block.
 2. next - Pointing to the next block.
 3. saved_ip - The original return address back to the callee.
 4. saved_sp - The value of the SP when this block was created.

** Changes to the function prologue
   In its prologue, a function will check whether the current block has
enough stack space. This is easily done for function which use a fixed
amount of stack space. For the ones which don't, we can assume some
worst-case upper bound. Since the block size is a constant power of two,
this can be done using a bitwise and and an integer comparison. If there
isn't enough space in the current block, the control jmps (does not
call) to a naked function (called setup_new_block)

*** setup_new_block:
setup_new_block will be small naked function emitted into all programs
being compiled with segmented stacks. It does the following:

 1. Checks if the current block's next pointer is populated or not.
 2. In case it isn't (i.e. is NULL), it allocates a new block, sets the
next pointer to point to the new block. It then sets the 'previous'
pointer of the new block to point to the old block. Otherwise, it
directly goes to step 3.
 3. Everything addressed relative to the stack pointer (in the function
body) is copied onto the new block. This is so that the SP relative
loads and stores work correctly.
    For x86 systems, this includes the function arguments passed on
stack, a return address (not the original, mentioned below), the
previous stack pointer.
    I don't think we will need to copy the actual argument values in
case of a vararg function (we can't, in fact), since we'll be accessing
the arguments using the fields in the va_list structure (and they point
back to the previous block, where the arguments actually are).
    The return address placed on the new block is not the original
return address placed on the old stack by the callee, but is the address
of a another naked function (called cleanup_new_block) described below.
 4. The stack pointer is set to (the address of the new block + 3 * word
size).
 5. Control is returned to the prolgue, and execution continues
normally. We cannot use a usual ret here, so we'll have to use one of
the caller-saved registers to communicate the address of the next
instruction to setup_new_block.

**** Frame pointers
     In case the frame pointer is being maintained for the function, the
value at %rbp will also have to be copied to the new block, and %rbp
will have to be changed to point to this new location (so that the FP
relative indexing works). %rbp will then be set to its "correct" value
by the epilogue itself. We'll probably need a new routine
(set_new_block_fp) for this situation.

**** Stack space for setup_new_block
     setup_new_block and setup_new_block_fp will be written to use the
minimum amount of stack space possible, and that stack space will have
to be factored into the check in the function prologue (so every
function prologue checks for the space it needs _plus_ the space
required to run setup_new_block once, by a callee).

*** cleanup_new_block
    cleaup_new_block will be a simple operation, which restores the old
SP (saved_sp block header) and jumps to saved_ip (also from the block
header). Later, if we decide to free unused blocks then that cleanup can
be done here.

*** Eliminating allocations and checks
    It is evident that setup_new_block is too expensive to run in every
prologue. In fact, it is probably worth the extra effort to eliminate
the check in as many situations as possible. I was pointed to a paper
earlier - `Whole-Program Optimization for Time and Space Efficient
Threads', which talks about link-time determination of upper bounds on
stack space required by threads. It does this by creating a call-graph
and using the same techniques used for analyzing a program's control
flow graph (dominators, back edges, et. al.).
    So, if the function A is dominated by the function B, and A needs 32
bytes of stack space, and B needs 64, we can allocate 96 bytes in B's
prologue, and the check can be eliminated from A altogether. Even
otherwise, if leaf functions F, G and H are reachable only via I and J,
and F, G, H, I, J take f, g, h, i, j bytes of stack space respectively,
we should be able to check for (MAX (f, g, h) + i) bytes in I and (MAX
(f, g, h) + j) bytes in J; and eliminate the checks in F, G and H
altogether. In my opinion, this should lead to substantial savings,
especially keeping in mind the typical pattern of a module having a few
publicly visible functions which delegate the core functionality to a
battery of smaller functions with internal linkage and private visibility.

*** The red zone
    The red zone can be usable only if there actually _is_ 128 bytes of
stack space left in the current block. That will need to be factored
into the check (in X86FrameLowering::emitPrologue).

*** DWARF .eh_frame
    (This is largely unchanged from the previous version I sent on the
list. I've repeated it here for completeness).
    DWARF info can be generated as follows: since we know the offset of
base of the stack frame from the stack pointer (or we are maintaining a
frame pointer), we can always say whether the concerned frame is the
first frame in the block. If not, all the previous register values can
be computed as usual, and in the first case, we can add an extra
indirection, involving looking up the stack pointer saved in this
block's header.

*** Interoperability
    Clearly, we can't safely make calls from code using segmented-stacks
to code expecting a vanilla contiguous stack. This can be fixed by the
linker by padding calls into non-segmented-stack code from
segmented-stack code with allocation and de-allocation of a 64 k block.
For the linker to be able to do this, the compiler needs to emit a dummy
section into object files compiled with segmented stacks. The linker
will then use as a flag. This is exactly the approach mentioned in [1].
However, this approach is suboptimal, and to get full benefits of
segmented stacks, all code needs to be compiled with segmented stacks.

[1] http://gcc.gnu.org/wiki/SplitStacks

-- 
Sanjoy Das
http://playingwithpointers.com