[LLVMdev] [PATCH] Symbol offsets

Tue May 27 18:55:50 PDT 2014

On Tue, May 27, 2014 at 05:48:32PM -0700, Reid Kleckner wrote:
> I'm a little concerned we got prefix data wrong.  We had the following
> motivating use cases:
> 
> 1. Function prologue sigils, where we emit a special nop slide, maybe with
> data in it.  Peter implemented a ubsan feature using this.
> 
> 2. Function hotpatching, where we emit some data before the function and a
> special nop before the function.  Typically the nop is 'mov edi, edi' on
> x86 Windows, preceded by five bytes of padding for a long jump.  Profilers
> can use this to turn on and off instrumentation of a running binary.
> 
> 3. Tables-before-code, where data is completely prior to the code.  GHC
> needs this.
> 
> In all cases, any code inside the prologue had no meaning to LLVM.
> Inlining a function with a funky prologue is completely valid.
> 
> I worry that symbol_offset combined with prefix is too low-level.  What if
> we split this up into something like prefix data and "prologue" data?  Prefix
> data would be an arbitrary LLVM constant, and prologue data is a byte
> sequence of native executable code.  Something like:
> 
> define void @foo() prefix [2 x i8*] [ i8* @a, i8* @b ] prologue [4 x i8]
> c"\DE\AD\BE\EF" { ret void }
> 
> I think the two forms are fundamentally equivalent to optimizations like
> global constant propagation, but it'd be nice to have an intuitive
> representation.  One of the strengths of LLVM's IL is that it's
> comprehensible to mere mortal compiler engineers, and not just computer
> programs.

I like this proposal. Now that I've thought about it more, I think it might
not be too important for global variables and functions to share a similar
representation for offsets. One comment though.

Before, when I was thinking about calls to functions with prefix data in
cases where the function entry point appears after the data (i.e. GHC's use
case), I was imagining that we could have two new properties for functions:
the symbol offset and the entry point offset. UBSan etc. would set both to
zero, while GHC would set the former to zero and the latter to the size of
the prefix.
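
As a rough illustration of those two properties (a sketch only, not part of
the patch; the class, field names, and byte sizes are all hypothetical), the
two settings can be modelled like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionLayout:
    """Hypothetical model of one emitted function; not an LLVM API."""
    prefix_size: int         # bytes of prefix data at the start of the emitted block
    symbol_offset: int       # offset of the function's symbol from the block start
    entry_point_offset: int  # offset of the first instruction a call should reach

    def symbol_address(self, base: int) -> int:
        return base + self.symbol_offset

    def entry_address(self, base: int) -> int:
        return base + self.entry_point_offset

    def call_adjustment(self) -> int:
        # Distance a caller adds to the symbol address to reach the entry point.
        return self.entry_point_offset - self.symbol_offset


# UBSan-style: both offsets are zero (the sizes here are made up).
ubsan = FunctionLayout(prefix_size=8, symbol_offset=0, entry_point_offset=0)
# GHC-style: symbol at the start, entry point just past the prefix.
ghc = FunctionLayout(prefix_size=16, symbol_offset=0, entry_point_offset=16)
```

In this model a call through the symbol needs no adjustment in the UBSan
case, and an adjustment equal to the prefix size in the GHC case.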

Provided that we need to cater to platforms where the function's metadata
cannot appear before the function's symbol (which I believe to be the case
on at least Darwin), we need some way of representing the distance between
the symbol and the entry point in external function declarations. Under your
proposal, we could probably do that by having a way of representing the type
of the prefix separately from its "initializer".
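
To make that concrete, here is a minimal sketch under loud assumptions: the
declaration is assumed to carry only the prefix type, which is modelled as a
Python struct format string, and the helper name is invented for
illustration. Knowing the type's size is enough for a caller to find the
entry point without seeing the initializer:

```python
import struct


def entry_point_from_declaration(symbol_address: int, prefix_format: str) -> int:
    """Hypothetical helper: the entry point is the symbol plus the prefix size.

    prefix_format stands in for the prefix *type* on an external declaration;
    the prefix contents (the "initializer") are not needed.
    """
    return symbol_address + struct.calcsize(prefix_format)


# Assumed 64-bit target: a prefix of two pointers is modelled as "<QQ"
# (two 8-byte fields, no padding), so the entry point is 16 bytes in.
two_pointer_prefix = "<QQ"
entry = entry_point_from_declaration(0x2000, two_pointer_prefix)
```

The point of the sketch is only that the size comes from the type, so a
declaration in another module could supply it.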

Thanks,
-- 
Peter