[LLVMdev] Changing pointer representation?
Jules
jules at dsf.org.uk
Fri Dec 1 02:41:28 PST 2006
Having finally found some time to work on this project, I'm currently
looking at mechanisms of augmenting LLVM to catch out-of-bounds pointer
references.
For a variety of reasons, I don't think the approach taken by the
Safecode project is appropriate for mine -- particularly, I have no
requirement to interface to external code (all code in the system will
either be compiled using LLVM or written specifically to interface with
LLVM-compiled code), which invalidates a key assumption of that
project. Therefore, having looked at the available options, I've
decided a so-called "fat pointer" representation is ideal for my project.
I can see two possible approaches for this:
* Modify the LLVM machine-code backend to use a 64-bit pointer
representation (32-bit base address, which points to an object
descriptor, and a 32-bit offset from the base of the object for the data
item pointed to) on a 32-bit architecture (or 128 bits on a 64-bit
architecture), and then change the definition of the dereference
instruction to check the range with the descriptor, or
* Create an optimizer pass that performs a code translation, modifying
all places where pointers are stored to include base pointers and
offsets (i.e., replace 'zzz *' with '{{ int, [0 x zzz] }*, int}'), and
all places where pointers are referenced and dereferenced to track and
check the base and limits from the descriptors. It then becomes illegal
to perform indexing on a pointer that does not point to the base of an
object.
I'm currently leaning towards the latter, primarily because it seems
more general; in the end, I'm going to want at least x86 and x86-64
support, and the former approach will mean I'll need to do the work
twice for two different platforms.
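As a rough illustration of the second approach, the rewritten type
'{{ int, [0 x zzz] }*, int}' could be modelled in C roughly as below. This
is only a sketch under assumptions: the names obj_t, fatptr_t, HEADER_SIZE,
fat_alloc, fat_load_i32 and fat_store_i32 are invented for the example, the
length word is assumed to count data bytes, and the offset is assumed to be
measured in bytes from the start of the object (so data begins at offset 4).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of '{ int, [0 x zzz] }': a length word, then the data. */
typedef struct {
    uint32_t length;       /* assumed: number of data bytes that follow */
    unsigned char data[];  /* flexible array member, like [0 x zzz] */
} obj_t;

/* Hypothetical model of '{ { int, [0 x zzz] }*, int }'. */
typedef struct {
    obj_t   *base;    /* points at the object descriptor */
    uint32_t offset;  /* bytes from the start of the object */
} fatptr_t;

#define HEADER_SIZE ((uint32_t)offsetof(obj_t, data))

/* Allocate an object with nbytes of zeroed data; base is NULL on failure. */
static fatptr_t fat_alloc(uint32_t nbytes) {
    fatptr_t p = { NULL, 0 };
    obj_t *o = malloc(sizeof(obj_t) + nbytes);
    if (o != NULL) {
        o->length = nbytes;
        memset(o->data, 0, nbytes);
        p.base = o;
        p.offset = HEADER_SIZE;  /* first data byte */
    }
    return p;
}

/* Range check against the descriptor: is [offset, offset + size) in the data? */
static int fat_in_bounds(fatptr_t p, uint32_t size) {
    if (p.base == NULL || p.offset < HEADER_SIZE) return 0;
    if (size > p.base->length) return 0;
    return p.offset - HEADER_SIZE <= p.base->length - size;
}

/* Checked load: returns 1 and writes *out on success, 0 if out of bounds. */
static int fat_load_i32(fatptr_t p, int32_t *out) {
    if (!fat_in_bounds(p, sizeof *out)) return 0;
    memcpy(out, (unsigned char *)p.base + p.offset, sizeof *out);
    return 1;
}

/* Checked store: returns 1 on success, 0 if out of bounds. */
static int fat_store_i32(fatptr_t p, int32_t value) {
    if (!fat_in_bounds(p, sizeof value)) return 0;
    memcpy((unsigned char *)p.base + p.offset, &value, sizeof value);
    return 1;
}
```

A pass along these lines would emit the equivalent of fat_in_bounds before
each load and store; here a failed check just returns 0 rather than raising
an exception.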
I'm also trying to work out what to do about pointers to elements of
complex structures, and what kind of dereferencing is allowed on those.
My current feeling is:
* If an object has a descriptor associated, the lowest allowable offset
will be 4 (because offset 0 contains the length of the object). This
means I can reserve offset 0 as an indicator for 'this object doesn't
have a descriptor' and cause any dereferencing of the result of pointer
arithmetic to fail on objects with offset 0. I'd probably swap the
pointer for a special 'invalid pointer' value on detecting such arithmetic.
* All arrays should have a descriptor, wherever they're allocated: as
part of a complex type, directly on the stack, or on the heap.
* This means I'll need to change the behaviour of:
  * getelementptr, to set 'invalid pointer' values whenever an offset 0
  pointer is used with a nonzero index, or if the result of a manipulation
  would be to access offset 0 of a pointer that isn't at offset 0, and to
  skip the descriptor on arrays embedded inside a complex type
  * load and store instructions, to throw an exception on invalid
  pointers and check bounds on pointers with descriptors, and to load and
  store both base and offset whenever storing a pointer's data
  * Any instruction that generates a pointer as its result, to produce
  the base and offset rather than a simple pointer.
In most cases the offset will be zero. There's probably an
optimisation in this case that means the offset doesn't need to be
produced in many cases; perhaps by delaying its production until it is
stored in a pointer variable.
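The getelementptr and load/store rules listed above might look something
like the following when modelled in C. Again this is a sketch under stated
assumptions, not a worked-out design: fatptr_t, FAT_INVALID, fat_gep and
fat_access_ok are made-up names, the 'invalid pointer' value is arbitrarily
represented as a NULL base, indices are in bytes, negative results are
treated as invalid, and accesses through offset-0 pointers (objects with no
descriptor) are assumed to go unchecked.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HEADER_SIZE 4u  /* assumed: 32-bit length word at offset 0 */

typedef struct {
    unsigned char *base;  /* start of the object (descriptor, if any) */
    uint32_t offset;      /* 0 means "no descriptor" */
} fatptr_t;

/* Hypothetical encoding of the special 'invalid pointer' value. */
static const fatptr_t FAT_INVALID = { NULL, UINT32_MAX };

static int fat_is_invalid(fatptr_t p) {
    return p.base == NULL;
}

/* Model of the modified getelementptr, with the index given in bytes. */
static fatptr_t fat_gep(fatptr_t p, int32_t delta) {
    if (fat_is_invalid(p)) return FAT_INVALID;
    /* Offset-0 pointers may not take part in nonzero indexing. */
    if (p.offset == 0) return delta == 0 ? p : FAT_INVALID;
    /* A pointer with a descriptor must never end up at offset 0
       (or before it, which is an assumption of this sketch). */
    int64_t moved = (int64_t)p.offset + (int64_t)delta;
    if (moved <= 0 || moved > (int64_t)UINT32_MAX) return FAT_INVALID;
    p.offset = (uint32_t)moved;
    return p;
}

/* Model of the check before a load or store of `size` bytes:
   returns 1 if the access may proceed, 0 if it should fault. */
static int fat_access_ok(fatptr_t p, uint32_t size) {
    uint32_t length;
    if (fat_is_invalid(p)) return 0;
    if (p.offset == 0) return 1;           /* no descriptor: unchecked */
    if (p.offset < HEADER_SIZE) return 0;  /* inside the length word */
    memcpy(&length, p.base, sizeof length);
    if (size > length) return 0;
    return p.offset - HEADER_SIZE <= length - size;
}
```

Note that fat_gep does not check the upper bound in this model; that is
left to the load/store check, so a pointer may be moved past the end of an
object as long as it is never dereferenced there.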
It occurs to me that some of the people here have surely worked on this
kind of thing before, and perhaps can relate some experiences of things
that have either worked or not worked. Am I doing anything stupid here?
Thanks!
Jules