[llvm-dev] Contributing a new sanitizer for pointer casts

Tue Apr 25 06:54:12 PDT 2017

Hi all,

Some of you might remember that at EuroLLVM last year in Barcelona, 
Chris Diamand and I gave a talk about Clang/libcrunch, a run-time 
checking system which can be thought of as another flavour of sanitizer. 
It checks pointer casts, using run-time type information. Roughly, the
check is that the pointer really points to an instance of the target
type, though there are refinements to deal with various idioms that
violate this. <http://www.llvm.org/devmtg/2016-03/#presentation9>
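
To make the basic check concrete, here is a toy sketch in C. It is not
libcrunch code: record_alloc(), is_a() and the fixed-size table are
hypothetical names invented for illustration, and type identity is
modelled with plain strings. The only idea it tries to show is that the
runtime remembers what type each allocation was created with, so that a
later cast can be tested against that record.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ALLOCS 16

/* Toy allocation record: where an object starts and what type it has.
 * Hypothetical layout, for illustration only. */
struct alloc_record {
    const void *base;
    const char *type_name;
};

static struct alloc_record records[MAX_ALLOCS];
static size_t n_records;

/* Remember that the object at `base` was created with the named type. */
void record_alloc(const void *base, const char *type_name)
{
    assert(n_records < MAX_ALLOCS);
    records[n_records].base = base;
    records[n_records].type_name = type_name;
    n_records++;
}

/* Return 1 if `p` is the start of a recorded object of the named type,
 * and 0 otherwise (including when nothing is known about `p`). */
int is_a(const void *p, const char *type_name)
{
    for (size_t i = 0; i < n_records; i++) {
        if (records[i].base == p)
            return strcmp(records[i].type_name, type_name) == 0;
    }
    return 0;
}

struct widget { int id; };
struct gadget { double weight; };
```

In this model, after record_alloc(&w, "widget") for some struct widget
w, a cast of &w to (struct widget *) passes is_a(), while a cast of the
same pointer to (struct gadget *) fails it and would be reported.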

(I dropped a mention of this in the recent TBAA sanitizer thread, but
consensus was that on balance it's a different enough tool to want
both.)

My current research funding has some room for tech transfer activity, so
I've been spending some time on improving the code, with a hope of
eventually contributing it to LLVM.

This mail is just to get a handle on two questions: how much interest is 
there in this, and what changes are most important in order to get 
something contributable?

The system is a bit complex, so let me give you an overview of how it
currently works. (If you want full technical details, there are a couple
of research papers you could read -- see the bottom.)

- Instrumentation: this adds checks on (most) pointer casts, and in a
few other places. It also does a little source-level analysis to dump
information about allocation sites. We have both (my original) CIL and
(Chris's) Clang/LLVM implementations of this. The Clang version is not
too pretty at present: it uses -include'd inline helper functions
written in C and shared with the CIL implementation. It also requires a
bit of a hack to propagate certain type info (in uses of "sizeof")
onwards to LLVM so it can be used in a data-flow analysis.

- Hints from the programmer: these are necessary to declare allocation
functions, besides standard ones (malloc etc.). This is currently done
with an environment variable (LIBALLOCS_ALLOC_FNS) though I've thought
of adding a command-line option too. These declarations have effect at
both compile and link time.

- Compiler wrapper and helper tools: currently a mixture of shell,
Python and C++ helpers building on a pile of my own libraries
(libdwarfpp, dwarfidl, liballocstool), for DWARF analysis and
postprocessing. Roughly, these are responsible for generating and
linking the run-time type information itself.

- Type information. These are autogenerated, uniqued / COMDAT'd
instances of a moderately complex (but compact) C struct, one for each
distinct data type. The model of type info (but not the representation)
is somewhat DWARF-inspired.

- Runtime. This is a preloadable shared library which does the 
dispatching of the checks. It also gets its hooks into various places to 
load type info as necessary, and to observe various kinds of allocation 
happening within the process. Again it builds on a pile of my other 
stuff (liballocs, which builds on trap-syscalls, mallochooks, 
libdlbind).
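
As a rough illustration of the "Type information" item above, here is a
sketch in C of what a per-type descriptor might look like and how a
check could consult it. The struct layout and the names type_desc,
member_desc and contains_at() are assumptions made up for this example,
not the actual libcrunch representation. It assumes each type has
exactly one descriptor, so pointer equality identifies a type, and it
shows one plausible refinement of the basic check: accepting a pointer
to a struct as a pointer to a subobject found at the same offset.

```c
#include <assert.h>
#include <stddef.h>

struct type_desc;

/* One member of a composite type (hypothetical layout). */
struct member_desc {
    size_t offset;                 /* byte offset within the outer type */
    const struct type_desc *type;  /* descriptor of the member's type */
};

/* One descriptor per distinct data type (hypothetical layout). */
struct type_desc {
    const char *name;
    size_t size;
    size_t n_members;                   /* 0 for base types */
    const struct member_desc *members;  /* NULL for base types */
};

/* Does an object of type `outer` hold an instance of `target` starting
 * at byte `offset`? Descriptors are assumed unique per type, so
 * comparing pointers is enough to compare types. */
int contains_at(const struct type_desc *outer, size_t offset,
                const struct type_desc *target)
{
    assert(outer != NULL && target != NULL);
    if (offset == 0 && outer == target)
        return 1;
    for (size_t i = 0; i < outer->n_members; i++) {
        const struct member_desc *m = &outer->members[i];
        if (offset >= m->offset
            && offset - m->offset < m->type->size
            && contains_at(m->type, offset - m->offset, target))
            return 1;
    }
    return 0;
}

/* Example types and hand-written descriptors for them. */
struct point { int x; int y; };
struct circle { struct point centre; int radius; };

const struct type_desc int_desc = { "int", sizeof(int), 0, NULL };

const struct member_desc point_members[] = {
    { offsetof(struct point, x), &int_desc },
    { offsetof(struct point, y), &int_desc },
};
const struct type_desc point_desc =
    { "point", sizeof(struct point), 2, point_members };

const struct member_desc circle_members[] = {
    { offsetof(struct circle, centre), &point_desc },
    { offsetof(struct circle, radius), &int_desc },
};
const struct type_desc circle_desc =
    { "circle", sizeof(struct circle), 2, circle_members };
```

With these descriptors, a pointer to the start of a struct circle is
accepted as a struct point * (its first member), whereas a pointer to a
struct point is not accepted as a struct circle *.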

Currently, my plan in a nutshell is to eliminate the C inline helpers in
favour of fully IR-level instrumentation, and also eliminate the
compiler wrapper in favour of a gold plugin (and maybe a bit of help in
the clang driver). This should result in a contributable diff that adds
a new sanitizer option (currently "-fsanitize=crunch", but name
negotiable :-). Binaries built this way will also require the gold
plugin and runtime (both out-of-tree) to do useful checking.

I don't intend to port the runtime. Although in principle this could
share code with the sanitizer runtimes, that's a lot of work and I don't
have the resources to take it on right now. Barring major rewrites, the
runtime pretty much has to be GPL-licensed anyway, since it borrows code
from glibc and Xen (for purposes I'm pretty sure are not covered by the
sanitizer runtimes).

So my questions for you are whether this contribution would be welcome, 
and in particular any red lines about how to do instrumentation, how to 
factor everything, and how to deal with the external dependencies. As I 
currently envisage things, the gold plugin must live out-of-tree since 
it will require my libraries to build; I don't believe equivalent
library support exists within LLVM. Keeping it out-of-tree seems no
great loss, given that the runtime will be out-of-tree too.

Oh, and runtime support exists for x86-64/Linux only at the moment, 
though there is a bit of code for FreeBSD.

For the interested, here are the research papers I mentioned.

"Dynamically diagnosing run-time type errors in unsafe code" (OOPSLA '16)
http://www.cl.cam.ac.uk/~srk31/#oopsla16a

"Towards a dynamic object model within Unix processes" (Onward! '15)
http://www.cl.cam.ac.uk/~srk31/#onward15

Code:
<https://github.com/stephenrkell/liballocs>
<https://github.com/stephenrkell/libcrunch>
<https://github.com/stephenrkell/clangcrunch>

All thoughts appreciated... let me know if you see any obstacles to
contribution, or if you're able to help, or just if you have questions.
Much obliged,

Stephen.