[llvm-commits] Updated SAFECode Patch

John Criswell criswell@illinois.edu
Fri Jul 22 08:34:47 PDT 2011


On 7/21/11 2:38 AM, Chris Lattner wrote:
> On Jul 19, 2011, at 3:36 PM, John Criswell wrote:
>> Dear All,
>>
>> Attached is an updated version of the SAFECode patch.  This patch:
>>
>> 1) Is updated to the new LLVM type API; and
>> 2) Can just be applied using patch -p0<  scpatch instead of having to untar an archive of changes.
>> 3) Perhaps I forgot it in previous posts, but using patch -p0<  testpatch  in the test-suite directory will provide some test Makefiles which can be used by using "make TEST=sc report"
>> 4) It includes a RewriteOOB pass which implements a feature called Out of Bounds Pointer rewriting.  Basically, it allows SAFECode to permit out of bound pointers provided that such pointers are not dereferenced.  It's an optional feature.
> Hi John,
>
> Thanks for updating this and I apologize again 10x for the delay reviewing this.  This is a *lot* of code to wade through.  Overall, the code is well structured and well commented.

Thank you for reviewing our patch as well as the above compliment.
:)

> Here are a few micro-level comments:
>
> * Passes like InsertGEPChecks should be in a .cpp file that matches the pass name, and don't need the class to be exposed in a public header file (for the same reason the SCCP class (for example) is private to its own .cpp file).

I made my transform pass header files public so that any tool can create 
an instantiation of any of the SAFECode passes (to enable code reuse), 
and I advise graduate students at Illinois to do the same.  Is there a 
better way of allowing code to create an instantiation of a pass without 
#including a file with the pass's class definition?
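
If the preferred pattern is the one SCCP uses, I'm guessing it would look
roughly like the sketch below (illustrative only; createInsertGEPChecksPass()
is a name I'm making up here, and the pass class itself would stay private to
its .cpp file):

  // InsertGEPChecks.h -- only a factory function is exposed publicly.
  namespace llvm {
    class FunctionPass;
    FunctionPass *createInsertGEPChecksPass();
  }

  // InsertGEPChecks.cpp -- the pass class stays local to this file.
  #include "InsertGEPChecks.h"
  #include "llvm/Pass.h"
  namespace {
    struct InsertGEPChecks : public llvm::FunctionPass {
      static char ID;
      InsertGEPChecks() : llvm::FunctionPass(ID) {}
      virtual bool runOnFunction(llvm::Function &F) {
        // ... insert the GEP checks here ...
        return true;
      }
    };
  }
  char InsertGEPChecks::ID = 0;

  llvm::FunctionPass *llvm::createInsertGEPChecksPass() {
    return new InsertGEPChecks();
  }

Tools would then call the factory function instead of #including the class
definition.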

> * There is some '#if 0' code that should be removed.

Some code is disabled because it doesn't work without other SAFECode 
optimizations that I did not include in the initial patch.  If I 
resubmit the patch (I'll discuss that below), do you still want such 
code removed, or would comments indicating why it's disabled suffice?

> * Did you guys actually write the asm blob in strncpy_asm?  Are you sure the code is not GPL?  It looks dead anyway, dead code should be zapped.

I believe we did, but I will double check.

> * Found a bug! :)
>
> +  Constant * GAVConst = M->getOrInsertFunction ("pchk_getActualValue",
> +                                                VoidPtrTy,
> +                                                VoidPtrTy,
> +                                                VoidPtrTy,
> +                                                NULL);
> +  Function * GetActualValue = dyn_cast<Function>(GAVConst);
> ...
> + use(GetActualValue)
>
> The dyn_cast should be a cast<>  or check for null.
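
Good catch; we'll fix that.  I assume the intended fix is just the one-liner
below (a sketch against the code quoted above):

  // cast<> asserts if GAVConst is not actually a Function, instead of
  // silently returning null the way dyn_cast<> does.
  Function * GetActualValue = cast<Function>(GAVConst);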
>
> * The code isn't fully detangled from the other projects in the research group, there are still references to llva, declarations but no definition of things like pool_rindex, etc.

Ah, I see.  Some of those (like pool_rindex) are from safe versions of 
the C library functions that were just being added when I started 
creating the patch.  I can either include their implementation or remove 
them if I revise the patch.

Good catch on the llva_ references.  We'll get rid of those or refactor 
them into a different pass.

> * There are lots of minor coding style issues, such as the extra spaces in the code, some 80 col violations etc:

Could you clarify what you mean by extra spaces?  Do you mean blank 
lines, incorrect indentation, or something different?  I just want to 
make sure I understand what is different/wrong.

>
> +static inline Value *
> +peelCasts (Value * PointerOperand, std::set<Value *>  &  Chain) {
> +  Value * SourcePointer = PointerOperand;
>
> * Inefficient containers like std::set and std::map are used pervasively, as are verboten classes like std::ostream.

Is using the LLVM classes required or merely suggested?  I tend to code 
using the standard C++ containers because 1) I've run into weird 
behavior with the LLVM containers in the past, 2) the C++ containers 
always have functioning iterators, and 3) they're fast enough for what 
these transforms are doing.

I agree on getting rid of std::ostream.
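
For my own notes, I assume the substitutions being suggested look roughly
like this (a sketch only; the small-size parameters are arbitrary):

  #include "llvm/Value.h"
  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/ADT/DenseMap.h"
  #include "llvm/Support/raw_ostream.h"

  void example(llvm::Value *V) {
    // std::set<Value *>            ->  SmallPtrSet (sets of pointers)
    llvm::SmallPtrSet<llvm::Value *, 16> Chain;
    Chain.insert(V);

    // std::map<Value *, unsigned>  ->  DenseMap (pointer-keyed maps)
    llvm::DenseMap<llvm::Value *, unsigned> CheckKind;
    CheckKind[V] = 0;

    // std::ostream / std::cerr     ->  raw_ostream / errs()
    llvm::errs() << "visited " << Chain.size() << " pointers\n";
  }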

> * There are linux-system specific details in the code, I suspect it's not portable to other systems.  I also see various uses of unistd.h, errno.h etc.

There are some pieces of code that have to be customized to a particular 
OS.  For example, the memory remapping routines have to be specialized 
to each OS and have been adapted to Mac OS X and Linux.  The same is 
true for finding the program counter value when a segfault occurs.  
However, I believe all of this code is contained within the run-time; 
the transforms should be OS independent.

Are these small bits of OS dependency a problem for inclusion into LLVM, 
or were you thinking that I had used OS-specific features unnecessarily?

Regarding the header files, is it a problem to have a run-time that uses 
unistd.h, errno.h, etc.?  I think these are standard POSIX header files.

> * Why do you have an implementation of a SplayTree?  Why is it better than a hash table or other data structure?

For the run-time checks, the question we are asking is whether a pointer 
falls within a valid object.  Therefore, what we need is a container 
that contains a set of ranges (lower and upper bounds on memory objects) 
and can be searched (i.e., given a value, does it fall within a range 
within the container).

Hash tables don't do this (as far as I know).  So, we are left with tree 
data structures.  The splay tree has a caching property which is useful 
for memory safety checks because there's usually a small working set of 
memory objects that are being checked.  The splay tree moves the ranges 
for these objects to the top of the tree, and so they're found more quickly.

In short, splay trees are the best range-lookup data structure that we 
know of.
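
To put the lookup in concrete terms, the query the run-time has to answer is
roughly the one below (sketched with a std::map just to show the operation;
the actual run-time uses a splay tree so that recently checked objects
migrate toward the root):

  #include <cstdint>
  #include <map>

  // Map each registered object's start address to its last valid address.
  typedef std::map<uintptr_t, uintptr_t> ObjectRanges;

  // Does Ptr fall inside some registered object [start, end]?  A hash table
  // cannot answer this, because Ptr rarely equals an object's start address,
  // so there is no exact key to hash.  An ordered structure can.
  bool findObjectContaining(const ObjectRanges &Objects, uintptr_t Ptr) {
    ObjectRanges::const_iterator I = Objects.upper_bound(Ptr);
    if (I == Objects.begin())
      return false;              // no object starts at or below Ptr
    --I;                         // object with the largest start <= Ptr
    return Ptr <= I->second;     // inside if Ptr does not pass the end
  }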

> * The files have the old LLVM copyright header on them.

Thanks for catching that.  We'll fix it.

>
> * I see that the runtime library contains a bunch of pool allocation related stuff, is that used by the transformation patches or used by other parts of safecode that can't be included because they depend on DSA?

The pool allocation functions are needed for those that choose to use 
Automatic Pool Allocation (which will remain out-of-tree because it 
depends on DSA) to remove run-time checks and enforce sound points-to 
analysis and type-inference results.  There are also a few run-time 
functions used for features not included yet in the patch (e.g., the 
dangling pointer detection system which is enabled by a separate transform).

> * Where is the documentation on how to use this? :-)  What are the features and limitations of this work in the absence of DSA and the other pieces?  You've made claims that this is generally useful for partial applications and other code, is this really true how does it work?

Sorry, I overlooked documentation.  We have documentation on using 
SAFECode, but it will need to be updated since it describes driving 
SAFECode through the sc tool, and one of the reasons for moving SAFECode 
into LLVM is to avoid needing the sc tool.
:)

As far as functionality goes, the short answer is that it is similar to 
valgrind, but because it is a set of compiler transforms it should be 
much faster, and it can accurately check for array bounds violations 
(valgrind has to use heuristics for stack/global objects and is very 
slow when that option is enabled).

As for the long answer, the run-time checks included in the patch 
perform two operations.  First, they ensure that an LLVM load or store 
accesses a valid memory object.  Second, they ensure that a GEP doesn't 
move a pointer that is pointing within a memory object to some location 
outside of that same memory object.  The run-time checks will either 
halt the program (in production mode) or print debugging information 
about the memory safety error (if in debugging mode).
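
At the source level, the effect is roughly the following (illustrative only:
the check names below are placeholders rather than the actual run-time entry
points, and the real transforms operate on LLVM IR rather than C source):

  // Hypothetical check routines standing in for the real run-time calls.
  extern "C" void  placeholder_loadstore_check(void *Ptr);
  extern "C" void *placeholder_gep_check(void *Source, void *Result);

  void zeroElement(int *Array, unsigned Index) {
    // Original code:  Array[Index] = 0;
    // Instrumented (conceptually):
    int *Elem = static_cast<int *>(
        placeholder_gep_check(Array, Array + Index)); // GEP stays in object?
    placeholder_loadstore_check(Elem);                // store hits a valid object?
    *Elem = 0;
  }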

Since the instrumentation in the patch is designed to work on separate 
compilation units, the checks it inserts are "incomplete" checks.  This 
is done because, without whole program analysis, the compiler cannot 
tell if the pointer points to memory objects that are always created by 
the program or if it could point to memory objects allocated by external 
library code.  When a check is incomplete, the above checks are relaxed 
as follows:

a) Load/Store checks: In production mode, a check failure is ignored, 
since failing to find a valid memory object doesn't imply that the 
pointer is invalid (e.g., the pointer could be pointing to a stack 
memory object allocated by external code).  In debugging mode, a 
failure prints a warning.

b) GEP checks: In both production mode and debugging mode, an incomplete 
GEP check will first see if the source pointer (the pointer operand to 
the GEP) is within a memory object.  If so, it makes sure the result of 
the GEP is within the same object.  If it can't find the memory object, 
it does nothing.
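
Concretely, the relaxed behavior of an incomplete GEP check is roughly the
following (a sketch only; the function names are placeholders for the real
run-time routines):

  #include <cstdint>

  // Placeholder declarations standing in for the real run-time facilities.
  bool lookupObject(void *Ptr, uintptr_t *Start, uintptr_t *End);
  void reportOrHalt(void *Source, void *Result);

  void *incompleteGEPCheck(void *Source, void *Result) {
    uintptr_t Start, End;
    if (!lookupObject(Source, &Start, &End))
      return Result;      // unknown object: possibly external memory, do nothing
    uintptr_t R = reinterpret_cast<uintptr_t>(Result);
    if (R < Start || R > End)
      reportOrHalt(Source, Result);  // halt (production) or print diagnostics (debug)
    return Result;
  }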

There is a separate SAFECode pass that uses DSA to change incomplete 
checks to complete checks, but it is not included in the patch.  I think 
it should be possible to build another incomplete-to-complete transform 
pass that does not use DSA, but I haven't worked on that yet, and I'm 
not sure if it's really needed, as incomplete checks are still useful for 
debugging memory safety errors.  On a related note, some or all of the 
exploits that we detected and prevented in the Linux kernel were caught 
using incomplete checks.

>
>
> Anyway, those are some thoughts on the first pass through the code.  As it is, the code isn't ready to go in, and even if it were, it would be impossible to really review as a monolith.  Beyond that, I have my concerns about who is going to maintain and push forward this code going forward.  I'm not accusing you guys of it, but I have a general concerned about "abandonware" in the codebase, and do routinely purge apparently dead passes from the codebase.  Do you have particular clients that are actively using this and will defend against me not purging it in a future release?

Well, this is the catch-22 situation that I think we find ourselves in.  
We think that people are interested in memory safety tools like ours; a 
few people on llvmdev have expressed interest in memory safety tools, 
and many people use valgrind.  However, we believe that people don't try 
out or use tools like SAFECode in their current form because they're not 
convenient (i.e., not integrated into the front-end compiler with a 
simple command-line switch).  Likewise, it's difficult for us to get our 
code integrated into LLVM/Clang because no one uses it.
:)

That said, we at Illinois will be a strong client since we are having 
SAFECode evaluated over the next couple of years as part of a research 
program we're in.  Having better Clang/SAFECode integration in some form 
or another is essential to what we're doing.

> If you're driven to get this better integrated into LLVM and available to a wider community, there are two routes you can take:
>
> 1. The work can be decomposed into small and incremental pieces that each deliver specific value, and you can propose and integrate these one piece at a time.  For example, start with one of the passes and a minimal runtime library that will catch some subset of the bugs.  Once that is in, rinse and repeat for the next piece and the next piece etc.  This is a really long road for this much code, but particularly if it is delivering concrete value and there is a client that will be using it, it is the right way to get it into the project.

The problem is that nearly all of the transforms I submitted *are* 
needed to get anything useful.  There are two passes to insert run-time 
checks, three or four to instrument the code to register memory objects 
with the run-time, one to optimize trivial checks (which is essential to 
getting tolerable performance), and one utility pass that supports 
Out-of-Bounds pointer rewriting (this allows GEPs to go out-of-bounds as 
long as the resulting pointer isn't dereferenced).  A smaller patch 
would jettison one of the check instrumentation passes and the 
Out-of-Bounds transform, both of which are trivial passes.

The only real "bloat" in the patch is the run-time library.  Personally, 
I'm not convinced that it's the best use of time to rip parts of it out 
and then re-add those parts back in later.

> 2. You can rework the high level hooks of this to use the new PassManagerBuilder.h interface to "hook into" extension points in the standard pass manager, rework this code to be a clang plugin, and keep the code available in the safecode.llvm.org repo.  Doing this engineering work (to make it trivial to turn on with a stock clang) is really required for #1 anyway.  This will make it much easier for people to try safecode out, and we're pushing to split other parts of the codebase out into plugins as well (e.g. the clang static analyzer and the ARC migration tool), so hopefully that plugin infrastructure will continue to develop.

Is there documentation on how to do this?  How mature is this plugin 
infrastructure?  How many other tools (e.g., clang static analyzer) are 
using it now?

Provided that configuring Clang to pick up and use the plugin is simple, 
this may be an attractive option as it seems to combine the best of both 
worlds (easy-to-use SAFECode while keeping SAFECode development fast).

> I guess option #3 is to maintain the status quo and keep safecode in its own repo right now.  Depending on what your goals are, this is actually a very attractive option for the same reasons that LLDB being in its own repo is: you can choose when and how to synch up to llvm (lldb just uses a "current recommend revision # of llvm), being a subproject allows you to move more quickly and make changes when it makes sense to you.

I am (painfully?) aware of the pros and cons of keeping code out of 
tree.  I agree that keeping SAFECode in its own tree makes development 
much faster (which I think is valuable) at the cost of chasing LLVM's 
ever-changing API.  However, SAFECode differs from LLDB in that LLDB is 
its own stand-alone tool while SAFECode is essentially an extension of 
an existing compiler.  Since SAFECode has not been well-integrated into 
the compiler, it is too cumbersome to use.

>
> Anyway, I guess my message is that the route forward depends on what your goals are.  If your goal is to get it into mainline LLVM to get more users, then I think that things are backwards: we really want there to be end users driving things into the compiler, not the other way around.  If you're looking to reduce the cost of merging mainline API changes, there are *many* opportunities to adopt the C API in these passes, which would greatly reduce the amount of churn required to keep stuff working.  Even though there isn't C API for everything, there is for the most volatile stuff.

Making SAFECode easier for people to use is the primary goal.  A 
secondary goal is to have less code maintenance on our end.  Much of our 
maintenance is keeping up with LLVM API changes; if we don't work with 
mainline or the last release, we quickly fall into irrelevance.  A third 
goal is to provide a common infrastructure upon which memory safety 
tools can be built.  It would be nice, for example, to be able to use 
the same optimizations (e.g., static array bounds checking) for 
different memory safety implementations.

> I hope this helps,

It's been very insightful.  Thanks again for the review.

I think we'll seriously explore the option of making SAFECode a Clang 
plugin.  If you could direct us to documentation on how to do that, 
that'd be great.

While my other questions on the patch may not be as important if we go 
the plugin route, I'd like to know simply for my own edification (in 
case I need to submit a patch on something else in the future).

Thanks again, Chris.

-- John T.

>
> -Chris
>
>
>
>
>




