[llvm-commits] Updated SAFECode Patch
John Criswell
criswell at illinois.edu
Fri Jul 22 08:34:47 PDT 2011
On 7/21/11 2:38 AM, Chris Lattner wrote:
> On Jul 19, 2011, at 3:36 PM, John Criswell wrote:
>> Dear All,
>>
>> Attached is an updated version of the SAFECode patch. This patch:
>>
>> 1) Is updated to the new LLVM type API; and
>> 2) Can just be applied using patch -p0< scpatch instead of having to untar an archive of changes.
>> 3) Perhaps I forgot it in previous posts, but using patch -p0< testpatch in the test-suite directory will provide some test Makefiles which can be used by using "make TEST=sc report"
>> 4) It includes a RewriteOOB pass which implements a feature called Out of Bounds Pointer rewriting. Basically, it allows SAFECode to permit out of bound pointers provided that such pointers are not dereferenced. It's an optional feature.
> Hi John,
>
> Thanks for updating this and I apologize again 10x for the delay reviewing this. This is a *lot* of code to wade through. Overall, the code is well structured and well commented.
Thank you for reviewing our patch as well as the above compliment.
:)
> Here are a few micro-level comments:
>
> * Passes like InsertGEPChecks should be in a .cpp file that matches the pass name, and don't need the class to be exposed in a public header file (for the same reason the SCCP class (for example) is private to its own .cpp file).
I made my transform pass header files public so that any tool can create
an instantiation of any of the SAFECode passes (to enable code reuse),
and I advise graduate students at Illinois to do the same. Is there a
better way of allowing code to create an instantiation of a pass without
#including a file with the pass's class definition?
> * There is some '#if 0' code that should be removed.
Some code is disabled because it doesn't work without other SAFECode
optimizations that I did not include in the initial patch. If I
resubmit the patch (I'll discuss that below), do you still want such
code removed, or would comments indicating why it's disabled suffice?
> * Did you guys actually write the asm blob in strncpy_asm? Are you sure the code is not GPL? It looks dead anyway, dead code should be zapped.
I believe we did, but I will double check.
> * Found a bug! :)
>
> + Constant * GAVConst = M->getOrInsertFunction ("pchk_getActualValue",
> + VoidPtrTy,
> + VoidPtrTy,
> + VoidPtrTy,
> + NULL);
> + Function * GetActualValue = dyn_cast<Function>(GAVConst);
> ...
> + use(GetActualValue)
>
> The dyn_cast should be a cast<> or check for null.
>
> * The code isn't fully detangled from the other projects in the research group, there are still references to llva, declarations but no definition of things like pool_rindex, etc.
Ah, I see. Some of those (like pool_rindex) are from safe versions of
the C library functions that were just being added when I started
creating the patch. I can either include their implementation or remove
them if I revise the patch.
Good catch on the llva_ references. We'll get rid of those or refactor
them into a different pass.
> * There are lots of minor coding style issues, such as the extra spaces in the code, some 80 col violations etc:
Could you clarify what you mean by extra spaces? Do you mean blank
lines, incorrect indentation, or something different? I just want to
make sure I understand what is different/wrong.
>
> +static inline Value *
> +peelCasts (Value * PointerOperand, std::set<Value *> & Chain) {
> + Value * SourcePointer = PointerOperand;
>
> * Inefficient containers like std::set and std::map are used pervasively, as are verboten classes like std::ostream.
Is using the LLVM classes required or merely suggested? I tend to code
using the standard C++ containers because 1) I've run into weird
behavior with the LLVM containers in the past; and 2) the C++ containers
always have functioning iterators; and 3) they're fast enough for what
these transforms are doing.
I agree on getting rid of std::ostream.
> * There are linux-system specific details in the code, I suspect it's not portable to other systems. I also see various uses of unistd.h, errno.h etc.
There are some pieces of code that have to be customized to a particular
OS. For example, the memory remapping routines have to be specialized
to each OS and have been adapted to Mac OS X and Linux. The same is
true for finding the program counter value when a segfault occurs.
However, I believe all of this code is contained within the run-time;
the transforms should be OS independent.
Are these small bits of OS dependency a problem for inclusion into LLVM,
or were you thinking that I had used OS-specific features unnecessarily?
Regarding the header files, is it a problem to have a run-time that uses
unistd.h, ernno.h, etc.? I think these are standard POSIX header files.
> * Why do you have an implementation of a SplayTree? Why is it better than a hash table or other data structure?
For the run-time checks, the question we are asking is whether a pointer
falls within a valid object. Therefore, what we need is a container
that contains a set of ranges (lower and upper bounds on memory objects)
and can be searched (i.e., given a value, does it fall within a range
within the container).
Hash tables don't do this (as far as I know). So, we are left with tree
data structures. The splay tree has a caching property which is useful
for memory safety checks because there's usually a small working set of
memory objects that are being checked. The splay tree moves the ranges
for these objects to the top of the tree, and so they're found more quickly.
In short, splay trees are the best lookup data structure for ranges of
which we know.
> * The files have the old LLVM copyright header on them.
Thanks for catching that. We'll fix it.
>
> * I see that the runtime library contains a bunch of pool allocation related stuff, is that used by the transformation patches or used by other parts of safecode that can't be included because they depend on DSA?
The pool allocation functions are needed for those that choose to use
Automatic Pool Allocation (which will remain out-of-tree because it
depends on DSA) to remove run-time checks and enforce sound points-to
analysis and type-inference results. There are also a few run-time
functions used for features not included yet in the patch (e.g., the
dangling pointer detection system which is enabled by a separate transform).
> * Where is the documentation on how to use this? :-) What are the features and limitations of this work in the absence of DSA and the other pieces? You've made claims that this is generally useful for partial applications and other code, is this really true how does it work?
Sorry, I overlooked documentation. We have documentation on using
SAFECode, but it will need to be updated since it describes how to use
SAFECode using the sc tool, and one of the reasons to move SAFECode into
LLVM is to avoid having to use the sc tool.
:)
As far as functionality, the short answer is that it is similar to
valgrind but should be faster, be able to accurately check for array
bounds violations (valgrind has to use heuristics for stack/global
objects, and it is very slow when this option is enabled), and is much
faster since it's a set of compiler transforms.
As for the long answer, the run-time checks included in the patch
perform two operations. First, they ensure that an LLVM load or store
accesses a valid memory object. Second, they ensure that a GEP doesn't
move a pointer that is pointing within a memory object to some location
outside of that same memory object. The run-time checks will either
halt the program (in production mode) or print debugging information
about the memory safety error (if in debugging mode).
Since the instrumentation in the patch is designed to work on separation
compilation units, the checks it inserts are "incomplete" checks. This
is done because, without whole program analysis, the compiler cannot
tell if the pointer points to memory objects that are always created by
the program or if it could point to memory objects allocated by external
library code. When a check is incomplete, the above checks are relaxed
as follows:
a) Load/Store checks: In production mode, the check is completely
ignored since failing to find a valid memory object doesn't imply that
the pointer is invalid (e.g., the pointer could be pointing to a stack
memory object allocated by external code). In debugging mode, it prints
a warning.
b) GEP checks: In both production mode and debugging mode, an incomplete
GEP check will first see if the source pointer (the pointer operand to
the GEP) is within a memory object. If so, it makes sure the result of
the GEP is within the same object. If it can't find the memory object,
it does nothing.
There is a separate SAFECode pass that uses DSA to change incomplete
checks to complete checks, but it is not included in the patch. I think
it should be possible to build another incomplete-to-complete transform
pass that does not use DSA, but I haven't worked on that yet, and I'm
not sure if its' really needed as incomplete checks are still useful for
debugging memory safety errors. On a related note, some or all of the
exploits that we detected and prevented in the Linux kernel were caught
using incomplete checks.
>
>
> Anyway, those are some thoughts on the first pass through the code. As it is, the code isn't ready to go in, and even if it were, it would be impossible to really review as a monolith. Beyond that, I have my concerns about who is going to maintain and push forward this code going forward. I'm not accusing you guys of it, but I have a general concerned about "abandonware" in the codebase, and do routinely purge apparently dead passes from the codebase. Do you have particular clients that are actively using this and will defend against me not purging it in a future release?
Well, this is the catch-22 situation that I think we find ourselves in.
We think that people are interested in memory safety tools like ours; a
few people on llvmdev have expressed interest in memory safety tools,
and many people use valgrind. However, we believe that people don't try
out or use tools like SAFECode in their current form because they're not
convenient (i.e., not integrated into the front-end compiler with a
simple command-line switch). Likewise, it's difficult for us to get our
code integrated into LLVM/Clang because no one uses it.
:)
That said, we at Illinois will be a strong client since we are having
SAFECode evaluated over the next couple of years as part of a research
program we're in. Having better Clang/SAFECode integration in some form
or another is essential to what we're doing.
> If you're driven to get this better integrated into LLVM and available to a wider community, there are two routes you can take:
>
> 1. The work can be decomposed into small and incremental pieces that each deliver specific value, and you can propose and integrate these one piece at a time. For example, start with one of the passes and a minimal runtime library that will catch some subset of the bugs. Once that is in, rinse and repeat for the next piece and the next piece etc. This is a really long road for this much code, but particularly if it is delivering concrete value and there is a client that will be using it, it is the right way to get it into the project.
The problem is that nearly all of the transforms I submitted *are*
needed to get anything useful. There are two passes to insert run-time
checks, three or four to instrument the code to register memory objects
with the run-time, one to optimize trivial checks (which is essential to
getting tolerable performance), and one utility pass that supports
Out-of-Bounds pointer rewriting (this allows GEPs to go out-of-bounds as
long as the resulting pointer isn't dereferenced). A smaller patch
would jettison one of the check instrumentation passes and the
Out-of-Bounds transform, both of which are trivial passes.
The only real "bloat" in the patch is the run-time library. Personally,
I'm not convinced that it's the best use of time to rip parts of it out
and then re-add those parts back in later.
> 2. You can rework the high level hooks of this to use the new PassManagerBuilder.h interface to "hook into" extension points in the standard pass manager, rework this code to be a clang plugin, and keep the code available in the safecode.llvm.org repo. Doing this engineering work (to make it trivial to turn on with a stock clang) is really required for #1 anyway. This will make it much easier for people to try safecode out, and we're pushing to split other parts of the codebase out into plugins as well (e.g. the clang static analyzer and the ARC migration tool), so hopefully that plugin infrastructure will continue to develop.
Is there documentation on how to do this? How mature is this plugin
infrastructure? How many other tools (e.g., clang static analyzer) are
using it now?
Provided that configuring Clang to pick up and use the plugin is simple,
this may be an attractive option as it seems to combine the best of both
worlds (easy-to-use SAFECode while keeping SAFECode development fast).
> I guess option #3 is to maintain the status quo and keep safecode in its own repo right now. Depending on what your goals are, this is actually a very attractive option for the same reasons that LLDB being in its own repo is: you can choose when and how to synch up to llvm (lldb just uses a "current recommend revision # of llvm), being a subproject allows you to move more quickly and make changes when it makes sense to you.
I am (painfully?) aware of the pros and cons of keeping code out of
tree. I agree that keeping SAFECode in its own tree makes development
much faster (which I think is valuable) at the cost of chasing LLVM's
ever-changing API. However, SAFECode differs from LLDB in that LLDB is
its own stand-alone tool while SAFECode is essentially an extension of
an existing compiler. Since SAFECode has not been well-integrated into
the compiler, it is too cumbersome to use.
>
> Anyway, I guess my message is that the route forward depends on what your goals are. If your goal is to get it into mainline LLVM to get more users, then I think that things are backwards: we really want there to be end users driving things into the compiler, not the other way around. If you're looking to reduce the cost of merging mainline API changes, there are *many* opportunities to adopt the C API in these passes, which would greatly reduce the amount of churn required to keep stuff working. Even though there isn't C API for everything, there is for the most volatile stuff.
Making SAFECode easier for people to use is the primary goal. A
secondary goal is to have less code maintenance on our end. Much of our
maintenance is keeping up with LLVM API changes; if we don't work with
mainline or the last release, we quickly fall into irrelevance. A third
goal is to provide a common infrastructure upon which memory safety type
tools can be built. It would be nice, for example, to be able to use
the same optimizations (e.g., static array bounds checking) for
different memory safety implementations.
> I hope this helps,
It's been very insightful. Thanks again for the review.
I think we'll seriously explore the option of making SAFECode a Clang
plugin. If you could direct us to documentation on how to do that,
that'd be great.
While my other questions on the patch may not be as important if we go
the plugin route, I'd like to know simply for my own edification (in
case I need to submit a patch on something else in the future).
Thanks again, Chris.
-- John T.
>
> -Chris
>
>
>
>
>
More information about the llvm-commits
mailing list