[llvm-commits] Updated SAFECode Patch

Fri Jul 22 10:51:38 PDT 2011

On Jul 22, 2011, at 8:34 AM, John Criswell wrote:
>> Here are a few micro-level comments:
>> 
>> * Passes like InsertGEPChecks should be in a .cpp file that matches the pass name, and don't need the class to be exposed in a public header file (for the same reason the SCCP class (for example) is private to its own .cpp file).
> 
> I made my transform pass header files public so that any tool can create an instantiation of any of the SAFECode passes (to enable code reuse), and I advise graduate students at Illinois to do the same.  Is there a better way of allowing code to create an instantiation of a pass without #including a file with the pass's class definition?

Just follow the standard pattern used by the other passes, you only need to expose an "initialize" function and a "create" function (which returns an instance of your pass).  SCCP and InstCombine and many other passes follow this pattern.

>> * There is some '#if 0' code that should be removed.
> 
> Some code is disabled because it doesn't work without other SAFECode optimizations that I did not include in the initial patch.  If I resubmit the patch (I'll discuss that below), do you still want such code removed, or would comments indicating why it's disabled suffice?

If the patch is a proposal for inclusion in mainline, it should not have #if 0 code.

>> * There are lots of minor coding style issues, such as the extra spaces in the code, some 80 col violations etc:
> 
> Could you clarify what you mean by extra spaces?  Do you mean blank lines, incorrect indentation, or something different?  I just want to make sure I understand what is different/wrong.

As one simple example, this:

+static inline Value *
+peelCasts (Value * PointerOperand, std::set<Value *>  &  Chain) {
+  Value * SourcePointer = PointerOperand;

Should be written as:

+static Value *peelCasts(Value *PointerOperand, std::set<Value*> &Chain) {
+  Value *SourcePointer = PointerOperand;

(but without the std::set :)

>> +static inline Value *
>> +peelCasts (Value * PointerOperand, std::set<Value *>  &  Chain) {
>> +  Value * SourcePointer = PointerOperand;
>> 
>> * Inefficient containers like std::set and std::map are used pervasively, as are verboten classes like std::ostream.
> 
> Is using the LLVM classes required or merely suggested?  I tend to code using the standard C++ containers because 1) I've run into weird behavior with the LLVM containers in the past; and 2) the C++ containers always have functioning iterators; and 3) they're fast enough for what these transforms are doing.

We expect all code in llvm to follow best practices.  This is an evolving thing, but we really want someone to be able to read a random part of the code base and have it be in a consistent style.

> Are these small bits of OS dependency a problem for inclusion into LLVM, or were you thinking that I had used OS-specific features unnecessarily?

OS dependency isn't a problem, but it should be factored into the lib/Support library, and generalized to support the other targets that LLVM supports where appropriate.

> Regarding the header files, is it a problem to have a run-time that uses unistd.h, ernno.h, etc.?  I think these are standard POSIX header files.

unistd.h is not posix, and is not available on window for example.

>> * Why do you have an implementation of a SplayTree?  Why is it better than a hash table or other data structure?
> 
> For the run-time checks, the question we are asking is whether a pointer falls within a valid object.  Therefore, what we need is a container that contains a set of ranges (lower and upper bounds on memory objects) and can be searched (i.e., given a value, does it fall within a range within the container).

Ok, the implementation should move to llvm/ADT (if it wasn't already), with patches to llvm/docs/ProgrammersManual.html then.  Did you guys originate the code?  It looks like a very non-standard style, I'm concerned about the origin of the code.

>> * I see that the runtime library contains a bunch of pool allocation related stuff, is that used by the transformation patches or used by other parts of safecode that can't be included because they depend on DSA?
> 
> The pool allocation functions are needed for those that choose to use Automatic Pool Allocation (which will remain out-of-tree because it depends on DSA) to remove run-time checks and enforce sound points-to analysis and type-inference results.  There are also a few run-time functions used for features not included yet in the patch (e.g., the dangling pointer detection system which is enabled by a separate transform).

Ok, they should be removed if this goes into mainline.  Beyond patent concerns, APA is not production quality, and I don't think it will be without a rewrite.

>> * Where is the documentation on how to use this? :-)  What are the features and limitations of this work in the absence of DSA and the other pieces?  You've made claims that this is generally useful for partial applications and other code, is this really true how does it work?
> 
> Sorry, I overlooked documentation.  We have documentation on using SAFECode, but it will need to be updated since it describes how to use SAFECode using the sc tool, and one of the reasons to move SAFECode into LLVM is to avoid having to use the sc tool.
> :)

Right, it would be really nice for it to be transparently integrated into clang (for example) with a flag or plugin.

> Since the instrumentation in the patch is designed to work on separation compilation units, the checks it inserts are "incomplete" checks.  This is done because, without whole program analysis, the compiler cannot tell if the pointer points to memory objects that are always created by the program or if it could point to memory objects allocated by external library code.

>From a high level, how does the inserted code distinguish between an invalid pointer and a pointer to memory allocated from some code not visible to the transformation pass?  Does this handle inttoptr and ptrtoint?

>  When a check is incomplete, the above checks are relaxed as follows:
> 
> a) Load/Store checks: In production mode, the check is completely ignored since failing to find a valid memory object doesn't imply that the pointer is invalid (e.g., the pointer could be pointing to a stack memory object allocated by external code).  In debugging mode, it prints a warning.

I wasn't aware that there are two modes.  Printing a warning seems not-very useful if you're integrating with a large scale codebase that you're not able to modify.  For example, imagine building clang itself with safecode but not llvm: you'd get a ton of useless warnings every time LLVM IR objects were touched by clang IRGen.

> b) GEP checks: In both production mode and debugging mode, an incomplete GEP check will first see if the source pointer (the pointer operand to the GEP) is within a memory object.  If so, it makes sure the result of the GEP is within the same object.  If it can't find the memory object, it does nothing.

This approach seems highly limited when you don't have (nearly) the whole program.  Is this approach really useful without LTO that provides almost the entire program?

>> Anyway, those are some thoughts on the first pass through the code.  As it is, the code isn't ready to go in, and even if it were, it would be impossible to really review as a monolith.  Beyond that, I have my concerns about who is going to maintain and push forward this code going forward.  I'm not accusing you guys of it, but I have a general concerned about "abandonware" in the codebase, and do routinely purge apparently dead passes from the codebase.  Do you have particular clients that are actively using this and will defend against me not purging it in a future release?
> 
> Well, this is the catch-22 situation that I think we find ourselves in.  We think that people are interested in memory safety tools like ours; a few people on llvmdev have expressed interest in memory safety tools, and many people use valgrind.  However, we believe that people don't try out or use tools like SAFECode in their current form because they're not convenient (i.e., not integrated into the front-end compiler with a simple command-line switch).  Likewise, it's difficult for us to get our code integrated into LLVM/Clang because no one uses it.
> :)
> 
> That said, we at Illinois will be a strong client since we are having SAFECode evaluated over the next couple of years as part of a research program we're in.  Having better Clang/SAFECode integration in some form or another is essential to what we're doing.

Getting the code used is somewhat orthogonal from where it lives.  For example, you guys could produce "official" builds that align with llvm.org releases (e.g. safecode 2.9, safecode 3.0 etc).  This would be even better if it were refactored as a clang plugin, so your message could be "drop in this plugin and pass this extra flag to the compiler".

> The only real "bloat" in the patch is the run-time library.  Personally, I'm not convinced that it's the best use of time to rip parts of it out and then re-add those parts back in later.

As far as getting things integrated into mainline, the best thing you can do is to cut things down into a series of obvious patches.  However, given the description of what you're trying to achieve, it really sounds like the best approach is to keep safecode as a separate SVN module and work on making it integrate with the toolchain more directly.  Telling someone to add a "-load-plugin=safecode.so" flag to their CFLAGS or something would make it much easier to try out.

>> 2. You can rework the high level hooks of this to use the new PassManagerBuilder.h interface to "hook into" extension points in the standard pass manager, rework this code to be a clang plugin, and keep the code available in the safecode.llvm.org repo.  Doing this engineering work (to make it trivial to turn on with a stock clang) is really required for #1 anyway.  This will make it much easier for people to try safecode out, and we're pushing to split other parts of the codebase out into plugins as well (e.g. the clang static analyzer and the ARC migration tool), so hopefully that plugin infrastructure will continue to develop.
> 
> Is there documentation on how to do this?  How mature is this plugin infrastructure?  How many other tools (e.g., clang static analyzer) are using it now?

Doug gave a talk at boostcon about doing ast transformations with it, but no it is not very mature at all.  The static analyzer is moving to it, but there are no widely used plugins yet.

>> I guess option #3 is to maintain the status quo and keep safecode in its own repo right now.  Depending on what your goals are, this is actually a very attractive option for the same reasons that LLDB being in its own repo is: you can choose when and how to synch up to llvm (lldb just uses a "current recommend revision # of llvm), being a subproject allows you to move more quickly and make changes when it makes sense to you.
> 
> I am (painfully?) aware of the pros and cons of keeping code out of tree.  I agree that keeping SAFECode in its own tree makes development much faster (which I think is valuable) at the cost of chasing LLVM's ever-changing API.  However, SAFECode differs from LLDB in that LLDB is its own stand-alone tool while SAFECode is essentially an extension of an existing compiler.  Since SAFECode has not been well-integrated into the compiler, it is too cumbersome to use.

I see where you're coming from, but there are many similar useful extensions to the compiler.  I really think that developing the plugin API to do what safecode and others need is the right approach.

>> I hope this helps,
> 
> It's been very insightful.  Thanks again for the review.
> 
> I think we'll seriously explore the option of making SAFECode a Clang plugin.  If you could direct us to documentation on how to do that, that'd be great.

I'm not an expert on this, please ask on cfe-dev.  I honestly don't know how far along it is, just that we *have* to get there one way or the other.

-Chris