[cfe-commits] Patch to do static analysis with range sets

Mon Feb 9 20:13:51 PST 2009

Hi Ben,

I have a fairly long response to this one.  I apologize in advance if  
it comes off as a ramble!

On Feb 8, 2009, at 6:57 AM, Ben Laurie wrote:

>> It also looks like you are modifying
>> the RangeSet object in place. In the static analyzer, we use  
>> functional
>> data structures that preserve old data so that each GRState object  
>> (which
>> represents the analysis state at a particular point within a path)  
>> never
>> have their values mutated after the GRState object is created.   
>> This allows
>> us to faithfully keep full path information.  This allows us to  
>> report full
>> path diagnostics when reporting bugs but also (from a correctness
>> standpoint) allows path caching/merging to work at all.
>
> Wow. Doesn't it turn out to be horribly expensive?

Potentially.  I'll elaborate on some key design points.

First off, the interface to the public methods of ConstraintManager  
are fairly straightforward.  All data coming in is immutable, and all  
data going out is immutable.  It essentially follows a functional  
programming design.

Second, per your comment on keeping all state values around being  
"horribly expensive", path-sensitive static analysis algorithms, where  
we reason precisely about individual paths, have worst case  
exponential time complexity.  This means that while constant factors  
remain important, real scalability comes with clever algorithms that  
reduce the analysis complexity for most practical inputs.  We employ a  
bunch of tricks to keep the time complexity low (and bounded) while  
also keeping the space complexity bounded as well.

While there are many ways to do path-sensitive analysis, the approach  
we are using is essentially symbolic execution.  Each path is  
"simulated" using abstract program values.  The collection of these  
values, which represents bindings between variables and abstract data,  
is called a "state" (represented in the analyzer by the object  
GRState).  As we simulate a path, we evaluate the effects of an  
expression and update the values of the state.

Because we are reasoning about symbolic values, at branches it is  
often the case that both branch conditions are feasible, with the path  
splitting into two paths at the branch.  If GRState is an imperative  
data structure, such as std::set or std::map, we essentially would  
need to make a copy of GRState for each branch and then continue  
simulating along either path.  We could do this simulation in a DFS  
fashion through the function's control-flow graph, gradually updating  
the state along a path until we hit (a) the end of the function or (b)  
a bug that for all intensive purposes ends simulation along that  
path.  Once we hit the end of the path, we could simulate an alternate  
path resulting from a split at a branch earlier in the simulation.

The algorithm I described has exponential time complexity, as the  
number of paths potentially doubles at each branch.  One natural  
optimization is to stop execution of a path when you hit the same  
"state" at the same location within a function.  This can be done  
because we are reasoning about symbolic values, and thus if we hit a  
location with the same state that we encountered along a different  
path we can just "cache out" because all further simulation would be  
the same as the other path.  While intuitively simple, this  
optimization, when done right, can transform the exponential search  
algorithm into something that is quite reasonable in practice.

One could employ this caching approach when using imperative state  
values, but it gets tricky.  It means that one needs efficient ways to  
compare a current state against the state of previously analyzed  
paths.  This either requires keeping full state information around to  
do this comparison precisely, or to do an approximate solution such as  
hashing state values and comparing states using their hashes.  In the  
analyzer, we keep full state information around, allowing us to  
accurately compare paths against previously explored paths.  This  
eliminates subtle bugs caused by the approximate comparison of states  
and also means we have a great deal of information around when decided  
when two paths are "the same."  (equality is only one way to compare  
states; one can also reason about how one state "subsumes" another).

Another reason why keeping full state values around is for error  
reporting.  Using an imperative approach (where we mutate a state in  
place) causes us to forget the state of the simulation along a path.   
If we do a DFS traversal of the state graph, we could potentially keep  
all the states along a particular path around (e.g., by making copies  
of the state at each subsequent expression) and if we encounter a bug  
we can present the full path information.  Making copies of states  
isn't very efficient, and if you only record the diffs you are  
essentially using a functional data structure (more on this later).   
If you want to find the shortest path to a bug (which can be very  
helpful when presenting a bug to a user), ultimately you need to  
reason about the data along multiple paths.  This can become expensive  
if one doesn't employ the right data structures that exploits the  
commonality between states.

The way Clang's static analysis engine does path exploration is by  
using functional data structures for its state.  Each GRState object,  
once created, is immutable.  When you do the processing to explore a  
new path, you will never trounce the data you have accumulated for  
paths you have already explored.  From a correctness perspective, this  
makes it much easier to write complex analyses.  You never are in risk  
of invalidating iterators and scribbling on state you meant to keep.   
Everything is reasoned in terms of input and output and rarely, if  
ever, side-effects.  Further, the functional data structures we employ  
share most data, so you are really only recording the diffs between  
states (which is very efficient).

Note that we only keep full path information on a per-function basis.   
When doing inter-procedural analysis, we will need to employ something  
like a summary-based approach (where we build up pre- and post-  
condition models for functions) that allow us to reason about inter- 
procedural paths.

To summarize, here are a few big points of why keeping the full  
simulation state (for a function) is useful:

(a) It is *extremely* useful for diagnosing bugs by being able to  
render full information concerning a path to a user.  With the full  
state graph, we can tell users all paths where the same bug (e.g., a  
leak) occurs, or use that information to determine how likely a  
candidate bug is real by examining the nature of the paths it occurs  
on.  More generally, information thrown away is very difficult to  
recover.

(b) Functional data structures use a lot of data sharing, so the  
incremental cost of saving this extra information is pretty minimal  
and very efficient (i.e., most state stays the same between evaluated  
expressions).

(c) Meta-level checkers can be written on top of the explored state  
space graph.

(d) We can do some very elaborate caching/pruning of paths by  
explicitly recording the explored state space graph.

(e) The invariants are simple: don't scribble on old data.  This makes  
reasoning about the correctness of the analyzer more "inductive" and  
compositional.  I cannot stress how important this is, as the nature  
of path-sensitive analysis quickly gets complicated enough that  
without such compositional reasoning there is little hope and getting  
things right.

Prior to working the static analysis component of Clang, I worked on a  
few different static analysis systems, many based on the approach of  
symbolic execution.  I generally found that implementing these tools  
in imperative languages like C often forced an implementation based on  
imperative thinking where a bunch of information was always thrown  
away because it was deemed "too hard" or "too expensive" to keep  
around.  Later, when I implemented some of the same concepts using  
functional languages (e.g., OCaml), I found that because most of the  
symbolic information between states is usually the same it was (a)  
cheap and easy to keep full state information around and (b) it made  
the implementation much simpler and more resilient because all data  
was immutable.  When I started work on the analyzer in Clang, I wrote  
the ImmutableMap and ImmutableSet data structures (which were inspired  
by the Map and Set modules in  OCaml) so that we could harness these  
concepts in an analyzer written in C++.

While I have not done extensive performance evaluation, so far the  
approach appears to work well, and the core of the analyzer was able  
to be brought up in a fairly short time.  The data structures  
ImmutableMap and ImmutableSet use a fast pool allocator to allocate  
all shared state (i.e., the maps and sets) created by a single  
"factory" object.  Once one is done analyzing a function, all of that  
memory is reclaimed immediately without running any destructors,  
calling 'delete' or 'free' on each object, etc.

The ImmutableSet and ImmutableMap classes also provide referential  
canonicalization.  For example, creating an ImmutableMap where I first  
insert "3" and then "4" leaves me with the same map (as in they can be  
compared using a single pointer comparison to see if they are the  
same) if I did an insert of "4" and "3" with any number of removals or  
insertions in between.

Finally, the use of functional data structures and functional  
reasoning in the analyzer also opens up some long term possibilities.   
The analyzer can be potentially be highly parallelized, with different  
paths within a function explored simultaneously on different threads.   
Because data is immutable, one can worry substantially less about one  
thread trouncing on the data and invariants of another thread.

I apologize if this was a bit of a ramble, and I would be happy to  
talk about more specific points.

Ted