[cfe-commits] Patch to do static analysis with range sets
Ted Kremenek
kremenek at apple.com
Mon Feb 9 20:13:51 PST 2009
Hi Ben,
I have a fairly long response to this one. I apologize in advance if
it comes off as a ramble!
On Feb 8, 2009, at 6:57 AM, Ben Laurie wrote:
>> It also looks like you are modifying
>> the RangeSet object in place. In the static analyzer, we use
>> functional
>> data structures that preserve old data so that each GRState object
>> (which
>> represents the analysis state at a particular point within a path)
>> never
>> have their values mutated after the GRState object is created.
>> This allows
>> us to faithfully keep full path information. This allows us to
>> report full
>> path diagnostics when reporting bugs but also (from a correctness
>> standpoint) allows path caching/merging to work at all.
>
> Wow. Doesn't it turn out to be horribly expensive?
Potentially. I'll elaborate on some key design points.
First off, the interface to the public methods of ConstraintManager
are fairly straightforward. All data coming in is immutable, and all
data going out is immutable. It essentially follows a functional
programming design.
Second, per your comment on keeping all state values around being
"horribly expensive", path-sensitive static analysis algorithms, where
we reason precisely about individual paths, have worst case
exponential time complexity. This means that while constant factors
remain important, real scalability comes with clever algorithms that
reduce the analysis complexity for most practical inputs. We employ a
bunch of tricks to keep the time complexity low (and bounded) while
also keeping the space complexity bounded as well.
While there are many ways to do path-sensitive analysis, the approach
we are using is essentially symbolic execution. Each path is
"simulated" using abstract program values. The collection of these
values, which represents bindings between variables and abstract data,
is called a "state" (represented in the analyzer by the object
GRState). As we simulate a path, we evaluate the effects of an
expression and update the values of the state.
Because we are reasoning about symbolic values, at branches it is
often the case that both branch conditions are feasible, with the path
splitting into two paths at the branch. If GRState is an imperative
data structure, such as std::set or std::map, we essentially would
need to make a copy of GRState for each branch and then continue
simulating along either path. We could do this simulation in a DFS
fashion through the function's control-flow graph, gradually updating
the state along a path until we hit (a) the end of the function or (b)
a bug that for all intensive purposes ends simulation along that
path. Once we hit the end of the path, we could simulate an alternate
path resulting from a split at a branch earlier in the simulation.
The algorithm I described has exponential time complexity, as the
number of paths potentially doubles at each branch. One natural
optimization is to stop execution of a path when you hit the same
"state" at the same location within a function. This can be done
because we are reasoning about symbolic values, and thus if we hit a
location with the same state that we encountered along a different
path we can just "cache out" because all further simulation would be
the same as the other path. While intuitively simple, this
optimization, when done right, can transform the exponential search
algorithm into something that is quite reasonable in practice.
One could employ this caching approach when using imperative state
values, but it gets tricky. It means that one needs efficient ways to
compare a current state against the state of previously analyzed
paths. This either requires keeping full state information around to
do this comparison precisely, or to do an approximate solution such as
hashing state values and comparing states using their hashes. In the
analyzer, we keep full state information around, allowing us to
accurately compare paths against previously explored paths. This
eliminates subtle bugs caused by the approximate comparison of states
and also means we have a great deal of information around when decided
when two paths are "the same." (equality is only one way to compare
states; one can also reason about how one state "subsumes" another).
Another reason why keeping full state values around is for error
reporting. Using an imperative approach (where we mutate a state in
place) causes us to forget the state of the simulation along a path.
If we do a DFS traversal of the state graph, we could potentially keep
all the states along a particular path around (e.g., by making copies
of the state at each subsequent expression) and if we encounter a bug
we can present the full path information. Making copies of states
isn't very efficient, and if you only record the diffs you are
essentially using a functional data structure (more on this later).
If you want to find the shortest path to a bug (which can be very
helpful when presenting a bug to a user), ultimately you need to
reason about the data along multiple paths. This can become expensive
if one doesn't employ the right data structures that exploits the
commonality between states.
The way Clang's static analysis engine does path exploration is by
using functional data structures for its state. Each GRState object,
once created, is immutable. When you do the processing to explore a
new path, you will never trounce the data you have accumulated for
paths you have already explored. From a correctness perspective, this
makes it much easier to write complex analyses. You never are in risk
of invalidating iterators and scribbling on state you meant to keep.
Everything is reasoned in terms of input and output and rarely, if
ever, side-effects. Further, the functional data structures we employ
share most data, so you are really only recording the diffs between
states (which is very efficient).
Note that we only keep full path information on a per-function basis.
When doing inter-procedural analysis, we will need to employ something
like a summary-based approach (where we build up pre- and post-
condition models for functions) that allow us to reason about inter-
procedural paths.
To summarize, here are a few big points of why keeping the full
simulation state (for a function) is useful:
(a) It is *extremely* useful for diagnosing bugs by being able to
render full information concerning a path to a user. With the full
state graph, we can tell users all paths where the same bug (e.g., a
leak) occurs, or use that information to determine how likely a
candidate bug is real by examining the nature of the paths it occurs
on. More generally, information thrown away is very difficult to
recover.
(b) Functional data structures use a lot of data sharing, so the
incremental cost of saving this extra information is pretty minimal
and very efficient (i.e., most state stays the same between evaluated
expressions).
(c) Meta-level checkers can be written on top of the explored state
space graph.
(d) We can do some very elaborate caching/pruning of paths by
explicitly recording the explored state space graph.
(e) The invariants are simple: don't scribble on old data. This makes
reasoning about the correctness of the analyzer more "inductive" and
compositional. I cannot stress how important this is, as the nature
of path-sensitive analysis quickly gets complicated enough that
without such compositional reasoning there is little hope and getting
things right.
Prior to working the static analysis component of Clang, I worked on a
few different static analysis systems, many based on the approach of
symbolic execution. I generally found that implementing these tools
in imperative languages like C often forced an implementation based on
imperative thinking where a bunch of information was always thrown
away because it was deemed "too hard" or "too expensive" to keep
around. Later, when I implemented some of the same concepts using
functional languages (e.g., OCaml), I found that because most of the
symbolic information between states is usually the same it was (a)
cheap and easy to keep full state information around and (b) it made
the implementation much simpler and more resilient because all data
was immutable. When I started work on the analyzer in Clang, I wrote
the ImmutableMap and ImmutableSet data structures (which were inspired
by the Map and Set modules in OCaml) so that we could harness these
concepts in an analyzer written in C++.
While I have not done extensive performance evaluation, so far the
approach appears to work well, and the core of the analyzer was able
to be brought up in a fairly short time. The data structures
ImmutableMap and ImmutableSet use a fast pool allocator to allocate
all shared state (i.e., the maps and sets) created by a single
"factory" object. Once one is done analyzing a function, all of that
memory is reclaimed immediately without running any destructors,
calling 'delete' or 'free' on each object, etc.
The ImmutableSet and ImmutableMap classes also provide referential
canonicalization. For example, creating an ImmutableMap where I first
insert "3" and then "4" leaves me with the same map (as in they can be
compared using a single pointer comparison to see if they are the
same) if I did an insert of "4" and "3" with any number of removals or
insertions in between.
Finally, the use of functional data structures and functional
reasoning in the analyzer also opens up some long term possibilities.
The analyzer can be potentially be highly parallelized, with different
paths within a function explored simultaneously on different threads.
Because data is immutable, one can worry substantially less about one
thread trouncing on the data and invariants of another thread.
I apologize if this was a bit of a ramble, and I would be happy to
talk about more specific points.
Ted
More information about the cfe-commits
mailing list