[cfe-dev] Static analysis tool development
Ted Kremenek
kremenek at apple.com
Fri Jan 16 15:42:34 PST 2009
On Jan 16, 2009, at 8:46 AM, Monty Zukowski wrote:
> Ben Laurie of Google has up to $50K to spend on a pilot project to
> improve the state of static analysis of C code for open source
> projects. Among other things Ben is involved with the OpenSSL project
> and has tried some static analyzers such as Deputy, and Cyclone (which
> is also a language extension of C), and has noted various problems and
> limitations with these tools.
Hi Monty, Ben,
It's great to hear about your interest! Since some of your questions
are fairly broad, I have tried to respond to them with varying
levels of detail. Please do not hesitate to follow up with
specific questions.
> The goal of this pilot project is to get a static analyzer tool
> developed/modified so that it is truly useful to the open source
> community and can become a standard part of the development process.
> The ability to customize the analysis is strongly desired. For
> instance, after a security exploit is reported people might want to
> review the rest of the code for the same problem. An analyzer that
> helped do that would be quite beneficial.
>
> If the pilot project goes well then additional funding is possible.
From my perspective there are three kinds of extensibility when it
comes to analysis tools:
[1] The analyzer core can be modified or extended with new program
analysis algorithms to meet the needs of different clients (e.g.,
provide the flexibility to tradeoff between accuracy and computational
complexity, etc.).
[2] The analyzer core can be reused in a variety of contexts, such as
in a standalone command-line tool or an IDE.
[3] The analyzer can be extended with new sets of "checks" without
invasively modifying the analyzer internals. Such extensibility can
be provided at different layers of abstraction over the core
analyzer API, with some checks possibly being written in a very
high-level manner (e.g., declaratively) while other checks require
more direct access to the core analyzer APIs. Both ends of the
spectrum are important because some checks will require sophisticated
algorithms with deep reasoning about the program while others might be
much simpler and just key off of common interfaces and APIs.
The analyzer core in Clang is intended to (eventually) meet all three
goals. So far efforts have focused on [1] and [2], although there has
been some work on [3] with regards to checks that directly use the
analyzer's C++ APIs. Providing higher-level interfaces to declaring
checks is a great long term goal, but my feeling is that it makes more
sense to mature the core technology first.
As a high-level design strategy, the analyzer core is written as a
library (libAnalysis). This makes it inherently much easier to reuse
in different settings, which directly addresses goal [2]. The
structure of libAnalysis is fairly modular, with a set of key abstract
interfaces put in place to allow us to swap in and out different key
components of the analyzer. For example, the engine that reasons
about feasible paths within a function uses a "ConstraintManager"
interface that is responsible for reasoning about the possible
values along a path. Different implementations of ConstraintManager
can then provide different levels of precision depending on the
needs of clients.
Another high-level goal of the analyzer is to support the relaying of
rich diagnostics to end-users about how a bug manifests in their
program. The diagnostic reporting mechanism in the analyzer also uses
a set of abstract interfaces so that bug reports can be rendered in a
variety of ways (e.g., to the console, to an HTML page, within an IDE,
etc.). Providing rich diagnostics is an important goal because
without them the results of a static analysis algorithm are only
useful to graduate students studying program analysis techniques
rather than programmers who want to fix bugs.
> From the research I have done, clang seems to be the best front end
> for this type of project. I have some questions:
>
> 1) What is the state of the static analyzer? Where do I start
> learning what needs to be done on it? Is there a long term plan for it?
The talk I gave last August provides a broad overview of the project
and its state:
Finding Bugs with the Clang Static Analyzer
http://llvm.org/devmtg/2008-08/
More information on using the analyzer can be found at:
http://clang.llvm.org/StaticAnalysis.html
There is currently no real "internals" documentation on the analyzer,
which of course is something that needs to be written. Perhaps the
easiest way to look at the state of the analyzer is to look at the
implementation of specific checks in the lib/Analysis directory of the
Clang sources. Another good starting point is the file
Driver/AnalysisConsumer.cpp; this file corresponds to the code in the
driver that invokes the static analysis algorithms on a specific
source file.
At a high level the analyzer consists of two analysis "engines":
[a] A flow-sensitive, intra-procedural dataflow engine.
[b] A path-sensitive, intra-procedural (and eventually inter-
procedural) dataflow engine.
FLOW-SENSITIVE ENGINE
Engine [a] mirrors the logic that typically goes on for flow-sensitive
compiler checks, e.g.: live variables analysis, basic uninitialized
variable checking, etc. The analyzer also has a "dead stores" checker
that is based on [a]. The benefit of [a] is that the running time of
an analysis is linear. The downside of [a] is that information can be
lost where paths merge (e.g., at confluence points after branches).
Information on the theory and algorithms of [a] can be found in fairly
standard compiler texts such as:
Advanced Compiler Design and Implementation
author: Muchnick
Compilers: Principles, Techniques and Tools ***Second Edition***
authors: Aho, Lam, Sethi, Ullman
In libAnalysis, two analyses use [a]: LiveVariables.cpp and
UninitializedValues.cpp.
PATH-SENSITIVE ENGINE
Engine [b] is a "path-sensitive" dataflow analysis engine which
essentially is a symbolic simulator. This engine reasons about path-
specific bugs such as null-dereferences, memory leaks, etc. It's
basically the core technology of what we generally equate with an
"advanced" static analysis tool.
At a high-level, [b] reasons about the "reachable states" of a program
by exploring an abstract state-space graph (represented as an
"exploded graph"; ExplodedGraph.[h,cpp]). A "bug" is essentially a
path that reaches a state that is considered to be "bad" (e.g., a null
pointer was dereferenced). States are simply a collection of symbolic
values representing an abstraction of a program's state at a
particular location in the program.
Operationally, the analyzer essentially interprets each expression in
a program's source code and reasons about the "state transitions" by
considering the effect of an expression on a given state. From
dataflow theory, the theoretical parlance for this operation is called
a "transfer function", but from an engineering standpoint it is
essentially a visitor function that takes as input a state and an
expression and returns a new state.
The analyzer design is inspired by, and similar to (but in many ways
quite different from), the algorithms and ideas in several well-known
papers:
Precise interprocedural dataflow analysis via graph reachability
http://doi.acm.org/10.1145/199448.199462
Checking system rules using system-specific, programmer-written
compiler extensions
http://portal.acm.org/citation.cfm?id=1251229.1251230
Symbolic path simulation in path-sensitive dataflow analysis
http://doi.acm.org/10.1145/1108792.1108808
(and several others)
From an engineering perspective, the trick is making the transfer
function modular in construction, so that some aspects handle basic
symbolic interpretation such as the manipulation of basic values
(i.e., determine the result of "i + 10") while other aspects are
"checker specific" and reason about things such as the state of a
lock, a file handle, whether or not a piece of data is considered
"unsafe", and so forth. The modular design of the transfer function
interface is one important aspect of the analyzer that needs to be
matured, although what's there isn't too bad of a starting point.
This component is so important because it represents the core axis of
flexibility for writing a diversity of checks, and thus maturing it
is necessary in order to think about how to write checks using
different high-level interfaces.
Other important components of [b] include:
(1) ConstraintManager - the module that reasons about constraints on
symbolic values
(2) StoreManager - the module that reasons about the bindings of
variables to values. This is essentially the symbolic representation
of the heap/stack.
The current implementation of ConstraintManager is
"BasicConstraintManager", which tracks only basic equality and
inequality relationships (i.e., "a != 10", "b == 4"). This is
suitable for most predicates in system code (and for catching things
like null dereferences) but not all. One goal is to implement another
ConstraintManager that does basic range analysis (i.e., track "a >
b"). This would be especially useful for buffer overflow detection.
For StoreManager, we are actively using BasicStoreManager, which just
tracks bindings for variables on the stack. Zhongxing Xu has put a
fair amount of work into developing a new StoreManager called
"RegionStoreManager" that will extend the analyzer's abilities to
reason about the values of fields, arrays, etc. We certainly invite
anyone who is interested in helping complete this feature to get
involved.
Probably the other biggest point worth mentioning is that the path-
sensitive engine is currently restricted to function-at-a-time
analysis. Doing inter-procedural analysis is something that is
definitely planned, but requires different levels of infrastructure
depending on the scope of the analysis. In the most general case we
need to be able to perform whole-program analysis across source
files. This requires being able to build an entire image of the
program (including information about linkage so we know the exact
corresponding definitions for called functions). There are a
variety of ways to tackle this task, and this is a major piece of
infrastructure that certainly could benefit from anyone wishing to
help get involved.
> 2) The ability to add custom logic to the analyzer is quite desirable.
> Perhaps this could be made easier by integrating with a scripting
> language like Python. To me, even the ability to write a script to
> pattern match against the AST or other intermediate forms could be
> quite useful.
This is a great long term goal. My feeling is that the core analyzer
logic needs to converge a little further first so that checks can be
readily written using the C++ API. Afterwards using a higher-level
language like Python or Javascript seems like an excellent project.
> 3) Simply managing the volume of warnings can be difficult. I would
> like to integrate some method of tracking warnings from build to build
> to see what's new and perhaps to be able to prioritize what should be
> investigated first. This would probably be separate from the
> analyzer, but a useful front end will help the tool get adopted more
> readily.
"Issue tracking" would be useful for a wide range of clients. To do
this well, however, I don't think it can be done completely separately
from the analyzer. Issue tracking requires rich diagnostics that
describe
a bug but also must be insensitive to code churn that has no bearing
on a particular bug or false positive. In some cases this may require
issuing queries to the analyzer to extract more information. As a
first cut, however, the output could be based entirely on the
diagnostics.
As an interesting side note, the developers of Adium
(http://www.adiumx.com) have actually been regularly scanning their
sources using Clang's static analyzer. As part of their process, they
wrote some scripts that inspect the emitted diagnostics (in this case
HTML files) and do some munging to determine whether a report emitted
on a given scan is the same as one from a prior scan. Their goal is
to avoid re-inspecting
previously diagnosed bugs and false positives. Having a general
mechanism in the analyzer library would be quite useful, and would be
a good first order project for anyone who is interested in getting
involved.
> 4) Annotations can be helpful to guide an analyzer. How difficult
> would it be to extend the parser to accept a simple annotation syntax?
C and its derivatives are complex languages, so it really depends on
the annotation syntax. For example, it is really easy to add GCC-
style attributes to Clang (whose syntax could then be wrapped by
macros):
http://gcc.gnu.org/onlinedocs/gcc/Attribute-Syntax.html
The analyzer already exploits a variety of attributes such as
"nonnull", "unused", "IBOutlet" (the latter being added for an
Objective-C check the analyzer performs), and new ones could readily
be added as well. It all depends on what you want to express.
> I am open to collaborating on this project if anyone is available.
That would be wonderful. My suggestion is to start with a simple set
of annotations for which the corresponding checks are not too
complicated. The design can then iterate from there. There has been
a lot of work on annotation systems, and probably the most traction is
to start with annotations that would have the widest practical benefit
(i.e., low cost of adoption, easy implementation, etc.).
> Also, if you would like to learn more about this project or submit
> your own proposal, feel free to contact "Ben Laurie"
> <benl at google.com>.
If you wish to discuss more aspects of the analyzer I highly recommend
you subscribe (if you haven't already) to cfe-dev. For those who wish
to get directly involved in development, we discuss patches on cfe-
commits.
I mentioned a fair number of points in this email regarding directions
for the static analyzer. Two particularly important action points
(which are fairly accessible to anyone wishing to get involved) are:
1) improving the web app to manage false positives
2) run-to-run tracking of analysis results
Both points go right to the issue of user workflow and the overall
usability of the analyzer, and don't require (at least initially)
having an intimate knowledge of the analyzer's internals or its
algorithms.
If you have any specific questions, please do not hesitate to ask.
Best,
Ted