[cfe-dev] Static analysis tool development

Ted Kremenek kremenek at apple.com
Fri Jan 16 15:42:34 PST 2009


On Jan 16, 2009, at 8:46 AM, Monty Zukowski wrote:

> Ben Laurie of Google has up to $50K to spend on a pilot project to
> improve the state of static analysis of C code for open source
> projects.  Among other things Ben is involved with the OpenSSL project
> and has tried some static analyzers such as Deputy, and Cyclone (which
> is also a language extension of C), and has noted various problems and
> limitations with these tools.

Hi Monty, Ben,

It's great to hear about your interest!  Since some of your questions
are fairly broad, I have tried to respond with varying levels of
detail.  Please do not hesitate to follow up with specific questions.

> The goal of this pilot project is to get a static analyzer tool
> developed/modified so that it is truly useful to the open source
> community and can become a standard part of the development process.
> The ability to customize the analysis is strongly desired.  For
> instance, after a security exploit is reported people might want to
> review the rest of the code for the same problem.  An analyzer that
> helped do that would be quite beneficial.
>
> If the pilot project goes well then additional funding is possible.

From my perspective there are three kinds of extensibility when it
comes to analysis tools:

[1] The analyzer core can be modified or extended with new program
analysis algorithms to meet the needs of different clients (e.g.,
provide the flexibility to trade off between accuracy and
computational complexity, etc.).

[2] The analyzer core can be reused in a variety of contexts, such as  
in a standalone command-line tool or an IDE.

[3] The analyzer can be extended with new sets of "checks" without
invasively modifying the analyzer internals.  Such extensibility can
be provided at different layers of abstraction from the core analyzer
API, with some checks possibly being written in a very high-level
(e.g., declarative) manner while others require more direct access to
the core analyzer APIs.  Both ends of the spectrum are important
because some checks will require sophisticated algorithms with deep
reasoning about the program while others might be much simpler and
just key off of common interfaces and APIs (see the small example
after this list).
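
As a concrete illustration of the simpler end of that spectrum, a
check that merely keys off a well-known interface might flag code like
the following (an invented example, not a description of an existing
check in the analyzer):

   /* The FILE* returned by fopen() is never closed on one path. */
   #include <stdio.h>

   int count_lines(const char *path) {
     FILE *fp = fopen(path, "r");
     if (!fp)
       return -1;
     int lines = 0, c;
     while ((c = getc(fp)) != EOF)
       if (c == '\n')
         ++lines;
     if (lines == 0)
       return 0;        /* leak: fp is never fclose()'d on this path */
     fclose(fp);
     return lines;
   }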

The analyzer core in Clang is intended to (eventually) meet all three  
goals.  So far efforts have focused on [1] and [2], although there has  
been some work on [3] with regards to checks that directly use the  
analyzer's C++ APIs.  Providing higher-level interfaces for declaring
checks is a great long-term goal, but my feeling is that it makes more
sense to mature the core technology first.

As a high-level design strategy, the analyzer core is written as a
library (libAnalysis).  This makes it inherently much easier to reuse
in different settings, which directly addresses goal [2].  The
structure of libAnalysis is fairly modular, with a set of key abstract
interfaces put in place to allow us to swap in and out key components
of the analyzer.  For example, the engine that reasons about feasible
paths within a function uses a "ConstraintManager" interface that is
responsible for reasoning about the possible states of values along a
path.  Different implementations of ConstraintManager can provide
different levels of precision depending on the needs of clients.

Another high-level goal of the analyzer is to support the relaying of  
rich diagnostics to end-users about how a bug manifests in their  
program.  The diagnostic reporting mechanism in the analyzer also uses  
a set of abstract interfaces so that bug reports can be rendered in a  
variety of ways (e.g., to the console, to an HTML page, within an IDE,  
etc.).  Providing rich diagnostics is an important goal because  
without them the results of a static analysis algorithm are only
useful to graduate students studying program analysis techniques, not
to programmers who want to fix bugs.

> From the research I have done, clang seems to be the best front end
> for this type of project.  I have some questions:
>
> 1) What is the state of the static analyzer?  Where do I start
> learning what needs done on it?  Is there a long term plan for it?

The talk I gave last August provides a broad overview of the project  
and its state:

   Finding Bugs with the Clang Static Analyzer
   http://llvm.org/devmtg/2008-08/

More information on using the analyzer can be found at:

   http://clang.llvm.org/StaticAnalysis.html

There is currently no real "internals" documentation on the analyzer,  
which of course is something that needs to be written.  Perhaps the  
easiest way to look at the state of the analyzer is to look at the  
implementation of specific checks in the lib/Analysis directory of the  
Clang sources.  Another good starting point is the file Driver/ 
AnalysisConsumer.cpp; this file corresponds to the code in the driver  
that invokes the static analysis algorithms on a specific source file.

At a high level the analyzer consists of two analysis "engines":

[a] A flow-sensitive, intra-procedural dataflow engine.
[b] A path-sensitive, intra-procedural (and eventually inter- 
procedural) dataflow engine.


FLOW-SENSITIVE ENGINE

Engine [a] mirrors the logic that typically goes on for flow-sensitive
compiler checks, e.g., live variables analysis, basic uninitialized
variable checking, etc.  The analyzer also has a "dead stores" checker
that is based on [a].  The benefit of [a] is that the running time of
an analysis is linear.  The downside of [a] is that information can be
lost where paths merge (e.g., at confluence points after branches).

Information on the theory and algorithms of [a] can be found in fairly  
standard compiler texts such as:

   Advanced Compiler Design and Implementation
   author: Muchnick

   Compilers: Principles, Techniques and Tools ***Second Edition***
   authors: Aho, Lam, Sethi, Ullman

In libAnalysis, two analyses use [a]: LiveVariables.cpp and  
UninitializedValues.cpp.
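
As a concrete illustration (an invented example, not one taken from
the analyzer's test suite), these are the kinds of issues that the
dead stores and uninitialized-values checks report:

   #include <stdio.h>

   int f(int a) {
     int x = a * 2;      /* dead store: this value of x is overwritten
                            before it is ever read */
     x = a + 1;

     int y;
     if (a > 0)
       y = 10;
     printf("%d\n", y);  /* possibly uninitialized: y is never set on
                            the path where a <= 0 */
     return x;
   }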


PATH-SENSITIVE ENGINE

Engine [b] is a "path-sensitive" dataflow analysis engine which  
essentially is a symbolic simulator.  This engine reasons about path- 
specific bugs such as null-dereferences, memory leaks, etc.  It's  
basically the core technology of what we generally equate with an  
"advanced" static analysis tool.

At a high level, [b] reasons about the "reachable states" of a program
by exploring an abstract state-space graph (represented as an
"exploded graph"; ExplodedGraph.[h,cpp]).  A "bug" is essentially a
path that reaches a state that is considered to be "bad" (e.g., a null
pointer was dereferenced).  States are simply a collection of symbolic
values representing an abstraction of a program's state at a
particular location in the program.

Operationally, the analyzer essentially interprets each expression in
a program's source code and reasons about the "state transitions" by
considering the effect of an expression on a given state.  In dataflow
theory this operation is called a "transfer function", but from an
engineering standpoint it is essentially a visitor function that takes
as input a state and an expression and returns a new state.
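
To make that concrete, here is a small invented example of the kind of
path-specific bug the engine reasons about; the comments sketch how
the symbolic state evolves as each expression is interpreted:

   void process(int flag, int *buf) {
     int *p = 0;    /* state: p == 0                                 */
     if (flag)
       p = buf;     /* state on this path: p == (value of buf)       */
     if (!flag)
       *p = 1;      /* on the path where flag == 0 the constraint    */
                    /* "p == 0" still holds, so a null dereference   */
                    /* is reported along with the offending path     */
   }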

The analyzer design is inspired by, and similar to (though in many
ways quite different from), the algorithms and ideas in several
well-known papers:

   Precise interprocedural dataflow analysis via graph reachability
   http://doi.acm.org/10.1145/199448.199462

   Checking system rules using system-specific, programmer-written  
compiler extensions
   http://portal.acm.org/citation.cfm?id=1251229.1251230

   Symbolic path simulation in path-sensitive dataflow analysis
   http://doi.acm.org/10.1145/1108792.1108808

   (and several others)

From an engineering perspective, the trick is making the transfer
function modular in construction, so that some aspects handle basic
symbolic interpretation such as the manipulation of basic values
(e.g., determining the result of "i + 10") while other aspects are
"checker specific" and reason about things such as the state of a
lock, a file handle, whether or not a piece of data is considered
"unsafe", and so forth.  The modular design of the transfer function
interface is one important aspect of the analyzer that needs to be
matured, although what's there isn't too bad of a starting point.
This component is so important because it represents the core axis of
flexibility for writing a diversity of checks, and thus maturing it is
necessary in order to think about how to write checks using different
high-level interfaces.

Other important components of [b] include:

(1) ConstraintManager - the module that reasons about constraints on  
symbolic values
(2) StoreManager - the module that reasons about the bindings of  
variables to values.  This is essentially the symbolic representation  
of the heap/stack.

The current implementation of ConstraintManager is
"BasicConstraintManager", which tracks only basic equality and
inequality relationships (e.g., "a != 10", "b == 4").  This is
suitable for most predicates in system code (and for catching things
like null dereferences) but not all.  One goal is to implement another
ConstraintManager that does basic range analysis (e.g., tracking
"a > b").  This would be especially useful for buffer overflow
detection.

For StoreManager, we are actively using BasicStoreManager, which just  
tracks bindings for variables on the stack.  Zhongxing Xu has put a  
fair amount of work into developing a new StoreManager called  
"RegionStoreManager" that will extend the analyzer's abilities to  
reason about the values of fields, arrays, etc.  We certainly invite  
anyone who is interested in helping complete this feature to get  
involved.
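
A small invented example of why this matters: BasicStoreManager only
has bindings for stack variables such as "s" itself, so it has nothing
to say about the field "s->p" below, whereas a region-based store can
track it:

   struct conn { int *p; };

   void reset(struct conn *s) {
     s->p = 0;      /* a binding for the field region "s->p", not    */
                    /* for a plain stack variable                    */
     *s->p = 1;     /* with field-level tracking the store knows     */
                    /* s->p == 0 here: a null dereference            */
   }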

Probably the other biggest point worth mentioning is that the
path-sensitive engine is currently restricted to function-at-a-time
analysis.  Inter-procedural analysis is definitely planned, but it
requires different levels of infrastructure depending on the scope of
the analysis.  In the most general case we need to be able to perform
whole-program analysis across source files.  This requires being able
to build an entire image of the program (including information about
linkage so we know the exact corresponding definitions for called
functions).  There are a variety of ways to tackle this task, and this
is a major piece of infrastructure that could certainly benefit from
anyone wishing to get involved.

> 2) The ability to add custom logic to the analyzer is quite desirable.
> Perhaps this could be made easier by integrating with a scripting
> language like Python.  To me, even the ability to write a script to
> pattern match against the AST or other intermediate forms could be
> quite useful.

This is a great long-term goal.  My feeling is that the core analyzer
logic needs to converge a little further first so that checks can be
readily written using the C++ API.  Afterwards, using a higher-level
language like Python or JavaScript seems like an excellent project.

> 3) Simply managing the volume of warnings can be difficult.  I would
> like to integrate some method of tracking warnings from build to build
> to see what's new and perhaps to be able to prioritize what should be
> investigated first.  This would probably be separate from the
> analyzer, but a useful front end will help the tool get adopted more
> readily.

"Issue tracking" would useful for a wide range of clients.  To do this  
well, however, I don't think it can be done completely separate from  
the analyzer.  Issue tracking requires rich diagnostics that describe  
a bug but also must be insensitive to code churn that has no bearing  
on a particular bug or false positive.  In some cases this may require  
issuing queries to the analyzer to extract more information.  As a  
first cut, however, the output could be based entirely on the  
diagnostics.

As an interesting side note, the developers of Adium
(http://www.adiumx.com) have actually been regularly scanning their
sources using Clang's static analyzer.  As part of their process, they
wrote some scripts that inspect the emitted diagnostics (in this case
HTML files) and do some munging to determine whether a report emitted
on a given scan is the same as one from a prior scan.  Their goal is
to avoid re-inspecting previously diagnosed bugs and false positives.
Having a general mechanism for this in the analyzer library would be
quite useful, and it would be a good first-order project for anyone
who is interested in getting involved.

> 4) Annotations can be helpful to guide an analyzer.  How difficult
> would it be to extend the parser to accept a simple annotation syntax?

C and its derivatives are complex languages, so it really depends on  
the annotation syntax.  For example, it is really easy to add GCC- 
style attributes to Clang (whose syntax could then be wrapped by  
macros):

   http://gcc.gnu.org/onlinedocs/gcc/Attribute-Syntax.html

The analyzer already exploits a variety of attributes such as  
"nonnull", "unused", "IBOutlet" (the latter being added for an  
Objective-C check the analyzer performs), and new ones could readily  
be added as well.  It all depends on what you want to express.
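
For instance, the "nonnull" attribute can be attached to a function
declaration and hidden behind a project-specific macro so that
compilers that do not understand the attribute still accept the code
(the macro name below is just an illustration):

   #if defined(__GNUC__) || defined(__clang__)
   #  define MY_NONNULL(...)  __attribute__((nonnull(__VA_ARGS__)))
   #else
   #  define MY_NONNULL(...)
   #endif

   /* Declares that both pointer arguments must not be null; a check
      can then flag any call that passes a null value for either. */
   void copy_name(char *dst, const char *src) MY_NONNULL(1, 2);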

> I am open to collaborating on this project if anyone is available.

That would be wonderful.  My suggestion is to start with a simple set
of annotations for which the corresponding checks are not too
complicated.  The design can then iterate from there.  There has been
a lot of work on annotation systems, and probably the best way to gain
traction is to start with annotations that would have the widest
practical benefit (i.e., low cost of adoption, easy implementation,
etc.).

> Also, if you would like to learn more about this project or submit
> your own proposal, feel free to contact "Ben Laurie"
> <benl at google.com>.

If you wish to discuss more aspects of the analyzer I highly recommend  
you subscribe (if you haven't already) to cfe-dev.  For those who wish  
to get directly involved in development, we discuss patches on cfe- 
commits.

I mentioned a fair number of points in this email regarding directions  
for the static analyzer.  Two particularly important action points  
(which are fairly accessible to anyone wishing to get involved) are:

1) improving the web app to manage false positives
2) run-to-run tracking of analysis results

Both points go right to the issue of user workflow and the overall  
usability of the analyzer, and don't require (at least initially)  
having an intimate knowledge of the analyzer's internals or its  
algorithms.

If you have any specific questions, please do not hesitate to ask.

Best,
Ted