[cfe-dev] usage of clang in an university project

Mon Nov 19 12:01:48 PST 2007

On Nov 18, 2007, at 9:39 AM, Nuno Lopes wrote:

> Hi again,
>
> Sorry for my late response, but I've been busy with other stuff.
> I'm writing firstly to make sure I want to do sane things and that  
> they are doable with current clang's CFG infrastructure.
>
>
> For example, one of the errors I would like to detect is the  
> following (from a patch that fixed a crash in PHP):
>
> - if (ZEND_NUM_ARGS() != 5 || zend_get_parameters_ex(4, &domain,  
> &msgid1, &msgid2, &count, &category) == FAILURE) {
> + if (ZEND_NUM_ARGS() != 5 || zend_get_parameters_ex(5, &domain,  
> &msgid1, &msgid2, &count, &category) == FAILURE) {
> WRONG_PARAM_COUNT;
> }
>
> (ZEND_NUM_ARGS() is just an int variable).
>
> My question is how can I track those values? I think that tracking  
> them in a more general way (e.g. var > 5; var2 < var1+3) needs a  
> full SAT solver. But simplifying things, is this doable with the  
> PersistentMap, for example?

Hi Nuno,

Static analysis technology can be based on a whole variety of  
"solvers" that perform symbolic logic.  Bit-blasting to propositional  
logic and solving with SAT is only one approach.  Bit-blasting, while  
extremely powerful, is a very heavy hammer, and there are hundreds of  
tradeoffs when choosing the right solver and analysis method for a  
particular problem.  Sometimes a "simpler" approach, such as those  
based on traditional dataflow analysis, can accomplish the task just  
as well, but in the process be more scalable.  It really depends on  
your problem, but you certainly don't *need* a SAT solver for tracking  
and solving the constraints you described, nor would it always be the  
best approach.

For example, there are many techniques out there use specialized  
solvers within an inter-procedural, path-sensitive, dataflow  
framework.  For example, I highly suggest checking out the work by  
Dawson Engler on his group's work on Metacompilation (and a related  
project at MSR was ESP):

Checking System Rules Using System-Specific, Programmer-Written  
Compiler Extension
Dawson Engler, Benjamin Chelf, Andy Chou, and Seth Hallem.
Appeared in: Proceedings of the 4th Symposium on Operating System  
Design and Implementation.
http://www.stanford.edu/~engler/mc-osdi.pdf

Manuvir Das, Sorin Lerner, and Mark Seigle
ESP: Path-Sensitive Program Verification in Polynomial Time
PLDI '02: Proceedings of the ACM SIGPLAN 2002 Conference on  
Programming Language Design and Implementation, Berlin, Germany, June  
2002.
http://research.microsoft.com/esp/pubs/esp-dataflow.pdf

If you want an example of a specialized solver built on top of the  
Metacompilation work, check out Archer, which found buffer overruns in  
systems code such as the Linux kernel:

ARCHER -- An Automated Tool for Detecting Buffer Access Errors
Yichen Xie, Andy Chou, and Dawson Engler
Proceedings of ESEC/FSE 2003, Helsinki, Finland
http://theory.stanford.edu/~yxie/archer.pdf

I have also seen some recent work on doing path-sensitive analysis  
using type systems, or using denotational semantics.  None of these  
are based on SAT solvers.  This isn't an argument against using SAT  
solvers; I'm just saying there are a lot of options, and it really  
depends on your problem domain.  My main argument is that often taking  
a more "light-weight" approach can get you more traction both in the  
short term, but also in terms of the overall effectiveness in the tool  
(in terms of scalability, etc.).

In terms of things that could be used to help implement such  
techniques, we now have the ImmutableMap and ImmutableSet data types  
in the LLVM libraries (see later in my email), which supersede the  
PerstistentMap implementation.  These could be used to track mappings  
from variable names to constraints (the ARCHER paper provides some  
excellent details on how such constraints can be tracked).  There of  
course needs to be some other supporting infrastructure for doing path- 
sensitive analysis, which we are working on (see below), or you would  
have to develop yourself.

> Also I would like to find memory overflow bugs, like:
> char dest[10];
> if (size <= 11)
>   memcpy(dest, input, size);
>
> At the CallExpr, how do I know that 'size' is not sanitized  
> correctly (i.e. size <= sizeof(dest))?

An excellent paper to read would be the ARCHER paper (mentioned  
above), which describes a specialized solver for finding these kinds  
of errors.  It's a fairly scalable system that had a specialized  
constraint solver that kept track of linear constraints on buffers and  
extents.  In such a system, flagging an error like this is trivial.

> Also, from which example of the Analysis dir should I base my code?

The Analysis dir contains very nascent code.  In the last couple of  
months I put in some code for a flow-sensitive dataflow analysis  
framework, and implemented "Live Variables" and "Uninitialized Values"  
on top of that framework.  As you have seen, these analyses themselves  
have bugs, some of which have been introduced as the rest of clang has  
evolved.  That said, I think it is a good baseline to build flow- 
sensitive analyses on top of, and in the process the API will be  
refined as users decide what they want (and don't want) from the  
framework.

I have also added both an ImmutableSet and ImmutableMap data  
structures to the LLVM core libraries, which implement a functional  
set and map similar to those in the Objective Caml libraries.  There  
is still some tweaking to do on their interfaces, but these ADTs were  
implemented with the intention of being used for program analysis in  
clang.  These can be used to efficiently implement dataflow "stores"  
that map from symbols to values, and their functional nature allows  
them to efficiently represent the dataflow values at different program  
points while exploiting the fact that most dataflow values across  
program points are very similar (and thus much of their representation  
can be shared).

Clang currently lacks, however, a system for doing path-sensitive  
analysis, as well as the infrastructure to support inter-procedural  
analysis.  Work on this is currently underway.  Thus there is  
currently no example code, however, in the Analysis directory (other  
than the clients of the flow-sensitive solver) on which to base your  
own code.  At this point you would have to build the necessary  
infrastructure yourself, or depending on your timeline, work with the  
libraries that we provide as we gradually bring them online.  We are  
currently building the logic into clang to serialize out its ASTs, and  
eventually build an infrastructure for reasoning about a collection of  
serialized ASTs (which could constitute an "image" of a code base, for  
example) to enable inter-procedural analysis (among other things).

Our initial goals are to provide a dataflow-based path-sensitive  
solver and supporting libraries, e.g., similar to those in ESP or  
Metacompilation, all of which are based on the standard Reps-Horwitz- 
Sagiv (RHS) inter-procedural, path-sensitive dataflow algorithm:

Precise interprocedural dataflow analysis via graph reachability
http://www.cs.cornell.edu/courses/cs711/2005fa/papers/rhs-popl95.pdf

This approach, as illustrated by the ESP and Metacompilation systems,  
permits a wide range of checking tools, and it was also the basis of  
the model-checker SLAM that was developed at MSR (although SLAM  
operated on boolean programs and solved constraints using a theorem  
prover, it was still based on RHS).

Our current goals for clang are to provide the infrastructure for  
building such tools, and we plan on focusing on building such  
libraries starting in the next month or so.  I should caution that  
developing these libraries will be a process, and will not happen  
overnight.

Note that we are not engineering clang to support only one kind of  
analysis framework.  For example, a program analysis system such as  
those based on SAT (such as the Saturn project at Stanford) or any  
other technology are all reasonable things that could be built on clang.

My advice is that you that you approach the problem that you wish to  
tackle by first simplifying it.  Don't go for the most general  
solution, and think about how you would go about solving a special  
subset of the problem.  From that you can build simple analyses, which  
can gradually be refined on a need by need basis.  I would do this  
regardless of whether or not you are using clang or some other  
framework.  If your project will be over a long period of time, you  
may potentially be able to use some of the infrastructure we are  
bringing online (and perhaps, if you are interested, you can also be  
active in its development).  Otherwise, I would stick with a project  
that is manageable and only requires that you develop the  
infrastructure you need to get some initial results.  This will allow  
you to quickly experiment with new things, which will better help you  
understand your problem domain as well as the right program analysis  
approach to solving it.

Ted