[cfe-dev] usage of clang in an university project
Ted Kremenek
kremenek at apple.com
Mon Nov 19 12:01:48 PST 2007
On Nov 18, 2007, at 9:39 AM, Nuno Lopes wrote:
> Hi again,
>
> Sorry for my late response, but I've been busy with other stuff.
> I'm writing firstly to make sure I want to do sane things and that
> they are doable with current clang's CFG infrastructure.
>
>
> For example, one of the errors I would like to detect is the
> following (from a patch that fixed a crash in PHP):
>
> - if (ZEND_NUM_ARGS() != 5 || zend_get_parameters_ex(4, &domain,
> &msgid1, &msgid2, &count, &category) == FAILURE) {
> + if (ZEND_NUM_ARGS() != 5 || zend_get_parameters_ex(5, &domain,
> &msgid1, &msgid2, &count, &category) == FAILURE) {
> WRONG_PARAM_COUNT;
> }
>
> (ZEND_NUM_ARGS() is just an int variable).
>
> My question is how can I track those values? I think that tracking
> them in a more general way (e.g. var > 5; var2 < var1+3) needs a
> full SAT solver. But simplifying things, is this doable with the
> PersistentMap, for example?
Hi Nuno,
Static analysis technology can be based on a whole variety of
"solvers" that perform symbolic logic. Bit-blasting to propositional
logic and solving with SAT is only one approach. Bit-blasting, while
extremely powerful, is a very heavy hammer, and there are hundreds of
tradeoffs when choosing the right solver and analysis method for a
particular problem. Sometimes a "simpler" approach, such as those
based on traditional dataflow analysis, can accomplish the task just
as well, but in the process be more scalable. It really depends on
your problem, but you certainly don't *need* a SAT solver for tracking
and solving the constraints you described, nor would it always be the
best approach.
For example, there are many techniques out there use specialized
solvers within an inter-procedural, path-sensitive, dataflow
framework. For example, I highly suggest checking out the work by
Dawson Engler on his group's work on Metacompilation (and a related
project at MSR was ESP):
Checking System Rules Using System-Specific, Programmer-Written
Compiler Extension
Dawson Engler, Benjamin Chelf, Andy Chou, and Seth Hallem.
Appeared in: Proceedings of the 4th Symposium on Operating System
Design and Implementation.
http://www.stanford.edu/~engler/mc-osdi.pdf
Manuvir Das, Sorin Lerner, and Mark Seigle
ESP: Path-Sensitive Program Verification in Polynomial Time
PLDI '02: Proceedings of the ACM SIGPLAN 2002 Conference on
Programming Language Design and Implementation, Berlin, Germany, June
2002.
http://research.microsoft.com/esp/pubs/esp-dataflow.pdf
If you want an example of a specialized solver built on top of the
Metacompilation work, check out Archer, which found buffer overruns in
systems code such as the Linux kernel:
ARCHER -- An Automated Tool for Detecting Buffer Access Errors
Yichen Xie, Andy Chou, and Dawson Engler
Proceedings of ESEC/FSE 2003, Helsinki, Finland
http://theory.stanford.edu/~yxie/archer.pdf
I have also seen some recent work on doing path-sensitive analysis
using type systems, or using denotational semantics. None of these
are based on SAT solvers. This isn't an argument against using SAT
solvers; I'm just saying there are a lot of options, and it really
depends on your problem domain. My main argument is that often taking
a more "light-weight" approach can get you more traction both in the
short term, but also in terms of the overall effectiveness in the tool
(in terms of scalability, etc.).
In terms of things that could be used to help implement such
techniques, we now have the ImmutableMap and ImmutableSet data types
in the LLVM libraries (see later in my email), which supersede the
PerstistentMap implementation. These could be used to track mappings
from variable names to constraints (the ARCHER paper provides some
excellent details on how such constraints can be tracked). There of
course needs to be some other supporting infrastructure for doing path-
sensitive analysis, which we are working on (see below), or you would
have to develop yourself.
> Also I would like to find memory overflow bugs, like:
> char dest[10];
> if (size <= 11)
> memcpy(dest, input, size);
>
> At the CallExpr, how do I know that 'size' is not sanitized
> correctly (i.e. size <= sizeof(dest))?
An excellent paper to read would be the ARCHER paper (mentioned
above), which describes a specialized solver for finding these kinds
of errors. It's a fairly scalable system that had a specialized
constraint solver that kept track of linear constraints on buffers and
extents. In such a system, flagging an error like this is trivial.
> Also, from which example of the Analysis dir should I base my code?
The Analysis dir contains very nascent code. In the last couple of
months I put in some code for a flow-sensitive dataflow analysis
framework, and implemented "Live Variables" and "Uninitialized Values"
on top of that framework. As you have seen, these analyses themselves
have bugs, some of which have been introduced as the rest of clang has
evolved. That said, I think it is a good baseline to build flow-
sensitive analyses on top of, and in the process the API will be
refined as users decide what they want (and don't want) from the
framework.
I have also added both an ImmutableSet and ImmutableMap data
structures to the LLVM core libraries, which implement a functional
set and map similar to those in the Objective Caml libraries. There
is still some tweaking to do on their interfaces, but these ADTs were
implemented with the intention of being used for program analysis in
clang. These can be used to efficiently implement dataflow "stores"
that map from symbols to values, and their functional nature allows
them to efficiently represent the dataflow values at different program
points while exploiting the fact that most dataflow values across
program points are very similar (and thus much of their representation
can be shared).
Clang currently lacks, however, a system for doing path-sensitive
analysis, as well as the infrastructure to support inter-procedural
analysis. Work on this is currently underway. Thus there is
currently no example code, however, in the Analysis directory (other
than the clients of the flow-sensitive solver) on which to base your
own code. At this point you would have to build the necessary
infrastructure yourself, or depending on your timeline, work with the
libraries that we provide as we gradually bring them online. We are
currently building the logic into clang to serialize out its ASTs, and
eventually build an infrastructure for reasoning about a collection of
serialized ASTs (which could constitute an "image" of a code base, for
example) to enable inter-procedural analysis (among other things).
Our initial goals are to provide a dataflow-based path-sensitive
solver and supporting libraries, e.g., similar to those in ESP or
Metacompilation, all of which are based on the standard Reps-Horwitz-
Sagiv (RHS) inter-procedural, path-sensitive dataflow algorithm:
Precise interprocedural dataflow analysis via graph reachability
http://www.cs.cornell.edu/courses/cs711/2005fa/papers/rhs-popl95.pdf
This approach, as illustrated by the ESP and Metacompilation systems,
permits a wide range of checking tools, and it was also the basis of
the model-checker SLAM that was developed at MSR (although SLAM
operated on boolean programs and solved constraints using a theorem
prover, it was still based on RHS).
Our current goals for clang are to provide the infrastructure for
building such tools, and we plan on focusing on building such
libraries starting in the next month or so. I should caution that
developing these libraries will be a process, and will not happen
overnight.
Note that we are not engineering clang to support only one kind of
analysis framework. For example, a program analysis system such as
those based on SAT (such as the Saturn project at Stanford) or any
other technology are all reasonable things that could be built on clang.
My advice is that you that you approach the problem that you wish to
tackle by first simplifying it. Don't go for the most general
solution, and think about how you would go about solving a special
subset of the problem. From that you can build simple analyses, which
can gradually be refined on a need by need basis. I would do this
regardless of whether or not you are using clang or some other
framework. If your project will be over a long period of time, you
may potentially be able to use some of the infrastructure we are
bringing online (and perhaps, if you are interested, you can also be
active in its development). Otherwise, I would stick with a project
that is manageable and only requires that you develop the
infrastructure you need to get some initial results. This will allow
you to quickly experiment with new things, which will better help you
understand your problem domain as well as the right program analysis
approach to solving it.
Ted
More information about the cfe-dev
mailing list