[cfe-dev] my experience with clang
Ted Kremenek
kremenek at apple.com
Mon Jan 7 16:01:27 PST 2008
Hi Nuno,
Sorry for the late response to this email. As I promised in my
personal communication, I wanted to take a look at what you did in
some detail after the holidays so that I could share it with the
list. I think it is exciting what you were able to do with clang in
such a short time. Comments inline.
On Dec 22, 2007, at 5:47 AM, Nuno Lopes wrote:
>
>
> I used clang to make a static code analyzer tool, based on the
> ARCHER paper
> from Stanford, albeit simpler. It is able to detect both static and
> dynamic
> memory overflows. It only supports intra-procedural analysis. It also
> provides analysis for the PHP interpreter API varargs functions
> (printf
> style).
For those of you who are not familiar, Archer essentially performs a
DFS traversal of the possible "states" of a program/function, starting
from the function entrance. Each state consists of a set of
constraints on the values of one or more variables; e.g. that the
variable 'x' is in the range '0' to '5'. These symbolic ranges can
then be used to flag possible buffer overruns by determining if an
index value could ever exceed the bounds of an array.
I like how your implementation resembles an interpreter. While the
implementation style varies from using a dispatches based on switch
statements or a chain of if..else's, overall it's the same mindset
that I believe was used to construct the original Archer tool, and a
similar design view will be incorporated into the path-sensitive
solver in clang.
The checking of the parameters for PHP is also really nice. With not
that much code you were able to write a custom check for a code base
that in practice can be really useful.
Regarding your implementation of the buffer overrun checker, one thing
that I wasn't certain about was whether or not your analysis did any
backtracking when it encountered an infeasible state. For example:
if (x == 1) // do something
...
if (x == 1) // do something
Here the two branches are control-dependent, and there are only two
paths here. Backtracking in the DFS once an infeasible path is
encountered can greatly reduce your search space, and thus increase
your scalability. From what I can tell the checking of constraints is
lazy; I believe they are only checked at an array access, and this is
to only determine if an out-of-bounds access can occur, not if the
state is infeasible. For checking feasibility of paths, one should
try and do this eagerly to prune off as much of the search space as
possible. Explicit state space exploration with caching in the worst
case is exponential, but never caching means that the algorithm is
always exponential. It also means that loops will be explored
indefinitely, and thus cause certain paths to never be explored
because of the nature of DFS.
As a possible algorithmic optimization, while the Archer paper did not
use state caching (arguing that it didn't help), I believe that one
could use liveness information to prune variables from a state, and
thus increase the possibility of a cache hit. Caching is one of the
key ways that analyses that perform an explicit state space
exploration actually achieve real scalability. Clang has a liveness
analysis already available, so it could be employed for this purpose.
>
> In case someone is interested, the full source-code is available at:
> http://web.ist.utl.pt/nuno.lopes/sirs-project.tar.bz2
> It also includes a presentation of the project in Portuguese, as
> well as
> some examples of bugs that it is able to find.
FYI: Using Google's translation tools, I was able to read most of your
presentation. It's a good short presentation. Thanks for providing it!
> My code doesn't use the clang analysis framework, as the path-
> sensitive
> analyzer wasn't ready by the time I started the project.
>
> So, about clang.. It is a very nice tool with a low learning curve.
> really.
> I once tried to look to the gcc code and I gave up (I admit I didn't
> try too
> much, but..). From all the compiler tools I've worked so far, clang
> proved
> to be the easiest one. This is due to the nice C++/OOP usage, as
> well as an
> intuitive AST (if you know C, you know how the AST looks like).
> A con of clang in the point of view of code analysis is that clang is
> optimized for IDEs. That means that some AST nodes could be removed
> altogether (e.g. ParenExpr). Also, similar expressions are represented
> differently:
>
> int x=2;
> and
> int x; x=2;
>
> This makes sense in the IDE world, but only makes things more
> difficult in
> the analysis world. But I'm not sure how clang could be improved any
> further
> about this point.
This is an excellent point, and as I believe you are implying, such
things are an inherit tradeoff of trying to capture more of the
lexical information of the program in the ASTs. There are two things
I would like to add about this point when it comes to thinking about
writing source-level tools.
The first point I would like to mention is that in general different
analyses will "care" about certain kinds of AST nodes and be
completely uninterested in other nodes. Thus almost any analysis at
the source-level has to deal with "cruft" that it has to ignore, even
though the nodes it ignores are semantically meaningful. From this
perspective, adding things like ParenExpr, while sometimes annoying to
deal with, doesn't necessarily require much more effort to handle,
especially if you add wrapper functions (as you did in your
implementation) to essentially strip such nodes away.
The second point I would like to add concerns dealing with similar
expressions such as:
int x=2
and
int x; x=2;
In general, handling both cases should fall out automatically in most
cases simply by making the analysis *semantically* driven rather than
syntactically driven. If one reasons about the semantics of the
expressions using first-principles, such "corner cases" often are
handled automatically. Your implementation of the buffer overrun
checker actually appears to be fairly semantically based, which means
it doesn't have to resort so much to the hacks that appear in bug-
finding tools like lint (and even gcc) when flagging certain kinds of
errors. Tools that essentially try to perform a "grep" on the AST
tend to be fairly fragile, and can miss bugs easily because the same
thing is written in a slightly different way. Of course there will
always be corner cases that tools must be engineered to handle, but
not making unnecessary assumptions about the syntactic structure of
code goes a long way.
>
> Also using clang as a gcc replacement is very difficult, mainly
> where you
> are using ./configure && make. I had to do a script to strip unknown
> options, as well as run gcc in parallel to clang (as ./configure
> usually
> checks if the compiler is able to create executable files).
This is another, extremely valid point. The current incarnation of
the clang driver is not a plug-in replacement for gcc; it doesn't
support many (most?) of its arguments. There are many parties
interested in developing a driver or evolving the current one to
provide (near) plug-in replacement for gcc. As usual, anyone
interested in helping out with this effort is more than encouraged to
submit patches!
Stepping back a bit, almost all source-code analysis tools like those
vended by GrammaTech and Coverity use various techniques to intercept
the compiler, parse the source file that the compiler would parse,
store some representation of the parsed file on disk, and then invoke
the regular compiler as usual. The actual source-code analysis is
then done offline, apart from the actual build of the codebase. The
reason for this is that static analysis tools that are interprocedural
require a whole-program image, which really can only be done after
"capturing the build" of the entire codebase.
We are currently building infrastructure that can be used to capture
the build for performing interprocedural analysis. Essentially we
will intercept the native compiler, serialize the ASTs to disk, and
then later perform some indexing on the serialized ASTs. Capturing
the build of course is not only useful for finding bugs, but for other
kinds of analyses as well, and so it is something we hope to do right
in clang to serve various clients.
> If I would recomend clang? Yes, sure! Although the API is not
> stable, it's
> still a nice framework.
Wonderful! Natural we hope to have the API stabilize through
continued use and development of the front-end.
> I hope this is not the end of my work in the compiler world :)
So do we.
Thanks for sharing with us your experiences with using clang!
Cheers,
Ted
More information about the cfe-dev
mailing list