[cfe-dev] my experience with clang

Mon Jan 7 16:01:27 PST 2008

Hi Nuno,

Sorry for the late response to this email.  As I promised in my  
personal communication, I wanted to take a look at what you did in  
some detail after the holidays so that I could share it with the  
list.  I think it is exciting what you were able to do with clang in  
such a short time.  Comments inline.

On Dec 22, 2007, at 5:47 AM, Nuno Lopes wrote:
>
>
> I used clang to make a static code analyzer tool, based on the  
> ARCHER paper
> from Stanford, albeit simpler. It is able to detect both static and  
> dynamic
> memory overflows. It only supports intra-procedural analysis. It also
> provides analysis for the PHP interpreter API varargs functions  
> (printf
> style).

For those of you who are not familiar, Archer essentially performs a  
DFS traversal of the possible "states" of a program/function, starting  
from the function entrance.  Each state consists of a set of  
constraints on the values of one or more variables; e.g. that the  
variable 'x' is in the range '0' to '5'.  These symbolic ranges can  
then be used to flag possible buffer overruns by determining if an  
index value could ever exceed the bounds of an array.

I like how your implementation resembles an interpreter.  While the  
implementation style varies from using a dispatches based on switch  
statements or a chain of if..else's, overall it's the same mindset  
that I believe was used to construct the original Archer tool, and a  
similar design view will be incorporated into the path-sensitive  
solver in clang.

The checking of the parameters for PHP is also really nice.  With not  
that much code you were able to write a custom check for a code base  
that in practice can be really useful.

Regarding your implementation of the buffer overrun checker, one thing  
that I wasn't certain about was whether or not your analysis did any  
backtracking when it encountered an infeasible state.  For example:

    if (x == 1) // do something
    ...
    if (x == 1) // do something

Here the two branches are control-dependent, and there are only two  
paths here.  Backtracking in the DFS once an infeasible path is  
encountered can greatly reduce your search space, and thus increase  
your scalability.  From what I can tell the checking of constraints is  
lazy; I believe they are only checked at an array access, and this is  
to only determine if an out-of-bounds access can occur, not if the  
state is infeasible.  For checking feasibility of paths, one should  
try and do this eagerly to prune off as much of the search space as  
possible.  Explicit state space exploration with caching in the worst  
case is exponential, but never caching means that the algorithm is  
always exponential.  It also means that loops will be explored  
indefinitely, and thus cause certain paths to never be explored  
because of the nature of DFS.

As a possible algorithmic optimization, while the Archer paper did not  
use state caching (arguing that it didn't help), I believe that one  
could use liveness information to prune variables from a state, and  
thus increase the possibility of a cache hit.  Caching is one of the  
key ways that analyses that perform an explicit state space  
exploration actually achieve real scalability.  Clang has a liveness  
analysis already available, so it could be employed for this purpose.

>
> In case someone is interested, the full source-code is available at:
> http://web.ist.utl.pt/nuno.lopes/sirs-project.tar.bz2
> It also includes a presentation of the project in Portuguese, as  
> well as
> some examples of bugs that it is able to find.

FYI: Using Google's translation tools, I was able to read most of your  
presentation. It's a good short presentation. Thanks for providing it!

> My code doesn't use the clang analysis framework, as the path- 
> sensitive
> analyzer wasn't ready by the time I started the project.
>
> So, about clang.. It is a very nice tool with a low learning curve.  
> really.
> I once tried to look to the gcc code and I gave up (I admit I didn't  
> try too
> much, but..). From all the compiler tools I've worked so far, clang  
> proved
> to be the easiest one. This is due to the nice C++/OOP usage, as  
> well as an
> intuitive AST (if you know C, you know how the AST looks like).
> A con of clang in the point of view of code analysis is that clang is
> optimized for IDEs. That means that some AST nodes could be removed
> altogether (e.g. ParenExpr). Also, similar expressions are represented
> differently:
>
> int x=2;
> and
> int x; x=2;
>
> This makes sense in the IDE world, but only makes things more  
> difficult in
> the analysis world. But I'm not sure how clang could be improved any  
> further
> about this point.

This is an excellent point, and as I believe you are implying, such  
things are an inherit tradeoff of trying to capture more of the  
lexical information of the program in the ASTs.  There are two things  
I would like to add about this point when it comes to thinking about  
writing source-level tools.

The first point I would like to mention is that in general different  
analyses will "care" about certain kinds of AST nodes and be  
completely uninterested in other nodes.  Thus almost any analysis at  
the source-level has to deal with "cruft" that it has to ignore, even  
though the nodes it ignores are semantically meaningful.  From this  
perspective, adding things like ParenExpr, while sometimes annoying to  
deal with, doesn't necessarily require much more effort to handle,  
especially if you add wrapper functions (as you did in your  
implementation) to essentially strip such nodes away.

The second point I would like to add concerns dealing with similar  
expressions such as:

int x=2
and
int x; x=2;

In general, handling both cases should fall out automatically in most  
cases simply by making the analysis *semantically* driven rather than  
syntactically driven.  If one reasons about the semantics of the  
expressions using first-principles, such "corner cases" often are  
handled automatically.  Your implementation of the buffer overrun  
checker actually appears to be fairly semantically based, which means  
it doesn't have to resort so much to the hacks that appear in bug- 
finding tools like lint (and even gcc) when flagging certain kinds of  
errors.  Tools that essentially try to perform a "grep" on the AST  
tend to be fairly fragile, and can miss bugs easily because the same  
thing is written in a slightly different way.  Of course there will  
always be corner cases that tools must be engineered to handle, but  
not making unnecessary assumptions about the syntactic structure of  
code goes a long way.

>
> Also using clang as a gcc replacement is very difficult, mainly  
> where you
> are using ./configure && make. I had to do a script to strip unknown
> options, as well as run gcc in parallel to clang (as ./configure  
> usually
> checks if the compiler is able to create executable files).

This is another, extremely valid point.  The current incarnation of  
the clang driver is not a plug-in replacement for gcc; it doesn't  
support many (most?) of its arguments.  There are many parties  
interested in developing a driver or evolving the current one to  
provide (near) plug-in replacement for gcc.  As usual, anyone  
interested in helping out with this effort is more than encouraged to  
submit patches!

Stepping back a bit, almost all source-code analysis tools like those  
vended by GrammaTech and Coverity use various techniques to intercept  
the compiler, parse the source file that the compiler would parse,  
store some representation of the parsed file on disk, and then invoke  
the regular compiler as usual.  The actual source-code analysis is  
then done offline, apart from the actual build of the codebase.  The  
reason for this is that static analysis tools that are interprocedural  
require a whole-program image, which really can only be done after  
"capturing the build" of the entire codebase.

We are currently building infrastructure that can be used to capture  
the build for performing interprocedural analysis.  Essentially we  
will intercept the native compiler, serialize the ASTs to disk, and  
then later perform some indexing on the serialized ASTs.  Capturing  
the build of course is not only useful for finding bugs, but for other  
kinds of analyses as well, and so it is something we hope to do right  
in clang to serve various clients.

> If I would recomend clang? Yes, sure! Although the API is not  
> stable, it's
> still a nice framework.

Wonderful!  Natural we hope to have the API stabilize through  
continued use and development of the front-end.

> I hope this is not the end of my work in the compiler world :)

So do we.

Thanks for sharing with us your experiences with using clang!

Cheers,
Ted