[LLVMdev] LLVM for static code analysis

Mon Dec 10 15:13:13 PST 2007

On Dec 9, 2007, at 5:22 AM, Emmanuel Bastien wrote:

> Hi,
> Apart from the Calysto project (http://www.cs.ubc.ca/~babic/index_calysto.htm 
> ), is there any other static code analysis tool based on the LLVM  
> framework?
> Calysto may be great but it seems that the source is not available  
> (yet?).
> I was quite excited by Oink/Elsa a few years ago, but the project is  
> almost dead even though the C++ parser is far from complete.
> It seems to me that everything is ready in LLVM to build industrial- 
> strength static analysis tools. Clang is of course a big step  
> towards real-time parsing and IDE integration but the quality of  
> llvm-gcc should be enough for many practical applications.
> I am interested in automated code review, coverage and metrics, as  
> done by commercial products like Parasoft C++test. What I am not  
> sure yet is whether the LLVM IR is rich enough for the job or if I  
> should wait for the dedicated C++ AST of clang.
>
> Best regards,
> Emmanuel Bastien
> Amadeus IT Group SA

Hi Emmanuel,

We are currently building a static analysis framework as part of  
clang.  The goal is to provide a framework for a variety of tools that  
could benefit from source-code level analysis, with a particular focus  
on bug-finding (and possibly verification) tools.  This work is  
currently in the early stages, but we expect it to rapidly progress  
over the next 6 months.  Naturally this work targets the languages  
currently supported (or partially supported) by clang (C and  
Objective-C), but the framework could progress to analyzing C++ as  
that language becomes supported by the frontend.  We already have a  
library in clang for performing flow-sensitive, intra-procedural  
dataflow analyses, and plan on eventually  
providing a framework for inter-procedural, path-sensitive analysis  
over entire code bases.  If you are interested in following the  
progress of this work, I encourage you to subscribe to the cfe-dev  
mailing list.  You are also more than welcome to get involved in the  
actual development of this framework by submitting patches or  
providing feedback.
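
To make "flow-sensitive, intra-procedural dataflow analysis" concrete,  
here is a minimal sketch in Python.  It is not the clang library or its  
API; the toy control-flow graph, the block and variable names, and the  
helper functions are all hypothetical and exist only for illustration.   
It runs a forward "definitely assigned" analysis over one function's  
CFG and reports uses of variables that may be uninitialized on some  
path.

```python
from collections import deque

# Hypothetical toy CFG for a single function (intra-procedural).
# Each statement is (defined_var, [used_vars]); all names are made up.
CFG = {
    "entry": {"stmts": [("x", [])], "succs": ["then", "else"]},
    "then": {"stmts": [("y", ["x"])], "succs": ["join"]},
    "else": {"stmts": [], "succs": ["join"]},
    "join": {"stmts": [("z", ["x", "y"])], "succs": []},
}


def predecessors(cfg):
    """Invert the successor edges of the CFG."""
    preds = {name: [] for name in cfg}
    for name, block in cfg.items():
        for succ in block["succs"]:
            preds[succ].append(name)
    return preds


def definitely_assigned(cfg, entry="entry"):
    """Forward 'must' analysis: variables assigned on every path into a block."""
    all_vars = {d for block in cfg.values() for d, _ in block["stmts"]}
    preds = predecessors(cfg)
    # Start optimistically (every variable assigned) and shrink to a fixpoint.
    out = {name: set(all_vars) for name in cfg}
    in_ = {name: set() for name in cfg}
    work = deque(cfg)
    while work:
        name = work.popleft()
        if name == entry or not preds[name]:
            new_in = set()
        else:
            # Meet operator: intersection over all predecessors.
            new_in = set.intersection(*(out[p] for p in preds[name]))
        in_[name] = new_in
        new_out = new_in | {d for d, _ in cfg[name]["stmts"]}
        if new_out != out[name]:
            out[name] = new_out
            for succ in cfg[name]["succs"]:
                if succ not in work:
                    work.append(succ)
    return in_


def maybe_uninitialized_uses(cfg, entry="entry"):
    """Report (block, variable) pairs where a use may precede any assignment."""
    in_ = definitely_assigned(cfg, entry)
    warnings = []
    for name, block in cfg.items():
        assigned = set(in_[name])
        for defined, used in block["stmts"]:
            for var in used:
                if var not in assigned:
                    warnings.append((name, var))
            assigned.add(defined)
    return warnings


print(maybe_uninitialized_uses(CFG))
```

In this toy graph, "y" is assigned only on the "then" branch, so the  
use of "y" in "join" is flagged.  The analysis is flow-sensitive  
because the answer depends on where in the control flow a use occurs,  
and intra-procedural because it never looks beyond a single function.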

Aside from our plans, it is probably worth me taking a moment to  
explain why we are even implementing a source-level analysis  
framework, especially when LLVM already supports an IR for analysis  
and transformation.  The motivation for providing the ability to  
perform static analysis at the source-level all comes down to  
tradeoffs.  The LLVM IR has some truly beautiful properties such as an  
SSA-form and a low-level IR that is essentially a typed assembly  
language.  The IR can capture much of the type information of the  
original program while still providing a lowered program  
representation that simplifies many kinds of analyses and program  
optimizations.  This lowering, however, is also a double-edged sword.   
Much of the original (high-level) type information of the program is  
discarded in the LLVM IR, which becomes extremely important when we  
start talking about analyzing object-oriented languages or any  
language that has a rich type system.  Such information can be  
extremely useful when improving the precision of an analysis, or  
simply for providing diagnosable output for a user concerning possible  
bugs found by the tool.  Moreover, a source-level analysis framework  
captures a wide variety of other sources of information from the  
program, such as macros, templates, scope, loop constructs, accurate  
information regarding variable and function names, etc.  All of these  
things are lost in the lowering to LLVM IR.  It is also  
in many cases much easier to provide diagnosable output to the user  
about potential bugs when full source-level information is available  
(which is more than just the line and column information that may be  
present in a .o file's debugging information or an LLVM bitcode  
file).  Of course analyzing the original source can be much messier; a  
language like C contains far more esoteric edge cases to reason about  
than the LLVM IR.

Most state-of-the-art (commercial) bug-finding tools based on static  
analysis operate on an IR that is close to the source-level.  Analysis  
tools that operate on Java code can often get away with analyzing  
just the bytecode, since the bytecode contains enough  
information to recreate much of the original Java program (the type  
system of Java is captured explicitly in the bytecode).  Nevertheless,  
this isn't always enough information.  Things get especially difficult  
in a language like C++, where macros, template instantiation, and  
operator overloading can significantly obfuscate the mapping between a  
lowered IR such as that used by LLVM and the original source code.   
There are many other tradeoffs between doing source-level and LLVM IR- 
level analysis.  Which one you use at the end of the day depends on  
your application and your precise goals.  Of course many bug-finding  
analyses could actually be done (well) at the LLVM IR level, while  
others could be done far more successfully at the source-level.

Finally, an analysis framework that allows us to reason statically at  
the source-level about the properties of C/Objective-C/C++/whatever  
programs only provides another tool in the LLVM toolbox.  When  
building a bug-finding tool, one can potentially use both the LLVM IR  
and the source-level analysis framework that will be built into clang,  
although we believe that in order to build a successful (static) bug- 
finding tool a good source-level analysis framework is a prerequisite  
piece.

Ted