[cfe-dev] Interprocedural data flow analysis

Mon Aug 25 15:09:15 PDT 2008

On Aug 23, 2008, at 10:07 AM, Kay Röpke wrote:

> Hi!
>
> Due to a relatively involved bughunt over the course of the last week,
> I've got the silly idea to look into static analysis to guard against
> the mistake in the future ;)

Hi Kay,

It sounds like a great idea!

>
>
> The problems I'm looking detect involve inter-procedural analysis
> since that's what has been biting us. Specifically I need to look for
> accesses to certain data structures and make sure those accesses are
> properly bracketed by calls to functions (only a small set of known
> functions, which makes it easier to detect - doesn't need to be fully
> generic for this problem at hand). It gets worse by the fact that the
> data structure is passed around quite a lot and that the name is
> reused all over the place.

libAnalysis currently doesn't have any boilerplate infrastructure for  
doing inter-procedural analysis.  We do provide two solvers for intra- 
procedural analysis (see below) on which you could build inter- 
procedural analysis, but there is some heavy lifting involved on your  
part.  Building a generic inter-procedural analysis engine is  
something we would like to do, but it's a big project that will be  
tackled over a long period of time.
>
> Effectively I guess I need to perform some kind of "alias" analysis.
> I'd like to do it at the source level, though in theory I suppose I
> could compile the code and perform those checks on LLVM bitcode/IR,
> too. For the tool to be most helpful, it should be able to report
> source locations, though.

Having good source information is one of the prime motivators that  
"clang static analyzer" was implemented on top of clang ASTs instead  
of the LLVM IR.

>
>
> In order to get a better feeling for what's in clang today, I
> performed some experiments. Here's a very simple example I've been
> playing around with:
>
> int first(int param1) {
> 	int first1 = param1;
> 	return param1 + 1;
> }
>
> int main(int argc, char *argv[]) {
> 	int main1;
> 	int main2 = main1;
> 	main1 = first(main2);
> 	return 0;
> }
>
> Running it with:
> 	clang -warn-dead-stores -warn-uninit-values -cfg-dump main.c
> emits the warnings (three in total) in two separate blocks, and does
> not warn about an uninitialized value for param1 in first(). The order
> and format of the output suggests that each function is looked at
> separately right after it has been parsed. Is that a correct
> observation?

That is correct.  These particular analyses are local (intra- 
procedural).  The uninitialized values analysis (which in this case is  
a very simple check) doesn't know anything about the actual value of  
param1 in the context of the call to 'first' in 'main'.

> So far I seem to have not enough knowledge about the code
> to verify it myself (note to self: spend more time in the debugger :))
>
> Further, I've not been able to answer the question whether there is
> existing code for transfer functions that performs any chasing of
> CallExprs. LiveVariables does not seem to have special code to
> "follow" calls, although I see a bunch of other Visitxxx methods.

You can define arbitrary transfer functions for arbitrary expressions  
(see below).

> Would it be as "simple" as performing something like LiveVariables
> does, but implementing VisitCallExpr and hooking that up? It seems
> like it, but since I'm not familiar enough with it yet, I thought I'd
> ask first before programming myself into a corner...

Sort of.  The difficult is handling the inter-procedural information.   
Basically, you need to write the glue yourself to capture the  
information from analyzing one function and propagating it when  
analyzing another.  There are many ways to do this.  Here is one  
generic way:

- Build "summaries" of each analyzed function, which summarizes the  
effect of calling that function (e.g., the function "first" returns  
"uninitialized" if "param1" is uninitialized).  The detail of the  
summary depends on your analysis.  Summaries can be computed out-of- 
order, lazily as you analyze one function and it calls another, etc.   
When you encounter a CallExpr when analyzing a function, you then  
consult the called function's summary to evaluate its effects.

There are many variations on how this can be carried out.

> Given the above observation about the time the analysis is performed,
> it seems to me that I at least need to postpone my analysis after
> everything is parsed. Does that make sense? Or is the order of the
> output purely coincidental in that it simply visits the blocks in that
> order and prints out diagnostics associated with the subgraph?

libAnalysis currently consists of two static analysis "engines":

(1) a flow-sensitive dataflow solver
(2) a path-sensitive dataflow engine

Both are intra-procedural analysis engines, but there is no reason why  
one could not implement inter-procedural analyses with them.

(1) is used by LiveVariables and UninitializedValues.  It merges  
information at confluence points using a specified "merge" operation.   
The uninitialized values analysis built on the flow-sensitive solver  
basically implements the -Wuninitialized check in gcc (a simple quick  
check for the use of uninitialized values).

(2) is used by the retain/release memory checker, and also does a  
bunch of built-in checks such as check for null dereferences, uses of  
uninitialized values along a specific path, etc.

I'll first talk about the flow-sensitive solver (1).

For (1), you basically implement visitor methods in your transfer  
function object (as you have already noted).  Essentially, the method  
BlockVisit is called for each block-level expression in a CFG basic  
block.  To implement your transfer function object, typically you  
subclass CFGStmtVisitor<...> as defined in CFGStmtVisitor.h, which  
implements default visitor methods that essentially do nothing.  You  
then override the methods that you care about.

For example, the dataflow solver calls BlockStmt_Visit on a  
CallExpr*.  The default implementation of BlockStmt_Visit calls  
BlockStmt_VisitCallExpr. The default implementation of  
BlockStmt_VisitCallExpr calls BlockStmt_VisitExpr.  The default  
implementation of BlockStmt_VisitExpr then calls BlockStmt_VisitStmt.   
As you see, you override the method with the level of granularity that  
you care about.

Note that the visitor methods will not (by default) do a recursive  
walk of an expression.  You either have to implement the recursion  
yourself (as done by UninitializedValues.cpp) or use an enhanced  
visitor class that does the recursion for you (as done in  
LiveVariables.cpp).

To do inter-procedural analysis, essentially you need to implement  
your own transfer function logic for CallExpr* (which may involving  
spawning off a new analysis for the called function or consulting a  
precomputed summary) or you use the intra-procedural data flow  
analysis to create a data structure which you can then later use to do  
your IPA (i.e., you construct just the facts you need from the  
analysis of a function to do a combined IPA over multiple functins).   
You really have a lot of flexibility; it just depends on your analysis  
needs.

For (2), the transfer function mechanism right now is still in flux.   
I'm gradually building up the infrastructure to "layer" or "compose"  
various transfer functions.  The path-sensitive solver operates on an  
explicit reachable state graph, where each node in the graph refers to  
a location in the CFG and a "state" representing the values of your  
analysis.  The transfer functions simply generate new states/graph  
nodes.  Some of the transfer function logic in place in the path- 
sensitive engine do some things you probably don't want to re- 
implement: constant propagation, local aliasing relationships,  
symbolic value tracking, and so on.  The hope is by having a layered  
approach one will be able to create difference checkers that represent  
the composition of different sub-analyses, with the checker writer  
only caring about writing the code that checks the property they care  
about.  This clean composition isn't there yet, but an example of how  
this can be done is with the retain/release checker (CFRetain.cpp).

In order to do IPA within a file, your best bet is probably to wait  
until the entire file (TranslationUnit) has been parsed.  You can then  
visit the function declarations in any order you wish, build up the  
necessary mappings you need, etc.  Note that this kind of IPA is  
restricted to analyzing the functions with a single file.  If your  
function implementations are scattered across multiple source files,  
you'll need to do your IPA by analyzing your functions locally,  
building up the information you need to glue their analysis results  
later, and then persistently store those results on disk (for later  
composition).  As I mentioned before, we're working on adding the  
necessary infrastructure to do full IPA across an entire codebase, but  
it's a big project and will take some time.