[cfe-dev] Interprocedural data flow analysis

Kay Röpke kroepke at classdump.org
Mon Aug 25 17:58:54 PDT 2008


Ted,

many thanks for the exhaustive reply :)

On Aug 26, 2008, at 12:09 AM, Ted Kremenek wrote:

> On Aug 23, 2008, at 10:07 AM, Kay Röpke wrote:
>
> It sounds like a great idea!

I always get myself in this kind of trouble ;)

>> The order and format of the output suggest that each function is
>> looked at separately, right after it has been parsed. Is that a
>> correct observation?
>
> That is correct.  These particular analyses are local
> (intra-procedural).  The uninitialized values analysis (which in this
> case is a very simple check) doesn't know anything about the actual
> value of param1 in the context of the call to 'first' in 'main'.

Ok, I see.

>> So far I don't seem to have enough knowledge about the code
>> to verify it myself (note to self: spend more time in the
>> debugger :))
>>
>> Further, I've not been able to answer the question whether there is
>> existing code for transfer functions that performs any chasing of
>> CallExprs. LiveVariables does not seem to have special code to
>> "follow" calls, although I see a bunch of other Visitxxx methods.
>
> You can define arbitrary transfer functions for arbitrary  
> expressions (see below).
>
>> Would it be as "simple" as performing something like LiveVariables
>> does, but implementing VisitCallExpr and hooking that up? It seems
>> like it, but since I'm not familiar enough with it yet, I thought I'd
>> ask first before programming myself into a corner...
>
> Sort of.  The difficulty is handling the inter-procedural
> information.  Basically, you need to write the glue yourself to
> capture the information from analyzing one function and propagate
> it when analyzing another.  There are many ways to do this.  Here is
> one generic way:
>
> - Build "summaries" of each analyzed function, which summarizes the  
> effect of calling that function (e.g., the function "first" returns  
> "uninitialized" if "param1" is uninitialized).  The detail of the  
> summary depends on your analysis.  Summaries can be computed out-of- 
> order, lazily as you analyze one function and it calls another,  
> etc.  When you encounter a CallExpr when analyzing a function, you  
> then consult the called function's summary to evaluate its effects.
>
> There are many variations on how this can be carried out.

Yes, I was expecting that much of the glue code would have to be  
written from scratch, but it's good to know that I seem to have  
grasped the general idea of it.
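
Just to check my understanding, here is a minimal sketch of the kind
of summary table I have in mind for the uninitialized-values case
(all names are made up by me; nothing here is actual libAnalysis API):

    #include <map>
    #include <set>

    namespace clang { class FunctionDecl; }

    // Hypothetical per-function summary for the uninitialized-values
    // check: which parameters' "uninitialized" status flows through
    // to the return value. For 'first' above this would be {0},
    // i.e. param1.
    struct UninitSummary {
      std::set<unsigned> ReturnTaintedByParam;
    };

    // Filled in lazily: when the analysis of a caller reaches a
    // CallExpr, look the callee up here, analyzing it first if it
    // has no entry yet.
    std::map<const clang::FunctionDecl*, UninitSummary> Summaries;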

>> Given the above observation about when the analysis is performed,
>> it seems to me that I at least need to postpone my analysis until
>> after everything is parsed. Does that make sense? Or is the order
>> of the output purely coincidental, in that it simply visits the
>> blocks in that order and prints out diagnostics associated with
>> the subgraph?
>
> libAnalysis currently consists of two static analysis "engines":
>
> (1) a flow-sensitive dataflow solver
> (2) a path-sensitive dataflow engine
>
> Both are intra-procedural analysis engines, but there is no reason  
> why one could not implement inter-procedural analyses with them.
>
> (1) is used by LiveVariables and UninitializedValues.  It merges
> information at confluence points using a specified "merge"
> operation.  The uninitialized values analysis built on the
> flow-sensitive solver basically implements the -Wuninitialized check
> in gcc (a simple, quick check for the use of uninitialized values).

It seems that for answering "uninitialized vs. initialized" and "live
vs. dead" questions you really want to compute what I'll call "partial
derivatives" of those functions. (My math prof would kill me if he
read this...) This info can be trivially cached, and each function can
be looked at individually, as long as you know what its inputs are.
Trivially that's true for its declared parameters, but globals would
be "parameters" as well, complicating the issue. It would turn into a
path-sensitive dataflow problem for those.
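
In code, such a "derivative" for the 'first' function from my example
is trivial. A sketch for the globals-free case (names are mine):

    #include <vector>

    enum InitState { Uninit, Init };

    // Cached "partial derivative" of 'first' with respect to
    // initializedness: the return value's state simply equals
    // param1's state. Globals that a function reads or writes would
    // show up as additional entries in Inputs (and in the outputs).
    InitState firstReturnState(const std::vector<InitState> &Inputs) {
      return Inputs[0]; // the return value tracks param1
    }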

> (2) is used by the retain/release memory checker, and also does a
> bunch of built-in checks, such as checking for null dereferences,
> uses of uninitialized values along a specific path, etc.
>
> I'll first talk about the flow-sensitive solver (1).
>
> For (1), you basically implement visitor methods in your transfer
> function object (as you have already noted).  Essentially, the
> method BlockStmt_Visit is called for each block-level expression in
> a CFG basic block.  To implement your transfer function object, you
> typically subclass CFGStmtVisitor<...> as defined in
> CFGStmtVisitor.h, which implements default visitor methods that
> essentially do nothing.  You then override the methods that you
> care about.
>
> For example, the dataflow solver calls BlockStmt_Visit on a  
> CallExpr*.  The default implementation of BlockStmt_Visit calls  
> BlockStmt_VisitCallExpr. The default implementation of  
> BlockStmt_VisitCallExpr calls BlockStmt_VisitExpr.  The default  
> implementation of BlockStmt_VisitExpr then calls  
> BlockStmt_VisitStmt.  As you see, you override the method with the  
> level of granularity that you care about.

That saves a lot of reading, thanks!
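
If I read that right, a call-aware transfer function object would look
roughly like this (an untested sketch against my reading of
CFGStmtVisitor.h; only the method names come from your description):

    #include "CFGStmtVisitor.h" // path as in the current tree

    using namespace clang;

    // Sketch: a transfer function object that only cares about calls.
    // CFGStmtVisitor supplies do-nothing defaults, so everything that
    // is not a CallExpr falls through the BlockStmt_VisitExpr /
    // BlockStmt_VisitStmt chain described above.
    class CallAwareTransfer : public CFGStmtVisitor<CallAwareTransfer> {
    public:
      void BlockStmt_VisitCallExpr(CallExpr *CE) {
        // Look up (or lazily compute) the callee's summary here and
        // fold its effects into the current dataflow value.
      }
    };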

> Note that the visitor methods will not (by default) do a recursive  
> walk of an expression.  You either have to implement the recursion  
> yourself (as done by UninitializedValues.cpp) or use an enhanced  
> visitor class that does the recursion for you (as done in  
> LiveVariables.cpp).

Is there an inherent reason for having both approaches in the code, or
is that purely historical? At first glance it sounds like you'd only
want the enhanced visitor class, but I might be missing a key point
here.

> To do inter-procedural analysis, essentially you need to implement
> your own transfer function logic for CallExpr* (which may involve
> spawning off a new analysis for the called function or consulting a
> precomputed summary), or you use the intra-procedural dataflow
> analysis to create a data structure which you can then later use to
> do your IPA (i.e., you construct just the facts you need from the
> analysis of a function to do a combined IPA over multiple
> functions).  You really have a lot of flexibility; it just depends
> on your analysis needs.

Two passes seem easier for getting a firm grip on the problem, but
they probably require more code. Having all the information available
in the second pass might help in generating better messages, though.
For example, I noticed that the warnings from my example are printed
out of order compared to their appearance in the source code, which is
understandable given your explanation above:
main.c:2:6: warning: Value stored to 'first1' during its initialization is never read
         int first1 = param1;
             ^        ~~~~~~
main.c:9:2: warning: Value stored to 'main1' is never read
         main1 = first(main2);
         ^       ~~~~~~~~~~~~
main.c:8:14: warning: use of uninitialized variable
         int main2 = main1;
                     ^
3 diagnostics generated.
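
(For reference, my example was essentially the following; I've
reconstructed everything except the three lines quoted in the
diagnostics, so treat the rest as approximate:)

    int first(int param1) {
    	int first1 = param1;   /* line 2: init value never read */
    	return param1;
    }

    int main(void) {
    	int main1;
    	int main2 = main1;     /* line 8: use of uninitialized 'main1' */
    	main1 = first(main2);  /* line 9: stored value never read */
    	return 0;
    }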


> For (2), the transfer function mechanism right now is still in
> flux.  I'm gradually building up the infrastructure to "layer" or
> "compose" various transfer functions.  The path-sensitive solver
> operates on an explicit reachable-state graph, where each node in
> the graph refers to a location in the CFG and a "state" representing
> the values of your analysis.  The transfer functions simply generate
> new states/graph nodes.  Some of the transfer function logic in
> place in the path-sensitive engine does some things you probably
> don't want to re-implement: constant propagation, local aliasing
> relationships, symbolic value tracking, and so on.  The hope is that
> by having a layered approach one will be able to create different
> checkers that represent the composition of different sub-analyses,
> with the checker writer only caring about writing the code that
> checks the property they care about.  This clean composition isn't
> there yet, but an example of how this can be done is the
> retain/release checker (CFRetain.cpp).

I assume you mean CFRefCount.cpp? Looking at ArgEffect/RetEffect and
RetainSummary, I think you implement a variant of the "derivative"
idea from above. I guess for refcounting you don't need information
about how each argument affects the return value (or values, for
(pointer-to-)struct returns), though I can imagine situations where it
would be helpful, for example inspecting the effect of a NULL
parameter to a function that returns a newly malloc'ed and (hopefully)
initialized struct, especially to check more reliably whether the
client code implements all its NULL checks correctly. If you can prove
that for a particular argument configuration the return value will
always be NULL, that should greatly improve the strength of the
analysis.
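
Concretely, I'm imagining summary entries shaped roughly like this
(purely hypothetical types of my own, not the actual CFRefCount ones):

    #include <vector>

    enum ArgFact { ArgIsNull, ArgNonNull, ArgUnknown };
    enum RetFact { RetIsNull, RetNonNull, RetUnknown };

    // Hypothetical summary entry recording an argument-to-return
    // dependency, e.g. "returns NULL whenever argument 0 is NULL" for
    // a constructor-like function that bails out on a NULL input. A
    // caller-side check could then flag dereferences of the result on
    // exactly those paths where argument 0 is NULL.
    struct RetDependence {
      std::vector<ArgFact> ArgConfiguration; // one entry per argument
      RetFact Result;                        // guaranteed return fact
    };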

> In order to do IPA within a file, your best bet is probably to wait
> until the entire file (TranslationUnit) has been parsed.  You can
> then visit the function declarations in any order you wish, build up
> the necessary mappings you need, etc.  Note that this kind of IPA is
> restricted to analyzing the functions within a single file.  If your
> function implementations are scattered across multiple source files,
> you'll need to do your IPA by analyzing your functions locally,
> building up the information you need to glue their analysis results
> together later, and then persistently storing those results on disk
> (for later composition).  As I mentioned before, we're working on
> adding the necessary infrastructure to do full IPA across an entire
> codebase, but it's a big project and will take some time.


Sure. And of course, yes, the code in question is scattered over a
hundred or so translation units :\
Worse still, most of the code is only reachable through function
pointers, because it is written in C and heavily plugin-based. I'm
happy to find a specific solution for the set of possible targets of
those pointers, however, because the set is small (5-10 implemented
functions per plugin) and pretty stable.

In light of persisting the analysis results, I think that would mean
changes to clang.cpp, since it seems to operate on translation units
as its coarsest granularity, correct?
The full IPA tool, whether that's clang itself or not, would need to
pick up all the persisted files and make sense of them; much like a
linker, it would have to resolve dependencies (unless all analysis
results go into a single file that is read as a whole). Hmm, I'll have
to think about that one.
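
As a very rough sketch of that linker-like step (the file format and
all names here are entirely made up):

    #include <map>
    #include <string>
    #include <vector>

    struct PersistedSummary { /* serialized per-function facts */ };

    // Hypothetical cross-TU step: load the per-TU summary files
    // emitted during local analysis and merge them by function name,
    // resolving calls into other TUs much like a linker resolves
    // undefined symbols.
    std::map<std::string, PersistedSummary>
    linkSummaries(const std::vector<std::string> &SummaryFiles) {
      std::map<std::string, PersistedSummary> Linked;
      for (size_t i = 0; i < SummaryFiles.size(); ++i) {
        // Deserialize SummaryFiles[i] and merge each function's
        // summary into Linked; entries for the same function from
        // different TUs must agree or be combined conservatively.
      }
      return Linked;
    }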

cheers,
-k
-- 
Kay Röpke
http://classdump.org/
