[cfe-dev] Some problems about function summary

Ted Kremenek kremenek at apple.com
Wed Jun 8 13:54:20 PDT 2011


On Apr 27, 2011, at 6:07 AM, Zhenbo Xu wrote:

> Hi, Ted,
> 
> I am trying to extend clang static analyzer to support inter-procedural
> summary-based analysis. I have implemented a simple method for detecting
> inter-procedural memory leaks, but it still has some problems that
> prevent a precise or even correct analysis. The main idea is to record
> the side effects of each function, called SE for short. The SE
> structure is a map from statements to SVals.
> For example,  
> void f(int *x) {
>  x[0] = 1;
> }
> void g() {
>  int a = 0;
>  f(&a);
> }
> The SE of function f is stmt(x[0]) -> ConcreteInt(1). (Here, x[0] must
> meet the condition that the base region of the stmt is a symbolic
> region accessed from a formal argument.)
> 
> When analyzing function g, we bind the actual argument &a to the
> formal argument x (calling the method EnterStackFrame() in the GRState
> class), and then iterate over all of f's SEs to evaluate the side
> effects (using the method ExprEngine::Visit(x[0]) to get the actual
> SVal or memory region of x[0] and assigning it the value
> ConcreteInt(1)).
> 
> There are many problems with this method.
> 1. If the index of x is a local variable, the analysis is incorrect.
> For instance, void f(int *x) { int i = 0; x[i] = 1; }. The actual
> memory region of x[i] is unknown, because the value of i in function f
> is missing from the current summary.
> 2. If there are assignments between arguments, the analysis is also
> incorrect. For instance, void f(int *x, int *y) { x = y; x[0] = 1; }.
> ...
> ...
> 
> Obviously, the summary info is insufficient. So I try to collect more
> summary info by recording each function's access paths (APs).
> Access paths are symbolic names for memory locations accessed in a
> procedure. The structure of access paths is a map from statements to
> memory regions. We use access paths to handle the translation from the
> called function's statements over formal arguments to the caller's
> actual statements. For example,  
> void f(int *x, int *y) {
>  x = y;
>  x[0] = 1;
> }
> void g(int *a, int *b) {
>  f(a, b);
> }
> The access paths of function f are stmt(x) -> SymbolicRegion($x) and
> stmt(y) -> SymbolicRegion($y). In order to use access paths, the
> structure of SE is adapted to a map from memory regions to SVals. In
> this scheme, f's SE is ElemRegion{SymbolicRegion($y), 0} ->
> ConcreteInt(1).
> 
> When applying the summary info at the function's call site, we must
> translate the called function's memory regions in the SE into the
> actual memory regions. ElemRegion{SymbolicRegion($y), 0} is translated
> into ElemRegion{SymbolicRegion($b), 0}, and then ConcreteInt(1) is
> assigned to ElemRegion{SymbolicRegion($b), 0}.
> 
> This method is difficult or even infeasible to implement, because the
> translation is not always well-defined, and the state data (such as
> memory regions) from the intra-procedural analysis is never released,
> which causes large memory consumption.
> 
> So we should define a data structure for summaries that is independent
> of GRState (and memory regions). The most difficult part is how to
> obtain the actual values from the called function's analysis data. Can
> you give me some suggestions on the design of function summaries?
> Hoping for your reply!
> 

Hi Zhenbo,

I was looking through my old emails, and stumbled upon this one.  I'm so sorry for not replying to it sooner.

Function summaries are a tough nut to crack.  There are a variety of issues to tackle.  Beyond memory updates, I think there are a bunch of other issues that need to be captured in a summary.  Here is a (partial) list:

1) Memory updates (changes to the symbolic store)
2) Return values
3) Return values via out parameters
4) Constraints added to symbols, which can add "path constraints" on memory updates, return values, etc.
5) Unifying symbols (alpha-renaming) in the symbolic store and path constraints
6) Modeling checker state updates
7) and possibly many others...

Let's break these down.

== 1. Memory updates: ==

Memory updates represent updates in the symbolic store.  While ExprEngine and checkers reason about MemRegions, there is no requirement that the underlying StoreManager actually model memory bindings using MemRegions.  Indeed, RegionStoreManager lowers complex regions into a base region + byte offset.  Since one can represent the address of a byte using different MemRegions (e.g., a FieldRegion or an ElementRegion representing a byte offset), this lowered representation has many advantages, and can possibly do a much better job of capturing the raw underlying semantics while leaving MemRegions for the higher-level clients to "describe" how memory is being used.  This is a really powerful decoupling that gives us the flexibility of swapping in a new StoreManager that uses an entirely different memory binding representation.

So what does this mean for summaries?  For summaries, I think this means that memory updates should be described in the underlying representation of the StoreManager.  When one creates a summary, ideally one just looks at the GRStates at the "return sites" for the function, and looks at the differences between the initial GRState and the end GRStates.  The StoreManager can then articulate a "diff", which represents the effects on the memory model, between the initial state and the end states.  That diff, however, is completely up to the StoreManager, which preserves the decoupling from the MemRegion abstraction and the underlying memory representation in the StoreManager.  Further, the StoreManager knows what it can and cannot represent; there is no reason to capture more information beyond that in the summary.

Now I want to get back to something you wrote:

>> When analyzing function g, we bind the actual argument &a to the
>> formal argument x (calling the method EnterStackFrame() in the GRState
>> class), and then iterate over all of f's SEs to evaluate the side
>> effects (using the method ExprEngine::Visit(x[0]) to get the actual
>> SVal or memory region of x[0] and assigning it the value
>> ConcreteInt(1)).

I may be misinterpreting you, but technically this isn't summary-based analysis, but inlining.  Summary-based analysis still involves analyzing functions in (partial) isolation, and building up summaries that allow a summary to be applied to *any* call site of a given function.  In a summary-based analysis, when we encounter a call to a function for which we have not previously generated a summary, we lazily generate one (by analyzing the called function), and then apply the summary.  Whether or not we include specific context from the caller when generating the summary is a design decision, and may provide more precision depending on the analysis, but it isn't a prerequisite for summary-based analysis.

If we take the approach that summaries are generated in a context-free manner (where the caller is not taken into account), then the summary will articulate diffs in terms of the symbolic inputs and outputs representing the values of global variables, input parameters, and any other value that is "visible" to the function.  For example, when analyzing "g", we would have symbolic values for "a" and "b", and the summary would describe its effects on "a" and "b" in terms of those symbolic values.  Then, when *applying* the summary, we need to resolve the actual arguments (which can be actual concrete values) to the symbolic values used in the summary.  This is a form of alpha-renaming (taking the idea from the lambda calculus).  When applying the summary, we may have many possible "diffs" to choose from (from the different GRStates at the end sites in the summarized function).  Not all of the diffs will generate GRStates in the caller that are consistent.  For example, if the caller has the constraint that an actual argument is non-null, and one of the diffs assumes that the formal parameter is null, then applying that diff will produce an invalid GRState because the values aren't consistent.  If none of the diffs apply cleanly, then there is no feasible path.

For memory updates, "applying the diff" is probably up to the StoreManager (since diffs aren't necessarily described in terms of MemRegions), but it has to be done in concert with the other diffs (from 2-6).

== 2. Return values ==

Like memory updates, there may be different possible return values, depending on constraints on the inputs.  This also needs to be captured in the summary.

== 3. Return values via output parameters ==

These are just captured via memory updates, so if we solve (1), we solve (3).

== 4. Constraints added to symbols ==

The diffs for (1-3) and (6) all may be *conditional* on specific values of the inputs to a function (e.g., the constraints on symbolic regions or specific symbolic values).  These constraints are modeled by the ConstraintManager.  Like StoreManager, ExprEngine and checkers talk to the ConstraintManager via an abstract interface.  The ConstraintManager, however, is allowed to represent constraints in GRState in any way that it wishes.  The "diff" for constraints that is captured in the summary should also be left up to the ConstraintManager.

What are things that would be in the diff?  Let's say we generate N different GRStates at end sites when analyzing a function for which we are creating a summary, and each of those N states has a different constraint on an input parameter.  There may also be other constraints modeled in the GRState, e.g., constraints on local variables, but those are irrelevant for the summary because only the input values dictate the effects as seen by the caller.  Thus, when generating the diff, we do the following:

* For each of the N end states, take the input state and the end state, and summarize the constraints added to each of the input values.  We discard any other constraints.  Likely we will do the exact same thing with the StoreManager: just discard memory bindings that are not visible to the caller, and summarize the differences between the initial and final memory bindings.

One thing that is cool about this approach is that we can basically just generate a new GRState that represents the "diffs", and then just alpha-rename the symbolic values to the values in the caller when applying the summary.  When alpha-renaming, we resolve both incompatible bindings and incompatible constraints between the symbolic and caller's values, and then discard any GRStates that are invalid.

== 5. Unifying symbols ==

I think I covered this in (4), but the idea is that summaries just represent constraints on input symbols from the perspective of the called function, and when we apply the summary in the caller's context, we are just "unifying" the symbols.  This can be a complex operation, and the logic can be distributed amongst the various subcomponents (e.g., StoreManager, ConstraintManager, etc.).

== 6. Modeling checker state updates ==

Like the StoreManager and ConstraintManager, checkers can also have "updates" to the state that need to be reflected in a summary.  Because the logic of individual checkers is unknown to ExprEngine, checkers alone show that the construction of summaries needs to be both (a) generic and (b) structured so that the individual subcomponents that impact what goes into a GRState are responsible both for generating the summary diff *and* for helping with the alpha-renaming process when applying a summary.  This will require extending the checker interface, and possibly providing better primitives for checker writers to make this process as painless as possible.

== 7. and possibly many others... ==

There could be other things we may need to capture in a summary.  For example, do certain path constraints imply that a bug gets reported?  Or that a path terminates?  etc.  These need to be reflected in the summary as well.


All of this is of course a big project, which is why the *inlining* approach to inter-procedural analysis is more immediately appealing for doing simple inter-procedural analysis.  With the inlining approach, we analyze the caller, and when we see a function call, we just inline the semantics of the callee in situ: we literally continue the analysis in the caller by binding the actual parameters to the formal parameters, so the callee's analysis is done purely in the caller's context.  This is easy and highly precise, but won't scale beyond simple cases.

Hope this helps,
Ted






