[cfe-commits] [Patch][review request]update to IteratorsChecker

Ted Kremenek kremenek at apple.com
Tue Mar 22 22:59:50 PDT 2011


On Mar 18, 2011, at 4:04 AM, Jim Goodnow II wrote:

> Hi Ted,
> 
> Okay, I agree and that was what I had originally intended, but ran into difficulties implementing it.

Hi Jim,

I'm not surprised there were difficulties.  Some of the things that are needed in the analyzer core to implement the right design aren't there yet.  There are also some design decisions to discuss about the correct interplay between the analyzer engine and checkers for the kind of tracking you want to do.

> So, as a first step, let's talk just about something simple:
> 
> it = v.begin();
> 
> Conceptually, the MemberCall for begin() returns a symbol region which I mark as BeginValid. It is also linked with the MemRegion associated with the instance or really any 'TypedRegion' associated with the Base of the MemberCall which is the instance. When the symbol region is assigned to the iterator, 'it', the BeginValid state gets propagated to the MemRegion associated with the iterator.
> 
> First, does the returned symbol region actually get created by the engine or do I have to do that in checkPostStmt(CXXMemberCall) or somewhere else?

The symbol region gets created by the analysis engine.  A checker should never create regions on its own.  Moreover, the region itself need not always be a symbolic region; it can be any TypedRegion.

Breaking it down from a design perspective, the analyzer engine handles the core semantics of expressions, and checkers reason about "higher-level" invariants, preconditions, postconditions, etc., that incorporate domain-specific knowledge.  I think this is fairly intuitive for "basic" expressions such as binary operators, but it's worth discussing what it means for function calls (which include C++ method calls, operator overloading, and so forth).

Fundamentally, the analyzer engine should handle the "evaluation" of a function call.  This allows the analyzer engine to make decisions about analysis precision by sometimes inlining a function call, using an interprocedural summary, or taking a "conservative" approach when it has no knowledge of what a function does and treats the return value as being symbolic.  The last case is all we have implemented right now (although there is prototype support for inlining), but it is important to keep in mind that we have various potential options here.  To illustrate, consider the following code:

  int g() { return 42; }
  void test() { int x = g(); }

When analyzing 'test', the return value of the call to 'g' is a symbolic value.  That's because the analyzer engine doesn't look at the body of 'g' when evaluating the semantics of the function call 'g()'.  Instead, a symbolic value is created to represent some "abstract value" that is returned from 'g'.  This is a very conservative approach, but not very precise.  In scenarios like these we can possibly improve precision by inlining some function calls.  In this case, when analyzing 'test', we'd *know* that the value returned from the call to 'g' was 42.  Not a symbolic value, but 42.

This reasoning extends to functions returning pointers or references.  In this case, we use SymbolicRegions instead of just symbols, but they are basically the same thing (SymbolicRegions are based on symbols).  However, a called function that returns a pointer is not guaranteed to return a SymbolicRegion.  For example:

  int *foo(int *x) { return x; }
  void test() {
     int a;
     int *b = foo(&a);
  }

In this example, the analyzer will currently evaluate the call to 'foo' as returning a SymbolicRegion instead of the VarRegion for 'a'.  That of course is imprecise; basic function inlining would show that 'foo' is the identity function, allowing us to analyze this code with perfect precision.

Calls to C++ functions/methods work the same way.  It doesn't matter how they are written (e.g., using operator overloading); at the end of the day they are still function calls, and we should leave the analyzer engine with the responsibility of handling their core semantics.

So the natural question is where does your checker fit into this picture?  Checkers layer their own semantics by adding constraints and meta-state.  For example, let's say you had a checker that *knew* a priori that the function 'g' in my example above *always* returned the value 42.  It just knows this fact as part of its domain-specific knowledge without looking at the body of 'g'.  How should it enforce that knowledge?

My thought is that checkers can enforce such knowledge by using the 'assume' API, which adds constraints to values.  For example, in the checkPostStmt callback, the checker can use "GRState::assume(...)" to assume that the return value is equal to 42.  In the conservative case where the analyzer engine returns a symbolic value, the checker will be adding the constraint that the symbolic value is equal to the number 42.  In the more precise case where the analyzer engine inlines the function call, then the "assumption" will be that 42 (the actual return value) is equal to 42 (the value assumed by the checker).  This is a tautology, so no constraints are added, but the logic is declarative and generic.
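
To make that concrete, here's a rough sketch of what that pattern can look like in a checker.  The checker name is made up, the registration boilerplate and #includes are omitted, and the exact signatures (written against the GRState-era API) may differ slightly from what's in the tree, so treat this as illustrative rather than as actual IteratorsChecker code:

  // Hypothetical checker that "knows" g() always returns 42 and encodes
  // that knowledge as a constraint via the 'assume' API.
  namespace {
  class KnowsAboutGChecker : public Checker< check::PostStmt<CallExpr> > {
  public:
    void checkPostStmt(const CallExpr *CE, CheckerContext &C) const;
  };
  }

  void KnowsAboutGChecker::checkPostStmt(const CallExpr *CE,
                                         CheckerContext &C) const {
    const GRState *state = C.getState();
    SValBuilder &svalBuilder = C.getSValBuilder();

    // The engine has already bound a value to the call expression; in the
    // conservative case it's a fresh symbol, with inlining it could be 42.
    SVal RetVal = state->getSVal(CE);

    // Build the constraint "RetVal == 42" ...
    SVal Constraint =
      svalBuilder.evalBinOp(state, BO_EQ, RetVal,
                            svalBuilder.makeIntVal(42, CE->getType()),
                            svalBuilder.getConditionType());

    // ... and assume it holds.  A null result from assume() means the
    // constraint is infeasible given what is already known (see below).
    if (DefinedOrUnknownSVal *DV = dyn_cast<DefinedOrUnknownSVal>(&Constraint))
      if (const GRState *newState = state->assume(*DV, true))
        C.addTransition(newState);
  }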

What happens if the checker assumes something that isn't compatible with the other constraints on the value (e.g., the value is 32 instead of 42)?  In that case, the call to "assume()" will return a null GRState*, and the state is infeasible.  That's either a bug in the checker (whose assumption was wrong) or a bug in the code, which the checker could report.  In any case, everything is logically consistent and composes naturally.  The checker doesn't need to specially handle SymbolicRegions, since everything is declarative.

Now let's talk about specifics for your checker.

> Conceptually, the MemberCall for begin() returns a symbol region which I mark as BeginValid.

Instead of reasoning about a symbolic region, the checker only needs to care that the call to begin() returns a region; that region can then be marked "BeginValid" without any reasoning about symbols.

> It is also linked with the MemRegion associated with the instance or really any 'TypedRegion' associated with the Base of the MemberCall which is the instance.

This is another example of a relationship that the checker, not the analyzer engine, reasons about.  It makes sense to track it with an ImmutableMap in the GDM, as it represents an "invariant" or "relationship" between those two regions that goes beyond raw expression semantics.
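
As a sketch (with made-up names; the real IteratorsChecker's GDM layout will differ), the two pieces of checker-side state could look roughly like this, using the usual GRStateTrait/ImmutableMap registration pattern:

  // Illustrative checker-side state.  'IterStateMap' records what the checker
  // knows about an iterator's region (e.g., BeginValid); 'IterContainerMap'
  // records which container region an iterator region was obtained from.
  namespace {
  enum IteratorKind { BeginValid, EndValid, Invalidated };
  struct IterStateMap {};      // GDM tag: MemRegion* -> IteratorKind
  struct IterContainerMap {};  // GDM tag: MemRegion* -> container MemRegion*
  }

  namespace clang {
  namespace ento {
    template <> struct GRStateTrait<IterStateMap>
      : public GRStatePartialTrait<
          llvm::ImmutableMap<const MemRegion*, IteratorKind> > {
      static void *GDMIndex() { static int index; return &index; }
    };
    template <> struct GRStateTrait<IterContainerMap>
      : public GRStatePartialTrait<
          llvm::ImmutableMap<const MemRegion*, const MemRegion*> > {
      static void *GDMIndex() { static int index; return &index; }
    };
  }
  }

  // Marking the result of begin(): record BeginValid for the returned
  // region and link it to the container region it came from.
  static const GRState *markBeginValid(const GRState *state,
                                       const MemRegion *IterRegion,
                                       const MemRegion *ContainerRegion) {
    state = state->set<IterStateMap>(IterRegion, BeginValid);
    state = state->set<IterContainerMap>(IterRegion, ContainerRegion);
    return state;
  }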

> When the symbol region is assigned to the iterator, 'it', the BeginValid state gets propagated to the MemRegion associated with the iterator.


This is the tricky part.  The low-level semantics of C++ actually involve either an assignment or a copy constructor.  Technically a new object is getting created/initialized, which means we need to think about yet another region.  The analyzer engine should handle that part.  It should reason about the copy constructor effects (if any), just like any other function call.

What the analyzer engine *does not* handle is propagating the relationships you were tracking for the original iterator's region to the new iterator's region.  That needs to be handled by the checker.  This could be done either as a "postcondition", e.g., in checkPostStmt(), or we could consider adding additional API hooks to Checker so that the analyzer engine helps checkers propagate such relationships.

Fundamentally, propagating such relationships is the checker's responsibility because the checker is "declaring" the postconditions of the assignment.  The postcondition is that the assigned iterator has the same iterator semantics as the original.  What that means is up to the checker, but it's the checker's responsibility to declare and enforce those constraints.  We can add additional APIs to Checker that the analyzer engine can call to make this more automated, but fundamentally that's the design we're going for.
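
Concretely, and continuing the made-up names from the earlier sketch, the checker-side "postcondition" of the copy could be a small propagation step invoked from checkPostStmt() on the copy/assignment:

  // Illustrative propagation step: after the engine models 'dst = src' for
  // iterators, mirror the checker's tracked facts from the source iterator's
  // region onto the destination iterator's region.
  static const GRState *propagateIteratorFacts(const GRState *state,
                                               const MemRegion *Dst,
                                               const MemRegion *Src) {
    if (const IteratorKind *K = state->get<IterStateMap>(Src))
      state = state->set<IterStateMap>(Dst, *K);
    if (const MemRegion * const *Cont = state->get<IterContainerMap>(Src))
      state = state->set<IterContainerMap>(Dst, *Cont);
    return state;
  }

If we later grow a dedicated hook for this, only the call site changes; the propagation logic itself stays in the checker.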

> 
> Second, do I still need to handle the assignment in OperatorCall since it is essentially a structure/class assignment or will that be handled by the Engine logic?

I think I just answered this question.  Fundamentally, the analyzer engine should handle the function call and the structure/class assignment, but the checker will need to participate in this logic by propagating the iterator "invariants" and "constraints" forward after the copy.

Does this clarify things a bit?  I know this is a bit high-level, and there are certainly pieces of the analyzer engine that are still missing and don't yet implement everything I described, but I think this is the right design to go for.  It clearly separates who is responsible for which semantics, and gets checkers out of the business of evaluating the raw semantics of expressions.

Cheers,
Ted


