[cfe-dev] Several questions about UncheckedReturn statistic Checker

Mon Mar 14 15:42:10 PDT 2011

On Mar 9, 2011, at 10:28 PM, 章磊 wrote:

> Hi clang,
> 
> Sorry for this late reply.
> 
> I reimplement UncheckedReturnChecker, here are some key points and some problems confusing me.

Hi Lei,

Thanks for continuing to push forward on this.  Comments inline.

> Instead of the original syntactic approach, i track the symbolic values in BinaryOperator & UnaryOperator. It means every BinaryOperator & UnaryOperator will have a symbol to record the callexprs in subExpr. (So i create many Conjured Symbol to do this, is this all alright?)
I don't think this is quite the right approach, although your patch raises some interesting questions.  In principal, you shouldn't need to create any new conjured symbols at all.  Conjured symbols represent values to be tracked by the analyzer.  Conceptually, your checker only tracks meta-data associated with symbols.  However, what you are trying to do is a form of "taint" analysis, which may require introducing more symbols.

A few specific comments:

+  // Once we encounter a call expr, set the initial state to Unchecked.
+  if (SymbolRef Sym = V.getAsSymbol()) {
+    llvm::DenseSet<StateRef> *CRStates = new llvm::DenseSet<StateRef>();
+    CheckedReturnSet CRSet = CheckedReturnSet(CE, *CRStates);
+    CRSet.addState(CheckedReturnState::getUnchecked(CE));
+    state = state->set<CheckedReturnSetState>(Sym, CRSet);
+    C.addTransition(state);
+  }

DenseMaps should never be used as data stored by a GRState.  DenseMaps are not immutable, and your checker will now allocate a ton of memory and leak like crazy.  For data to be tracked for a given GRState, you must use immutable data structures such as ImmutableMap.  The reason is two fold:

1) ImmutableMaps can be reused in multiple instances of GRState, and we can compare (using pointer equality) if two ImmutableMaps are equivalent.  This allows us to do state caching, and also merge paths when two ExplodedNodes have the same ProgramPoint and GRState.

2) All memory associated with the ImmutableMaps (particularly those managed by the GDM) are region allocated and destroyed.  Such memory is allocated efficiently.  Using DenseMaps for per-GRState involves malloc()'ing a bunch of memory that just gets leaked.  That's really inefficient, and algorithmically not correct.

Instead, I'd expect to see something like:

if (SymbolRef sym = V.getAsSymbol()) {
  state = state->add<CheckedReturnState>(sym, CheckedReturnState::getUnchecked(CE))
  C.addTransition(state);
}

That's a lot less code and doesn't allocate a ton of (mutable) memory.  Id then replace CheckedReturnState with some like the following:

enum CheckReturnStateKind { Unchecked, Checked };
typedef llvm::ImmutableMap<SymbolRef, CheckReturnStateKind > CheckedReturnState;
namespace clang { namespace ento {
  template<>
  struct GRStateTrait<CheckReturnState > : public GRStatePartialTrait< CheckedReturnState > {
    static void* GDMIndex() { static int index = 0; return &index; }
  };
}}

That reduces about 60 lines of code to 8, and then allows you to use CheckedReturnState as I illustrated above.  After reading through more of your code, I realize that this doesn't quite match up with your algorithm, so let's continue to discuss the algorithmic design and see how it matches up with the static analyzer APIs we have.

Now let's look at checkPostStmt(BinaryOperator*):

+void UncheckedReturnChecker::checkPostStmt(const BinaryOperator *B, 
+                                           CheckerContext &C) const {
+  const GRState *state = C.getState();
+  SVal V = state->getSVal(B);
+
+  if (V.isUnknown())
+    return;
+
+  SymbolRef LHSSym = state->getSVal(B->getLHS()).getAsSymbol();
+  if (!LHSSym) {
+    // IF LHSSym is null, find conjured symbol in SymbolManager.
+    unsigned Count = C.getNodeBuilder().getCurrentBlockCount();
+    SValBuilder &svalBuilder = C.getSValBuilder();
+    LHSSym = svalBuilder.getConjuredSymbol(B->getLHS(), Count, NULL);
+  }
+  const CheckedReturnSet *LHCRSet = state->get<CheckedReturnSetState>(LHSSym);
+  
+  SymbolRef RHSSym = state->getSVal(B->getRHS()).getAsSymbol();
+  if (!RHSSym) {
+    // IF RHSSym is null, find conjured symbol in SymbolManager.
+    unsigned Count = C.getNodeBuilder().getCurrentBlockCount();
+    SValBuilder &svalBuilder = C.getSValBuilder();
+    RHSSym = svalBuilder.getConjuredSymbol(B->getRHS(), Count, NULL);
+  }
+  const CheckedReturnSet *RHCRSet = state->get<CheckedReturnSetState>(RHSSym);
+
+  BinaryOperator::Opcode Op = B->getOpcode();
+  
+  switch(Op) {
+  default:
+    // For the rest operators, do nothing.
+    break;
+
+    // An assignment expression has the value of LHS after the assignment.
+  case BO_Assign:
+  case BO_MulAssign:
+  case BO_DivAssign:
+  case BO_RemAssign:
+  case BO_AddAssign:
+  case BO_SubAssign:
+  case BO_ShlAssign:
+  case BO_ShrAssign:
+  case BO_AndAssign:
+  case BO_XorAssign:
+  case BO_OrAssign:
+    RHCRSet = NULL;
+    break;
+
+    // For Pointer-to-member operators, the result is an object or a function 
+    // of the type specified by the RHS.
+  case BO_PtrMemD:
+  case BO_PtrMemI:
+    // For comma operator, LHS is evaluated as a void expression.
+  case BO_Comma:
+    LHCRSet = NULL;
+    break;
+  }
+
+  // Combine the CheckedReturnSet of LHS and RHS.
+  if (LHCRSet || RHCRSet) {
+    SymbolRef Sym = V.getAsSymbol();
+    if (!Sym) {
+      // If Sym is null, create a new Conjured symbol for this BinaryOperator.
+      unsigned Count = C.getNodeBuilder().getCurrentBlockCount();
+      SValBuilder &svalBuilder = C.getSValBuilder();
+      Sym = svalBuilder.getConjuredSymbol(B, Count, NULL);
+    }
+    
+    llvm::DenseSet<StateRef> *CRStates = new llvm::DenseSet<StateRef>();

This is really inefficient (per comments above).

+    const CheckedReturnSet *CRSet = new CheckedReturnSet(B, *CRStates);

This is really inefficient (per comments above), and not algorithmically correct because now we won't have any state caching.  The analysis will now be inherently exponential, since we will get no path merging.  That will make the analysis a lot slower.

+    CRSet->Combine(LHCRSet, RHCRSet);
+    state = state->set<CheckedReturnSetState>(Sym, *CRSet);
+    C.addTransition(state);
+  }
+  
+  return;
+}

I think I see what you are trying to do.  Essentially:

a) the checker tracks symbols returned from function calls

b) binary/unary operations can create new values that are *derived* from symbols we are tracking.  When those new values are checked, we want to treat the original symbols as tracked, so we need a way to track that a value is derived from another.

c) What are the "checked" propagation rules?  For example:

  x = foo(); // new symbol
  y = bar(); // new symbol
  z = x + y; // new symbol, based on the symbolic values in 'x' and 'y'
  if (z) ...

Algorithmically, at 'if (z)', should the values in 'x' and 'y' be considered checked?  I'm not certain.  I think it really depends on your analysis and the "taint" propagation rules.

My gut feeling is that (b) should be managed by the analyzer engine, not a specific checker.  The reason is that many checkers may wish to do taint analysis (a general term for these kinds of problems), and track how a symbolic value "morphs" into other values.  (c) seems to be the purview of a specific checker.

So here is how I'd expect the checker to be written:

I) The core analysis engine provides a few key APIs for doing taint analysis that can be used by specific checkers.  Specifically:

a) CheckerContext provides an API for a checker to indicate that it wants to start tracking the "taint" state of a given function return value.  Operationally, this would do the following:

- If the return value is a symbol, the analysis engine adds that symbol to a list of symbols to be tracked for taint analysis.
- If the return value is not a symbol, the analysis engine conjures a new symbol, constraints that symbol to have the value returned from the function call, and then adds that symbol to a list of symbols to be tracked for taint analysis

b) When the analysis engine evaluates unary or binary operations (that can create new values derived from symbols), it checks if any of the operands are tainted symbols.  If so, it does one of the following:

- If the new value for that expression is a symbol, the analysis engine records that that new symbol was "derived" via taint analysis from the tracked tainted symbol.
- If the new value for that expression is not a symbol, the analysis engine creates a new symbol, constrains that symbol to be equal to the value computed for that expression, and then uses that new symbol as the value of the expression.  It then records that the new symbol was "derived" via taint analysis from the tracked tainted symbol."

We can have some variations here, but with this first cut essentially the checker would need no additional code to do the taint propagation.  It would be done entirely by the analysis engine.  This causes your checkPostStmt(BinaryOperator*) and checkPostStmt(UnaryOperator*) to completely vanish.

c) The analysis engine provides APIs to query if a symbol is tainted, and walk the tree of symbols from which it inherits its "tainted" classification.

II) Using the APIs above, the checker would consist of two parts:

- Use API (a) in checkPostStmt(CallExpr).  The checker would get the symbol returned by the API, and set it to the "Unchecked" state (per my example code above).
- Use API (c) in checkBranchCondition().  If the condition evaluates to a tainted symbol, we can explore the tree of symbols from which it derives it taint status and mark all symbols in that tree as being "checked".

The nice thing about this approach is that your checker is just in the business of tracking basic typestate, and (ideally) does minimal work to propagate the taint propagation (which can really only be done well in the analysis engine itself).
> I create a global DenseMap<const CallExpr*, CheckedReturnState*> to store whether a callexpr is checked or not.
This probably isn't the right approach, because we really want to shy away from global state.  It also makes a bunch of assumptions that might not be true.  For example, it assumes that functions are analyzed independently and that we do no inter-procedural analysis.  While that's what we mostly do right now, that isn't an invariant.

For example, consider the case where we do basic inter-procedural analysis where we "inline" function calls.  That is, when we see a function call, we continuing creating the ExplodedGraph by tracing into the called function, and then when we return from that function we continue analyzing the original function.  We then have one analysis pass, and the "global" state is associated with analyzing all of those functions.  What happens if the called function is the same function we are already analyzing?  E.g., we are analyzing foo(), and foo() calls itself.  This will cause the counts to be all messed up, as they will conflate two calls.  Is this what we want?  Maybe, but it should be a deliberate decision, not an accident.  Further, Checker objects now exist for the entirety of analyzing ALL functions in a translation unit.  This means that they are reused by multiple GRExprEngine instances.  This means that such global state really is a problem, as it is shared between multiple analysis runs.

Instead, I suggest we decompose things as follows:

- Provide a way to associate checker-specific state with a LocationContext.  A LocationContext can encapsulate the current "stack frame", and thus provides the "local" but "global" analysis state you are looking for.
- Using the "local" state I suggested, Incrementally build up the statistical counting as needed.  We only need to update the counts in two places:

a) in checkBranchCondition(), when we transition a tracked symbol to the "checked" state, we can increment the checked count for that symbol.
b) when we handle RemoveDeadSymbols and a symbol is in the "unchecked" state, we increment the unchecked count for that symbol.

We keep statistical counts for each LocationContext (or rather, StackFrameContext), and then do the post-analysis based on that.
> This patch didn't consider the deal with the escaped symbol.

We can handle this in a variety of ways.  We can add a new checker callback when symbols escape, and then have the Checker respond accordingly.

> The bugreport is very simple now...
That's fine.  That's polish.

> This checker seems much slower than other...
There are a couple reasons for this.  First, you are malloc'ing a ton of objects.  malloc() is really slow, and your checker now leaks like crazy.  All checker state should be allocated using the BumpPtrAllocator associated with GRStateManager.  Second, a semantic approach is inherently slower than a syntactic one, but far more general and thorough.  Third, because of the way you are tracking checker state, you have forced the static analyzer to be worst-case exponential.  There will be no state caching as there is no way to merge isomorphic states because of the freshly created DenseMaps.

> After all, there are many work should be done. I write this mail to ask whether this direction is right or not.

After writing my comments above, I realized this really is a big project, not just a new checker.  To do it semantically, we need a bunch of new infrastructure that currently doesn't exist.  I see two options:

a) We forge ahead on such infrastructure (for taint propagation in the analysis engine).  I can work with you to gradually make this happen, but this will take time.  Once that's in place, we use that new infrastructure to implement your checker using that infrastructure (or we do this in tandem).
b) We go (for now) with your original "syntactic" version of the checker, and fix the issues that would be common to both the semantic and syntactic versions.

I think we should do both.  The good thing about (b) is that it establishes a baseline.  That baseline allows us to write tests, etc.  It also allows you to not commit staying involved beyond where your interest or time peaks out.  With (a), we can keep (b) around as an entirely separate implementation, and gradually make progress on (a) until we feel it is good enough to replace (b).

I know this is a lot of data, and I feel I probably explained a few things very poorly.  At a high level, I think to do your checker "right" we need to add a bunch of infrastructure to the core analysis engine that doesn't exist.  Once that infrastructure exists, however, I think the checker itself should be fairly simple.

Thoughts?  Comments?  How would you like to proceed?

Cheers,
Ted
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110314/72fec4c7/attachment.html>