[cfe-dev] Several questions about UncheckedReturn statistic Checker

Sat Mar 19 07:36:08 PDT 2011

Hi Ted,

Thank you very much for your reply. It solved many problems that confusing
me.

Sorry for the misuse of the APIs and data structures. My intention of last
mail is to show what i am trying to do by code without changing the Engine
and the checker visit mechanism.

If you don't mention "taint analysis" in your email, i didn't realize that
yet. Yes, i think it's some sort of taint analysis.
IMO, some key points in this 'taint' analysis are:

source: return values from all the CallExpr.
sanitizer: the condition in a BranchCondition.
sink: when we handle RemoveDeadSymbols?(according to your mail)

And as you point, the 'taint' propagation ruls is the most important thing.
So i think this analysis is more or less different from the traditional
taint analysis.

More comments inline.

2011/3/15 Ted Kremenek <kremenek at apple.com>

>   On Mar 9, 2011, at 10:28 PM, 章磊 wrote:
>
> Hi clang,
>
> Sorry for this late reply.
>
> I reimplement UncheckedReturnChecker, here are some key points and some
> problems confusing me.
>
>
> Hi Lei,
>
> Thanks for continuing to push forward on this.  Comments inline.
>
>
>    1. Instead of the original syntactic approach, i track the symbolic
>    values in BinaryOperator & UnaryOperator. It means every BinaryOperator &
>    UnaryOperator will have a symbol to record the callexprs in subExpr. (So i
>    create many Conjured Symbol to do this, is this all alright?)
>
> I don't think this is quite the right approach, although your patch raises
> some interesting questions.  In principal, you shouldn't need to create any
> new conjured symbols at all.  Conjured symbols represent values to be
> tracked by the analyzer.  Conceptually, your checker only tracks meta-data
> associated with symbols.  However, what you are trying to do is a form of
> "taint" analysis, which may require introducing more symbols.
>

If i should not create conjured symbols, which symbol should i use.

>   A few specific comments:
>
>  +  // Once we encounter a call expr, set the initial state to Unchecked.
> +  if (SymbolRef Sym = V.getAsSymbol()) {
> +    llvm::DenseSet<StateRef> *CRStates = new llvm::DenseSet<StateRef>();
> +    CheckedReturnSet CRSet = CheckedReturnSet(CE, *CRStates);
> +    CRSet.addState(CheckedReturnState::getUnchecked(CE));
> +    state = state->set<CheckedReturnSetState>(Sym, CRSet);
> +    C.addTransition(state);
> +  }
>
>  DenseMaps should never be used as data stored by a GRState.  DenseMaps
> are not immutable, and your checker will now allocate a ton of memory and
> leak like crazy.  For data to be tracked for a given GRState, you must use
> immutable data structures such as ImmutableMap.  The reason is two fold:
>
> 1) ImmutableMaps can be reused in multiple instances of GRState, and we can
> compare (using pointer equality) if two ImmutableMaps are equivalent.  This
> allows us to do state caching, and also merge paths when two ExplodedNodes
> have the same ProgramPoint and GRState.
>
> 2) All memory associated with the ImmutableMaps (particularly those managed
> by the GDM) are region allocated and destroyed.  Such memory is allocated
> efficiently.  Using DenseMaps for per-GRState involves malloc()'ing a bunch
> of memory that just gets leaked.  That's really inefficient, and
> algorithmically not correct.
>

Errr, sorry about that.

>
>
   Instead, I'd expect to see something like:
>
> if (SymbolRef sym = V.getAsSymbol()) {
>   state = state->add<CheckedReturnState>(sym,
> CheckedReturnState::getUnchecked(CE))
>   C.addTransition(state);
> }
>
> That's a lot less code and doesn't allocate a ton of (mutable) memory.  Id
> then replace CheckedReturnState with some like the following:
>
> enum CheckReturnStateKind { Unchecked, Checked };
>  typedef
> llvm::ImmutableMap<SymbolRef, CheckReturnStateKind > CheckedReturnState;
> namespace clang { namespace ento {
>   template<>
>   struct GRStateTrait<CheckReturnState >
> : public GRStatePartialTrait< CheckedReturnState > {
>     static void* GDMIndex() { static int index = 0; return &index; }
>   };
> }}
>
> That reduces about 60 lines of code to 8, and then allows you to use
> CheckedReturnState as I illustrated above.  After reading through more of
> your code, I realize that this doesn't quite match up with your algorithm,
> so let's continue to discuss the algorithmic design and see how it matches
> up with the static analyzer APIs we have.
>
> Now let's look at checkPostStmt(BinaryOperator*):
>
> I think I see what you are trying to do.  Essentially:
>
> a) the checker tracks symbols returned from function calls
>
> b) binary/unary operations can create new values that are *derived* from
> symbols we are tracking.  When those new values are checked, we want to
> treat the original symbols as tracked, so we need a way to track that a
> value is derived from another.
>
> c) What are the "checked" propagation rules?  For example:
>
>   x = foo(); // new symbol
>   y = bar(); // new symbol
>   z = x + y; // new symbol, based on the symbolic values in 'x' and 'y'
>   if (z) ...
>
> Algorithmically, at 'if (z)', should the values in 'x' and 'y' be
> considered checked?  I'm not certain.  I think it really depends on your
> analysis and the "taint" propagation rules.
>

IMO, 'x' and 'y' should be considered checked. But there is no evidence, so
i think the 'taint' propagation rules needs to much more study and test.

>
>
    My gut feeling is that (b) should be managed by the analyzer engine, not
> a specific checker.  The reason is that many checkers may wish to do taint
> analysis (a general term for these kinds of problems), and track how a
> symbolic value "morphs" into other values.  (c) seems to be the purview of a
> specific checker.
>
> So here is how I'd expect the checker to be written:
>
> I) The core analysis engine provides a few key APIs for doing taint
> analysis that can be used by specific checkers.  Specifically:
>
> a) CheckerContext provides an API for a checker to indicate that it wants
> to start tracking the "taint" state of a given function return value.
>  Operationally, this would do the following:
>
> - If the return value is a symbol, the analysis engine adds that symbol to
> a list of symbols to be tracked for taint analysis.
> - If the return value is not a symbol, the analysis engine conjures a new
> symbol, constraints that symbol to have the value returned from the function
> call, and then adds that symbol to a list of symbols to be tracked for taint
> analysis
>
> b) When the analysis engine evaluates unary or binary operations (that can
> create new values derived from symbols), it checks if any of the operands
> are tainted symbols.  If so, it does one of the following:
>
> - If the new value for that expression is a symbol, the analysis engine
> records that that new symbol was "derived" via taint analysis from the
> tracked tainted symbol.
> - If the new value for that expression is not a symbol, the analysis engine
> creates a new symbol, constrains that symbol to be equal to the value
> computed for that expression, and then uses that new symbol as the value of
> the expression.  It then records that the new symbol was "derived" via taint
> analysis from the tracked tainted symbol."
>
> We can have some variations here, but with this first cut essentially the
> checker would need no additional code to do the taint propagation.  It would
> be done entirely by the analysis engine.  This causes your
> checkPostStmt(BinaryOperator*) and checkPostStmt(UnaryOperator*) to
> completely vanish.
>
> c) The analysis engine provides APIs to query if a symbol is tainted, and
> walk the tree of symbols from which it inherits its "tainted"
> classification.
>

How the Engine works depends on how we define the propagation rules. If we
wanna make checkPostStmt(BinaryOperator *) and checkPostStmt(UnaryOperator
*) to completely vanish, we must implement those rules in the engine. But we
may have different propagation rules, like traditional one and what we have
in UncheckedReturnChecker. Should all this rules work in the engine? Or
maybe we need have some callback for propagation?

>
>
    II) Using the APIs above, the checker would consist of two parts:
>
> - Use API (a) in checkPostStmt(CallExpr).  The checker would get the symbol
> returned by the API, and set it to the "Unchecked" state (per my example
> code above).
> - Use API (c) in checkBranchCondition().  If the condition evaluates to a
> tainted symbol, we can explore the tree of symbols from which it derives it
> taint status and mark all symbols in that tree as being "checked".
>
> The nice thing about this approach is that your checker is just in the
> business of tracking basic typestate, and (ideally) does minimal work to
> propagate the taint propagation (which can really only be done well in the
> analysis engine itself).
>

Yes, i think this is the right apprach.
Thank you very much for pointing this. I knew something wrong with my patch,
but i couldn't tell it. So i didn't go any further, and wrote last mail for
help.

>
>    1. I create a global DenseMap<const CallExpr*, CheckedReturnState*> to
>    store whether a callexpr is checked or not.
>
> This probably isn't the right approach, because we really want to shy away
> from global state.  It also makes a bunch of assumptions that might not be
> true.  For example, it assumes that functions are analyzed independently and
> that we do no inter-procedural analysis.  While that's what we mostly do
> right now, that isn't an invariant.
>
> For example, consider the case where we do basic inter-procedural analysis
> where we "inline" function calls.  That is, when we see a function call, we
> continuing creating the ExplodedGraph by tracing into the called function,
> and then when we return from that function we continue analyzing the
> original function.  We then have one analysis pass, and the "global" state
> is associated with analyzing all of those functions.  What happens if the
> called function is the same function we are already analyzing?  E.g., we are
> analyzing foo(), and foo() calls itself.  This will cause the counts to be
> all messed up, as they will conflate two calls.  Is this what we want?
>  Maybe, but it should be a deliberate decision, not an accident.  Further,
> Checker objects now exist for the entirety of
>

Yes, you are right. I forget about this.

>  analyzing ALL functions in a translation unit.  This means that they are
> reused by multiple GRExprEngine instances.  This means that such global
> state really is a problem, as it is shared between multiple analysis runs.
>

Errr, I didn't make this clear. IMO, this global DenseMap<const CallExpr*,
CheckedReturnState*> is used by the whole translation unit. All the
CheckedReturnStates in different GRExprEngine instances are record here, and
do the statistica count after all the function handled.
However, using a global state like this is indeed wierd. And i will move on
as you suggested.

>
> Instead, I suggest we decompose things as follows:
>
> - Provide a way to associate checker-specific state with a LocationContext.
>  A LocationContext can encapsulate the current "stack frame", and thus
> provides the "local" but "global" analysis state you are looking for.
> - Using the "local" state I suggested, Incrementally build up the
> statistical counting as needed.  We only need to update the counts in two
> places:
>
> a) in checkBranchCondition(), when we transition a tracked symbol to the
> "checked" state, we can increment the checked count for that symbol.
> b) when we handle RemoveDeadSymbols and a symbol is in the "unchecked"
> state, we increment the unchecked count for that symbol.
>
> We keep statistical counts for each LocationContext (or rather,
> StackFrameContext), and then do the post-analysis based on that.
>

OK, this souds much better than what i did, and i need to read more code
about LocationContext.

>
>    1. This patch didn't consider the deal with the escaped symbol.
>
>
> We can handle this in a variety of ways.  We can add a new checker callback
> when symbols escape, and then have the Checker respond accordingly.
>
>
>    1. The bugreport is very simple now...
>
> That's fine.  That's polish.
>
>
>    1. This checker seems much slower than other...
>
> There are a couple reasons for this.  First, you are malloc'ing a ton of
> objects.  malloc() is really slow, and your checker now leaks like crazy.
>  All checker state should be allocated using the BumpPtrAllocator associated
> with GRStateManager.  Second, a semantic approach is inherently slower than
> a syntactic one, but far more general and thorough.  Third, because of the
> way you are tracking checker state, you have forced the static analyzer to
> be worst-case exponential.  There will be no state caching as there is no
> way to merge isomorphic states because of the freshly created DenseMaps.
>

Thank you for pointing these out.

>
> After all, there are many work should be done. I write this mail to ask
> whether this direction is right or not.
>
>
> After writing my comments above, I realized this really is a big project,
> not just a new checker.  To do it semantically, we need a bunch of new
> infrastructure that currently doesn't exist.  I see two options:
>
> a) We forge ahead on such infrastructure (for taint propagation in the
> analysis engine).  I can work with you to gradually make this happen, but
> this will take time.  Once that's in place, we use that new infrastructure
> to implement your checker using that infrastructure (or we do this in
> tandem).
> b) We go (for now) with your original "syntactic" version of the checker,
> and fix the issues that would be common to both the semantic and syntactic
> versions.
>
> I think we should do both.  The good thing about (b) is that it establishes
> a baseline.  That baseline allows us to write tests, etc.  It also allows
> you to not commit staying involved beyond where your interest or time peaks
> out.  With (a), we can keep (b) around as an entirely separate
> implementation, and gradually make progress on (a) until we feel it is good
> enough to replace (b).
>
> I know this is a lot of data, and I feel I probably explained a few things
> very poorly.  At a high level, I think to do your checker "right" we need to
> add a bunch of infrastructure to the core analysis engine that doesn't
> exist.  Once that infrastructure exists, however, I think the checker itself
> should be fairly simple.
>
> Thoughts?  Comments?  How would you like to proceed?
>

I think option(a) is what i want. But to implement it, i need to study
the the taint analysis theory and find out the propagation rules we should
establish for this checker. And i can't guarantee the time spent on it.
So as you said, we go with (b) first, and then gradually make (a) happen.

I will go back the with the original "syntactic" version of the checker, and
see what i can do about (a) meanwhile.

What you say?

>
> Cheers,
> Ted
>

-- 
Best regards!

Lei Zhang
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110319/f1faecbf/attachment.html>