[cfe-dev] [analyzer] Summary IPA thoughts

Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org
Wed Oct 21 15:33:00 PDT 2015


> On Oct 21, 2015, at 4:50 AM, hao/NoQ via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> 
> Dear All,
> 
> We down there (mostly me and Alexey Sidorin, whom you might have also
> seen here in the lists), after some digging into the analyzer code,
> have coded a working proof-of-concept draft of a summary-based
> inter-procedural analysis implementation for the static analyzer.

This sounds interesting! If I understand correctly, your goal was to improve analyzer performance by avoiding inlining? Could we also use this high-level approach to enable cross-translation unit analysis?

> The
> code is still too rough and dirty to publish as a review, however, as
> tl;dr of the current status, we reached the point where:
> 
>  (good.1) The implementation doesn't crash on quite large test
> codebases, e.g. Android;
>  (good.2) It also doesn't seem to cause any noticeable increase in
> false positive rate;

This is interesting and suggests that levels of context sensitivity greater than 1 (as afforded by inlining) might not really add much in terms of reducing false positives.

>  (good.3) Path-sensitive bug reports are constructed more or less correctly.
> 
> Additionally, the implementation is not really extremely complicated
> or hard-wired together, so many things can be re-discussed and
> re-thought.
> 
> There are still many bad parts, but hopefully some of them will get sorted
> out in time:
> 
>  (bad.1) Doesn't yet offer as much performance gain, in
> bugs-per-second, as we initially expected;
>  (bad.2) Requires significant support on the checker side, unlike
> inlining, and most official checkers have not yet been updated; instead we
> were tweaking our own checkers, yet the method seems universal enough;
>  (bad.3) Could use some refactoring, and also needs a rebase onto the
> latest clang before we publish, because we have diverged quite a bit;
>  (bad.4) Certain features are not yet supported:
>    - Recursion is not yet modeled correctly;
>    - Taint analysis is not yet supported;
>    - Only default implementations of Store Manager and Constraint
> Manager are supported;

By this you mean the RegionStoreManager and RangeConstraintManager, right?

>    - Objective-C is not yet supported (unfortunately, none of us
> understands it well enough).
> 
> Now, the design we implemented as a quick proof-of-concept is roughly
> as follows:
>  (a) Analyze the function without any context, as top-level;
>  (b) Save its ExplodedGraph, which *is* *the* summary of the function;
>  (c) When modeling the call, take the final states in leaf nodes of
> its graph, and
>    (c.i) Rename the symbolic values inside them according to the context;
>    (c.ii) Merge the renamed range constraints into the current state;
>    (c.iii) Merge the GDM into the current state, with inevitable help
> from checkers;
>    (c.iv) Throw bug reports if a bug was revealed while applying the summary;
>    (c.v) Merge the renamed store bindings into the current state,
> including invalidation markers.
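
Just to check my understanding of the overall flow before the step-by-step details: is the apply loop roughly the pseudocode below? Every helper name here is made up on my end; this is only my reading of your description, not your API.

  // Pseudocode sketch of steps (a)-(c); every helper here is hypothetical.
  void applySummary(const CallEvent &Call, ProgramStateRef CallerState,
                    const ExplodedGraph &Summary) {
    for (const ExplodedNode *Leaf : summaryLeaves(Summary)) {          // (c)
      ProgramStateRef S =
          actualize(Leaf->getState(), Call, CallerState);              // (c.i)
      S = mergeConstraints(CallerState, S);                            // (c.ii)
      if (!S)
        continue; // contradiction: this summary branch is unreachable here
      S = runSummaryApplyCheckers(S, Leaf->getState(), Call);          // (c.iii)
      // (c.iv): bug reports, if any, are emitted from the checker callbacks.
      S = mergeStoreBindings(S, Leaf->getState(), Call);               // (c.v)
      addCallSummaryPostApplyNode(S);  // hypothetical
    }
  }
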
> 
> Step (a): We conduct a normal symbolic execution, as usual. When we
> encounter a call, we pause the current analysis until we collect the
> summary of the callee function. If we encounter recursion, we go for
> conservative evaluation. If normal analysis exceeds the node limit, we
> discard the summary and evaluate the function conservatively.
> Additionally, we add a few program state traits: most importantly,
> when SymbolReaper destroys a symbol, we need to keep its range in a
> separate trait, because it's a valuable piece of info on the function
> branch's reachability conditions, which we want to have reflected in the
> final node.
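
A quick sketch of how I imagine that "keep ranges of reaped symbols" trait, just to check my understanding -- the trait name is made up, and I'm assuming the constraint manager's range representation can be stashed in the GDM:

  // Hypothetical state trait: when SymbolReaper is about to drop a symbol,
  // remember its last known range so that the leaf nodes of the summary
  // still carry the branch's reachability condition.
  REGISTER_MAP_WITH_PROGRAMSTATE(ReapedSymbolRanges, SymbolRef, RangeSet)
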
> 
> Step (b): We simply don't delete the ExprEngine object, so that all
> entities, such as States and SVals, have no stale pointers to various
> managers and contexts inside them. Storing the whole graph, rather
> than only the leaf states, has the single purpose of reconstructing the
> path-sensitive bug report; it's not useful for anything else. Of
> course, it still offers a lot of room for memory optimizations (clean
> up unnecessary nodes and unnecessary program state traits, and that sort of
> stuff, or even delete the graph and re-construct it only if we need to
> actually throw the report).

You mentioned in a follow-up email that you see a ~5x increase in memory usage with your prototype. Do you have a sense of where this cost comes from? Is it keeping the summaries themselves?

> Step (c): In the ExplodedGraph of the caller, we create a node with a
> special program point called CallSummaryPreApply. Out of this node, we
> create new nodes for each reachable branch of the summary, that is,
> for every leaf node of the summary graph which is reachable in the
> current context. Reachability is determined on step (c.ii). After a few
> technical intermediate transitions, we reach another node, whose program
> point is CallSummaryPostApply.
> 
> Step (c.i): So far, we have done well without complete alpha-renaming
> support. Instead, all we needed was to recursively scan through the
> structure of the SVal and replace the context-dependent stuff stored
> deeply inside it with actual stuff, e.g. the SymbolRegionValue of a
> formal parameter with the actual argument value. Hence, to avoid confusion
> with full-featured alpha-renaming, we called this "actualization";
> it'd take extra work to see if we can actually generalize it into
> a good full-featured alpha-renaming mechanism (replace arbitrary
> values with arbitrary values inside whole states). This procedure
> is one of the most complicated parts of our work, as it needs to
> support the whole SVal hierarchy, but its implementation is also
> already quite isolated. For SymbolMetadata, we ask checkers to perform
> actualization through a callback, which seems reasonable. For
> SymbolConjured, we fake the block counts in order to avoid accidental
> collisions with existing symbols. Also, actualization isn't really a
> separate step in the process; it's performed regularly for all
> symbolic values on all steps while applying a summary.
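
To make sure I understand "actualization": for the simplest case of a by-value parameter, I'm imagining something like the following. The helper is hypothetical and only the leaf case is shown; I'm sure your implementation looks quite different.

  // Hypothetical sketch: rewrite a summary-side value into a caller-side
  // value. Shown only for the simplest leaf case, where the
  // SymbolRegionValue of a formal parameter becomes the corresponding
  // actual argument value at the call site.
  SVal actualize(SVal V, const CallEvent &Call) {
    if (SymbolRef Sym = V.getAsSymbol())
      if (const auto *RV = dyn_cast<SymbolRegionValue>(Sym))
        if (const auto *VR = dyn_cast<VarRegion>(RV->getRegion()))
          if (const auto *PVD = dyn_cast<ParmVarDecl>(VR->getDecl()))
            return Call.getArgSVal(PVD->getFunctionScopeIndex());
    // ... recursively rewrite SymIntExpr/SymSymExpr operands, region
    // sub-structure, SymbolConjured block counts, and so on.
    return V;
  }
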
> 
> Step (c.ii): Actualize the constraint keys, intersect the "new" and "old"
> ranges, throw away the branch if the intersection is empty, and otherwise
> constrain the actual symbol with the intersected range. As we
> mentioned on step (a), this requires a separate pass through the reaped
> symbols, which are stored separately.
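
For (c.ii), I'm picturing something like the helper below for the simplest case of a single contiguous range (a real range set can of course be a union of several ranges, and I'm assuming something like ProgramState::assumeInclusiveRange is usable here):

  // Hypothetical sketch: intersect a summary-side constraint [Lo, Hi] on
  // the actualized symbol with whatever the caller already knows. A null
  // result means this summary branch is unreachable in the current context
  // and should be thrown away.
  ProgramStateRef mergeOneConstraint(ProgramStateRef CallerState,
                                     SymbolRef ActualSym,
                                     const llvm::APSInt &Lo,
                                     const llvm::APSInt &Hi) {
    // 'true' = assume the value lies inside [Lo, Hi].
    return CallerState->assumeInclusiveRange(nonloc::SymbolVal(ActualSym),
                                             Lo, Hi, true);
  }
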
> 
> Step (c.iii): Call the checker callback, "eval::SummaryApply", in
> which checkers receive the final state in the callee and the current state
> in the caller, and figure out how to move significant stuff from the
> former to the latter. Checkers may add transitions or split states in
> this phase. Most of the time, they do three things:
>  - Carry information that changed in the callee over to the caller;
>  - Check things they were uncertain about during callee analysis;
> maybe they'll be sure now that they have a context (delayed checks);
>  - Carry delayed checks through to the caller if they could not be
> resolved anyway.
> Note that delayed checks would need to be manually stored in the state
> so that checkers could notice them during apply. So overall, if the
> checker has ~5 or so GDM traits, implementing a summary-apply
> procedure may be pretty complicated.

Can you tell us more about these delayed checks? I’m particularly wondering about type-state-like checks that don’t issue warnings on unknown values.

For example, suppose you had a checker like SimpleStreamChecker that issues a warning on a double close. If a symbol ref for a file descriptor is marked as closed in its state, then the checker issues a warning; but if the symbol ref is unknown, it just transitions it to the closed type state.

So if you had:

  FILE *f = ...
  fclose(f);
  myFunction(f);

where

  void myFunction(FILE *p) {
    fclose(p);
  }

Then with inlining you would get a diagnostic about a double close at fclose(p). But with a summary-based approach like the one you suggest, the unmodified SimpleStreamChecker wouldn't issue a warning. How would you recommend changing such a checker to handle summaries? Are there common patterns here that we could expose to checker writers?
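
To be concrete about the kind of pattern I'm wondering about: would the recommended change be for the checker to record a "delayed check" when it sees fclose() on a symbol it knows nothing about (e.g. a parameter), and then resolve it against the caller's state at summary-apply time? Something like the sketch below, reusing SimpleStreamChecker's StreamState -- the extra trait, the callback signature, and the actualizeSymbol() helper are all made up on my end, not your API:

  // Everything below is speculative; only StreamMap/StreamState/
  // reportDoubleClose come from the SimpleStreamChecker example.
  REGISTER_MAP_WITH_PROGRAMSTATE(StreamMap, SymbolRef, StreamState)
  REGISTER_SET_WITH_PROGRAMSTATE(DelayedCloseChecks, SymbolRef)

  // In checkPreCall, on fclose(Sym) when StreamMap has no entry for Sym
  // (e.g. it is a parameter), record a delayed check instead of silently
  // assuming the stream was open:
  //   State = State->add<DelayedCloseChecks>(Sym);
  //   State = State->set<StreamMap>(Sym, StreamState::getClosed());

  // At summary-apply time, resolve the delayed checks in the caller's
  // context (hypothetical callback and helper):
  void evalSummaryApply(ProgramStateRef CalleeFinal, ProgramStateRef &Caller,
                        const CallEvent &Call, CheckerContext &C) const {
    for (SymbolRef CalleeSym : CalleeFinal->get<DelayedCloseChecks>()) {
      SymbolRef ActualSym = actualizeSymbol(CalleeSym, Call); // made up
      const StreamState *SS = Caller->get<StreamMap>(ActualSym);
      if (SS && SS->isClosed()) {
        reportDoubleClose(ActualSym, Call, C);
        continue;
      }
      if (!SS) {
        // Still unknown in the caller: carry the delayed check up a level.
        Caller = Caller->add<DelayedCloseChecks>(ActualSym);
      }
      Caller = Caller->set<StreamMap>(ActualSym, StreamState::getClosed());
    }
  }

Is that close to what you had in mind, or is there a simpler idiom?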

> Step (c.iv): Path-sensitive bug reports are constructed by
> frankensteining the report path from different pieces of different
> exploded graphs, be it a report that originates from the callee (and
> hence is thrown on step (c.iii) from eval::SummaryApply), or an
> intermediate call that exited before the bug (which may be interesting
> for the bug reporter visitors). The only trouble we currently have is
> with the "Assuming..." pieces, which may be redundant, as no assumption
> was made; rather, the context determined the branch.

In some cases it seems like it ought to be possible to query the state at the "assuming…" piece to determine whether the context always implies that the assumption holds, and elide the piece if it is not needed. Does the context always determine the branch? Similarly, can there be cases where a path is dead (the branch could not be taken based on the actuals) but the actualized post-state does not have enough information to yield a contradiction? If I understand correctly, this is why you keep the ranges of dead symbols around? Would you also need to keep checker state for these symbols around so that checkers can detect contradictions when applying the summary?

> Step (c.v): Actualize the binding keys and values, and overwrite the
> existing values. Invalidation, as in writing some SymbolConjured to some
> memory space, is handled similarly, before all bindings. We do this step
> last, because it would affect actualization (e.g. a SymbolRegionValue
> should actualize itself to the value of the region before the call, not
> after the call; quite a sensitive point here, but it seems to work this way).

Given the region store’s differing representations of concrete and symbolic offsets, this sounds like it could be quite tricky. For example, does this work for parameter-dependent offsets in the summary post-state where in the summary the parameter is symbolic but at the call site the parameter is concrete? More generally, I’d love to hear more about how you apply summaries to the region store, how you identify the frame and the footprint, how you unify symbolic variables, etc. 
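
For example, just to illustrate the case I mean:

  void callee(int *buf, int i) {
    buf[i] = 1;      // summary post-state binds at a *symbolic* offset 'i'
  }

  void caller(int *buf) {
    callee(buf, 3);  // applying the summary must turn that into a binding
                     // at the *concrete* offset 3, i.e. buf[3]
  }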

Devin



