[cfe-dev] [analyzer] Summary IPA thoughts

Wed Oct 21 04:50:20 PDT 2015

 Dear All,

We down there (mostly me and Alexey Sidorin, whom you might have also
seen here in the lists), after some digging into the analyzer code,
have coded a working proof-of-concept draft of summary-based
inter-procedural analysis implementation for the static analyzer. The
code is still too rough and dirty to publish as a review, however, as
tl;dr of the current status, we reached the point where:

  (good.1) The implementation doesn't crash, on quite large test
codebases, eg. Android;
  (good.2) It also doesn't seem to cause any noticable increase in
false positive rate;
  (good.3) Path-sensitive bug reports are constructed more or less correctly.

Additionally, the implementation is not really extremely complicated
or hard-wired together, so many things can be re-discussed and
re-thought.

The bad parts, still many, but hopefully some of them would get sorted
out in time:

  (bad.1) Doesn't yet offer as much performance gain, in
bugs-per-second, as we initially expected;
  (bad.2) Requires significant support on checker side, unlike
inlining, and most official checkers were not yet updated, instead we
were tweaking our own checkers, yet the method seems universal enough;
  (bad.3) Could use some refactoring and also needs rebase on the
latest clang, cause we diverged quite a bit, before we publish;
  (bad.4) Certain features are not yet supported:
    - Recursion is not yet modeled correctly;
    - Taint analysis is not yet supported;
    - Only default implementations of Store Manager and Constraint
Manager are supported;
    - Objective-C is not yet supported (unfortunately, none of us
understands it well enough).

Now, the design we implemented as a quick proof-of-concept is roughly
as follows:
  (a) Analyze the function without any context, as top-level;
  (b) Save its ExplodedGraph, which *is* *the* summary of the function;
  (c) When modeling the call, take the final states in leaf nodes of
its graph, and
    (c.i) Rename the symbolic values inside them according to the context;
    (c.ii) Merge the renamed range constraints into the current state;
(c.iii) Merge the GDM into the current state, with inevitable help
from checkers;
    (c.iv) Throw bug reports if a bug was revealed while applying the summary;
    (c.v) Merge the renamed store bindings into the current state,
including invalidation markers.

Step (a): We conduct a normal symbolic execution, as usual. When we
encounter a call, we pause current analysis until we collect the
summary of the callee function. If we encounter recursion, we go for
conservative evaluation. If normal analysis exceeds node limit, we
discard summary and evaluate the function conservatively.
Additionally, we add a few program state traits: most importantly,
when SymbolReaper destroys a symbol, we need to keep its range in a
separate trait, because it's a valuable piece of info on function
branch reachibility conditions, which we want to have reflected in the
final node.

Step (b): We simply don't delete the ExprEngine object, so that all
entities, such as States and SVals, had no stale pointers to various
managers and contexts inside them. Storing the whole graph, rather
than only leaf states, has a single purpose of reconstructing the
path-sensitive bug report; it's not useful for anything else. Of
course, it still offers a lot of room for memory optimizations (clean
unnecesary nodes, unnecessary program state traits, and that sort of
stuff, or even delete the graph and re-construct it only if we need to
actually throw the report).

Step (c): In ExplodedGraph of the caller, we create a node with a
special program point called CallSummaryPreApply. Out of this node, we
create new nodes for each reachable branch of the summary, that is,
for every leaf node of the summary graph which is reachable in current
context. Reachability is determined on step (c.ii). After a few
technical intermediate transitions, we reach another node, program
point of which is CallSummaryPostApply.

Step (c.i): So far, we did well without complete alpha-renaming
support. Instead, all we needed was to recursively scan through the
structure of the SVal and replace context-dependent stuff stored
deeply inside them with actual stuff, eg. SymbolRegionValue of a
formal parameter with actual argument value. Hence, to avoid confusion
with full-featured alpha-renaming, we called this "actualization";
it'd take extra work to see if we can actually generalize it to create
a good full-featured alpha-renaming mechanism (replace arbitrary
values with arbitrary values inside the whole states). This procedure
is one of the most complicated parts of our work, as it needs to
support the whole SVal hierarchy, and its implementation is also
already quite isolated. For SymbolMetadata, we ask checkers to perform
actualization through a callback, which seems reasonable. For
SymbolConjured, we fake the block counts in order to avoid accidental
collisions with existing symbols. Also, actualization isn't really a
separate step in the process; it's performed regularly for all
symbolic values on all steps while applying a summary.

Step (c.ii): Actualize constraint keys, intersect "new" and "old"
ranges, throw away the branch if intersection is empty, otherwise
constrain the actual symbol with the intersected range. As we
mentioned on step (a), this requires a separate pass through reaped
symbols which are stored separately.

Step (c.iii): Call the checker callback, "eval::SummaryApply", in
which checkers receive the final state in callee and the current state
in caller, and figure out how to move significant stuff from the
former to the latter. Checkers may add transitions or split states on
this phase. Most of the time, they do three things: - Carry
information that changed in the caller to the callee; - Check stuff in
which they were uncertain during callee analysis, maybe they'll be
sure now that they have a context (delayed checks); - Carry delayed
checks through to the callee if they could not be resolved anyway.
Note that delayed checks would need to be manually stored in the state
so that checkers could note them during apply. So overally, if the
checker has ~5 or so GDM traits, implementing a summary-apply
procedure may be pretty complicated.

Step (c.iv): Path-sensitive bug reports are constructed by
frankensteining the report path from different pieces of different
exploded graphs, be it a report that originates from the callee (and
hence thrown on step (c.iii) from eval::SummaryApply), or an
intermediate call that exited before the bug (which may be interesting
for the bug reporter visitors). The only trouble we currently have is
the "Assuming..." pieces, which may be redundant, as no assumption was
made, but rather the context determined the branch.

Step (c.v): Actualize the binding keys and values, overwrite existing
values. Invalidation, as in writing some SymbolConjured to some memory
space, is handled similarly. before all bindings. We do this last,
because it would affect actualization (eg. SymbolRegionValue should
actualize itself to the value of the region before the call, not after
the call; quite a sensitive point here, but seems to work this way).

So our current plan, in case nobody minds, would be:
  (1) Implement an SValVisitor, which may be useful for other
applications as well, as a separate patch;
  (2) Try to implement a full-featured alpha-renaming support, which
we almost have, on top of SValVisitor, probably add some useful
features to the constraint manager, as a separate patch;
  (3) Try to port the rest of the summary code, with heavy discussions
on fixing existing problems, some of which were listed above;
  (4) Dunno, maybe something else shows up, probably more ways to
split the patch into smaller pieces.

We'd be extremely greatful for [and promise to listen carefully to]
any feedback!

Thanks in advance, Artem [and Alexey].