[cfe-dev] [analyzer] Summary IPA thoughts

Artem Dergachev via cfe-dev cfe-dev at lists.llvm.org
Thu Oct 22 05:11:01 PDT 2015


Wow, that's a lot of stuff!
Thanks a lot for all your encouragement!


=== Regarding inter-unit and purposes: ===


Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
 > Is this within a translation unit? Are you planning to make this
 > method usable across translation units?

Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
 > This sounds interesting! If I understand correctly, your goal was to
 > improve analyzer performance by avoiding inlining? Could we also use
 > this high-level approach to enable cross-translation unit analysis?

Ted Kremenek via cfe-dev cfe-dev at lists.llvm.org wrote:
 > Is the participation needed for serializing out state if we want to
 > make these summaries persistent?


We do eventually aim for cross-translation-unit analysis as well, 
though that is a somewhat separate story; I guess Alexey has already 
covered it. Apart from that, of course, we also want to improve the 
time performance of path-sensitive checks within a single translation 
unit: this matters when the analyzer is integrated into the build 
systems of large projects, so we wish to reduce analysis overhead in 
such use cases.

We have not yet tried any sort of serialization of summaries. Once we 
do, then, yeah, we would most likely need extra checker-side support to 
serialize their data.


=== Regarding the checker-side support: ===


Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
 > What makes the help from the checkers inevitable here?

Ted Kremenek via cfe-dev cfe-dev at lists.llvm.org wrote:
 > What kind of participation is needed from a checker author to
 > participate in this design? Since it is based on the ExplodedGraph,
 > it would seem their participation would be automatic.

Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
 > Can you tell us more about these delayed checks? I’m particularly
 > wondering about type-state-like checks that don’t issue warnings on
 > unknown values.
 >
 > For example, suppose you had a checker like SimpleStreamChecker that
 > issues a warning on a double close. If a symbol ref for a file
 > descriptor is marked as closed in its state then the checker issues a
 > warning, but if the symbol ref is unknown, it just transitions it to
 > the closed type state.
 >
 > So if you had:
 >
 > FILE *f = ...
 > fclose(f)
 > myFunction(f);
 >
 > where
 >
 > void myFunction(FILE *p) {
 > fclose(p)
 > }
 >
 > Then with inlining you would get a diagnostic about a double close at
 > fclose(p). But with a summary-based approach like you suggest the
 > unmodified SimpleStreamChecker wouldn’t issue a warning. How would
 > you recommend changing such a checker to handle summaries? Are there
 > common patterns here that we could expose to checker writers?


Precisely: the code sample provided by Devin, together with the 
"opposite" sample:
   FILE *f = ...;
   myFunction(f);
   fclose(f);
is the simplest motivational example showing that checker-side support 
is inevitable. There is no defect in myFunction() itself; however, it 
changes the GDM (which is visible only to checkers) in a way that causes 
the warning.

We did plan to make some "How to add summary IPA support to your checker 
in 24 hours" how-to :)

The modification our approach proposes to let SimpleStreamChecker 
conduct its double-close checks with summaries is as follows:

   1. Store a list of all open/close operations in the program state as 
a separate list-type trait, with node pointers, pointer symbols, and 
operation types.
   2. In the summary-apply callback, iterate over this list found in the 
summary leaf, actualize the symbols, and re-model the operations on top 
of the caller state:
     2a. See if the operation is valid, and report a double close if not; 
supply the stored node to the bug reporter so that it can glue the 
report path together in Devin's example and throw the report against 
the correct node; storing nodes is also useful for implementing bug 
visitors.
     2b. Mark the relevant symbols as open or closed in the caller 
context. This lets us handle the "opposite" example, which would throw 
the report in the caller context after the changes in the GDM made by 
the callee.
     2c. Repeat these changes in the caller's operation list, so that 
deeper call stacks are supported. If a bug report is constructed 
through a deep call stack, the checker needs to provide the BugReporter 
with a stack of nodes.
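The replay in steps 1-2c can be sketched outside the actual checker API 
as a simplified standalone model. All names here (`StreamOp`, 
`applySummaryOps`, integer symbol ids, the plain `std::map` standing in 
for the program-state traits) are hypothetical, not the real analyzer 
interfaces:

```cpp
#include <cassert>
#include <map>
#include <vector>

// Simplified model of the per-checker operation list stored in the GDM
// (step 1): each recorded open/close carries the callee-side symbol.
enum class OpKind { Open, Close };
struct StreamOp {
  int Sym;     // callee-side symbol id (stands in for a SymbolRef)
  OpKind Kind;
};

enum class StreamState { Opened, Closed };

// Step 2: re-model the callee's recorded operations on top of the caller
// state. Returns true if a double close was detected (step 2a); otherwise
// the caller state is updated in place (steps 2b/2c).
bool applySummaryOps(std::map<int, StreamState> &CallerState,
                     const std::vector<StreamOp> &SummaryOps,
                     const std::map<int, int> &Actualize) {
  for (const StreamOp &Op : SummaryOps) {
    // Actualize the callee symbol into a caller symbol.
    auto It = Actualize.find(Op.Sym);
    int CallerSym = (It != Actualize.end()) ? It->second : Op.Sym;
    if (Op.Kind == OpKind::Close) {
      auto St = CallerState.find(CallerSym);
      // Step 2a: the stream is already closed -- double close.
      if (St != CallerState.end() && St->second == StreamState::Closed)
        return true;
      // Step 2b: mark it closed in the caller context.
      CallerState[CallerSym] = StreamState::Closed;
    } else {
      CallerState[CallerSym] = StreamState::Opened;
    }
  }
  return false;
}
```

A double close fires when a replayed close hits a symbol the caller 
context already marked closed (Devin's example); the "opposite" example 
works because the replay also updates the caller state.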

Well, roughly, these are the needed changes. Most checkers that work 
with the GDM in a similar manner would require very similar facilities; 
in fact, this approach worked for ~6 pretty checkers we tried it on, 
some of which were pretty weird (e.g. statistical analysis of the whole 
graph). (We measure productivity on only two of them, because the other 
checkers are too quiet.)

It's not as scary as it sounds: probably ~20 lines of code per checker 
with a small number of GDM entries.

On how to avoid modifying checkers: it might be possible to simplify 
this procedure significantly, but only by throwing away the idea of the 
opaque *Generic* data map in the state and replacing it with a limited 
set of structures, so that most of the actualization could be handled by 
the analyzer engine. This approach seems worse, as it greatly decreases 
flexibility and limits the number of checkers we can implement without 
adding stuff into the core, so we quickly realized that it's not really 
worth trying.

So yes, I think modifying checkers is inevitable. They need to keep 
their own summary of the branch and need to know how to apply it.

To-do for myself: we haven't yet had a chance to think deeply about 
leak-related checks, e.g. the lifetime of symbols created during 
actualization, how symbols leak through or inside summaries, or perhaps 
avoiding the actualization of symbols that would die instantly.


=== Regarding the ranges and reachability: ===


Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
 > I think reachability is a very important question. How do you
 > associate the symbols, the constraints and the branches?

Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
 > In general, it is great, to have a summary based approach to
 > interprocedural analysis, but could you list the high level goals of
 > your approach and what improvements are you expecting?

Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
 > One consequence of analyzing the function at the top level is that
 > the summary produced makes no assumptions about context. This means
 > the summary is always applicable (i.e., the precondition holds) but
 > also that applying it may yield many more paths than are actually
 > feasible.

Yeah, so here is what we do. We consider each non-sink[1] leaf in the 
ExplodedGraph of the summary separately. In this leaf, we take all the 
info stored by RangeConstraintManager: ranges on all symbols, including 
ranges on dead symbols. These are *the* pre-conditions[2] for the 
branch. Unless we have sinks, the disjunction of these pre-conditions 
over all branches is `true'[3].

For every branch, since RangeConstraintManager maps symbols to ranges, 
we actualize these symbols according to the context. After this, we 
search for ranges on the same symbols in the caller context. If any of 
the caller's ranges contradict (do not overlap) the actualized ranges, 
we throw away the whole branch. Hence, we avoid applying impossible 
branches. This is done before any other work on the branch (e.g. checker 
or GDM calls), so it saves time.
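The branch-pruning step can be sketched as follows. `Range` and 
`pruneBranch` are hypothetical names, and a single closed interval 
stands in for the analyzer's RangeSet, which is really a set of 
intervals:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <optional>

// A single closed interval standing in for the analyzer's RangeSet
// (simplification: the real RangeSet holds a set of intervals).
struct Range { long Lo, Hi; };

// Intersect a branch's actualized pre-condition ranges with the ranges
// already known in the caller context. An empty intersection on any
// symbol means the branch is infeasible and is discarded up front,
// before any checker or GDM work is done on it.
std::optional<std::map<int, Range>>
pruneBranch(const std::map<int, Range> &CallerRanges,
            const std::map<int, Range> &BranchPrecond) {
  std::map<int, Range> Out = CallerRanges;
  for (const auto &KV : BranchPrecond) {
    auto It = CallerRanges.find(KV.first);
    if (It == CallerRanges.end()) {
      Out[KV.first] = KV.second; // previously unconstrained symbol
      continue;
    }
    long Lo = std::max(It->second.Lo, KV.second.Lo);
    long Hi = std::min(It->second.Hi, KV.second.Hi);
    if (Lo > Hi)
      return std::nullopt;       // ranges do not overlap: drop the branch
    Out[KV.first] = {Lo, Hi};    // replace with the intersection
  }
  return Out;
}
```

The same intersect-or-insert logic also models how the resulting 
caller-context state is built: new (symbol, range) pairs are added, 
existing ones are replaced by the intersection.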

It should be possible to optimize this procedure by joining branches 
with completely equal range information[4], though we haven't coded 
this yet.

This procedure interacts with delayed checks in a fancy manner; namely, 
it opens up warnings like:
   void bar(int *b) {
     foo(b); // [1] calling foo()
   }
   void foo(int *a) {
     *a = 0;
     if (a == 0) {
       // [2] warn: null-dereference
     }
   }
during summary application, because constraints on 'a' would be present 
in the summary's input constraints. It might or might not be worth 
suppressing such warnings, because they may be really unobvious; 
alternatively, we could modify the bug reporter to make them more 
obvious by drawing the later path through the callee.

/* [1] Just realized that omitting sink branches is actually not a 
sound decision. Other checkers may have delayed checks inside these 
branches, which would cause a warning before the sink, and we lose that 
warning with our approach. */

/* [2] It is possible that checkers add their own checker-specific 
pre-conditions, but we haven't yet encountered any sensible practical 
use cases. */

/* [3] Note that if the summary is incomplete (e.g. we exceeded 
max-nodes or the block count for the summary), then we fall back to 
conservative evaluation, so all branches are either complete or sinks by 
the time we actually apply a summary. */

/* [4] This would most likely (though not always) mean that they only 
differ in their GDM information, e.g. the result of a state split done 
by checkers. */


=== Regarding performance: ===


Anna Zaks via cfe-dev cfe-dev at lists.llvm.org wrote:
 > You can also measure the number of basic blocks visited if you want
 > to compare coverage. Take a look at debug.Stats checker and
 > -analyzer-stats option. These might give you a better understanding
 > of what causes the slowdown.

That's a great idea; we didn't try that yet. We tried to estimate the 
node count including the "summarized" nodes, and got some funny numbers 
(e.g. the summary-based analysis visits ~20 times more nodes per 
second).


Anna Zaks via cfe-dev cfe-dev at lists.llvm.org wrote:
 > We use various heuristics for balancing precision and performance for
 > inlining. We do not actually inline all functions. This is why you
 > see more reports with your summaries. Theoretically, inlining should
 > always be more precise than any summary. There are several parameters
 > controlling this apart from max-nodes:
 > - We do not inline functions that are too large: -analyzer-config
 > max-inlinable-size (50 basic blocks).
 > - We do not inline large functions (min-cfg-size-treat-functions-
 > as-large) too many times (max-times-inline-large).

We hooked directly into inlineCall(), so I guess we should be ipa-calling 
the same set of functions. However, we haven't yet worked on the 
BifurcateCall mechanism.


Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
 > You mentioned in a follow-up email that you see a ~5x increase in
 > memory usage with your prototype. Do you have a sense of where this
 > cost comes from? Is it keeping the summaries themselves?

Partially from storing summaries, and partially from the fact that we 
actually do a lot more analysis: with the same max-nodes, we go through 
many more paths and create heavier nodes with more stuff in their 
states. So it's not really correct to compare analysis runs for summary 
and inlining with the same max-nodes: by design, summary would be slower 
and heavier in that case, while finding deeper defects.


Ted Kremenek via cfe-dev cfe-dev at lists.llvm.org wrote:
 > You can possibly trim the summary down substantially by doing a BFS
 > from the exit/error nodes of the graph to the entrance nodes. We use
 > this trick when displaying trimmed graphs in GraphViz to show a path
 > to an error node. That could result in a massive savings.

Yeah, that's one more possible memory usage optimization, thanks for the 
idea! Other things that we tried include: removing summary-irrelevant 
program state traits in checkEndAnalysis; reclaiming exploded graph 
nodes more aggressively than usual once summary construction is over; 
and re-using memory allocators across analyses (most importantly the 
exploded node allocator, since otherwise freeing nodes doesn't save any 
memory).


=== Miscellaneous: ===


Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
 > > Step (c.ii): Actualize constraint keys, intersect "new" and "old"
 > > ranges, throw away the branch if intersection is empty, otherwise
 > > constrain the actual symbol with the intersected range. As we
 > > mentioned on step (a), this requires a separate pass through reaped
 > > symbols which are stored separately.
 > Does this modify the summary? Or do you make a copy?

We create a new ProgramState that is based on the original 
caller-context state, with new range constraints for the symbols: we 
either add a new (symbol, range) pair, if the symbol was not yet 
constrained, or replace the range with the computed intersection.


Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
 > > Step (c.iv): Path-sensitive bug reports are constructed by
 > > frankensteining the report path from different pieces of different
 > > exploded graphs, be it a report that originates from the callee (and
 > > hence thrown on step (c.iii) from eval::SummaryApply), or an
 > > intermediate call that exited before the bug (which may be
 > > interesting for the bug reporter visitors). The only trouble we
 > > currently have is the "Assuming..." pieces, which may be redundant,
 > > as no assumption was made, but rather the context determined the
 > > branch.
 > In some cases it seems like it ought to be possible to query the
 > state at “assuming…” to determine whether the context always implies
 > that the assumption holds and elide the piece if not needed. Does the
 > context always determine the branch? Similarly, can there be
 > cases where a path is dead (the branch could not be taken based on
 > the actuals) but the actualized post-state does not have enough
 > information to yield a contradiction? If I understand
 > correctly, this is why you keep the ranges of dead symbols around?
 > Would you also need to keep checker state for these symbols
 > around so they can detect contradictions when applying the summary?

Yeah, this problem can be solved that way! We just haven't taken up this 
task yet. However, storing ranges for dead symbols is inevitable anyway, 
as there may be a lot of significant pre-conditions (e.g. the function 
relies on the value of a global variable and then overwrites it; the 
value symbol of this variable dies, but we still need to discard 
branches if we know something about this symbol in the caller context).


Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
 > > - Only default implementations of Store Manager and Constraint
 > > Manager are supported;
 > By this you mean the RegionStoreManager and RangeConstraintManager,
 > right?

Yes, right. We haven't yet thought deeply about how exactly to abstract 
away from the fact that RangeConstraintManager uses symbol-range pairs 
and RegionStoreManager uses base-offset-value triples, i.e. what sort of 
virtual functions should be defined in the abstract ConstraintManager 
and StoreManager classes in order to stay as flexible as possible.


Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
 > > Step (c.v): Actualize the binding keys and values, overwrite
 > > existing values. Invalidation, as in writing some SymbolConjured to
 > > some memory space, is handled similarly. before all bindings. We do
 > > this last, because it would affect actualization (eg.
 > > SymbolRegionValue should actualize itself to the value of the
 > > region before the call, not after the call; quite a sensitive point
 > > here, but seems to work this way).
 > Given the region store’s differing representations of concrete and
 > symbolic offsets, this sounds like it could be quite tricky. For
 > example, does this work for parameter-dependent offsets in the
 > summary post-state where in the summary the parameter is symbolic but
 > at the call site the parameter is concrete? More generally, I’d love
 > to hear more about how you apply summaries to the region store, how
 > you identify the frame and the footprint, how you unify symbolic
 > variables, etc.

Unfortunately, we have not yet handled symbolic offsets in the store 
properly, and some parts of this actually make for a nice problem (it is 
also surprising that we haven't run into too many issues because of this 
yet).

For concrete offsets, we had to extend StoreManager::BindingsHandler to 
provide complete information about the binding keys, not just the base. 
Then:
   1. Actualize the base of the binding.
   2. Split the actualized binding base into its own base and offset, 
because a base region may actualize into a non-base region.
   3. *Add* the offset of the actualized base to the original offset of 
the binding, obtaining the complete offset of the actualized binding.
   4. Actualize the store value and bind it to the actualized binding 
key, possibly replacing an existing binding.

We also skip bindings to stack locals, and discriminate between direct 
and default bindings. Another important thing to note is that when we 
actualize all bindings, we use the original store (without any new 
bindings) as the reference.
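The concrete-offset key transformation can be sketched like this; all 
names here (`BindingKey`, `ActualBase`, `actualizeKey`, integer region 
ids) are hypothetical stand-ins for the real region-store machinery:

```cpp
#include <cassert>
#include <map>

// Model of a concrete-offset binding key: a base-region id plus a byte
// offset (standing in for the region store's binding keys).
struct BindingKey { int Base; long Offset; };

// A callee base region may actualize into a non-base region, i.e. into
// a caller (base, offset) pair.
struct ActualBase { int Base; long Offset; };

// Actualize the base of the binding, split the result into base and
// offset, and *add* that offset to the binding's original offset,
// yielding the complete offset of the actualized binding.
BindingKey actualizeKey(BindingKey K,
                        const std::map<int, ActualBase> &Act) {
  auto It = Act.find(K.Base);
  if (It == Act.end())
    return K; // base did not actualize; key is unchanged
  return {It->second.Base, It->second.Offset + K.Offset};
}
```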

For symbolic offsets in the summary, I guess we should do a similar 
thing. Take the binding key, which is a base region plus a sub-region 
representing a symbolic offset. Discard the base region, take only the 
offset region, and actualize it. Then split the actualized region into a 
base and an offset, which can be either concrete or symbolic. The 
actualization procedure automatically replaces a symbolic index inside 
any element region with a concrete index if the symbolic index is 
constrained to a constant or actualizes into a constant. Additionally, 
if we have (a+i)+j, where 'a' is an array, '+i' is done in the caller 
context, and '+j' is done in the callee context, then the actualization 
procedure converts it automatically into a+(i+j) (it detects such cases 
when it tries to actualize the super-region of an ElementRegion and 
obtains another ElementRegion as an answer), which makes it easy to add 
symbolic offsets where necessary. Then actualize the value and bind it 
to the actual key.
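The (a+i)+j to a+(i+j) folding can be modeled, again with hypothetical 
names and with concrete indices only (the symbolic case is described in 
the text), as:

```cpp
#include <cassert>
#include <map>

// Model of an ElementRegion a[Index] with a concrete index.
struct ElemRegion { int Base; long Index; };

// Folding nested element regions during actualization: if the callee's
// base actualizes into an element region a+i, then the callee-side
// element (base)+j becomes a+(i+j).
ElemRegion actualizeElem(int CalleeBase, long J,
                         const std::map<int, ElemRegion> &Act) {
  auto It = Act.find(CalleeBase);
  if (It != Act.end())
    return {It->second.Base, It->second.Index + J}; // fold the indices
  return {CalleeBase, J};
}
```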

Well, the real problem arises when we have a concrete offset in the 
summary, but the actualized base has a symbolic offset. Then we somehow 
need to add an integer to a region, and I have no idea how to do that 
without adding some new kind of key into the region store.

At one point we tried to solve the actualization of store keys by 
restoring the original sub-region from the known base and offset. This 
almost worked, but was extremely hacky and inevitably still unreliable, 
so we discarded this approach and went for simpler conversions of store 
keys.

So this is still a point for discussion.


=== Other stuff i forgot to mention earlier: ===


We had to modify the SVal hierarchy in order to make actualization 
work. Namely, we introduced an auxiliary atomic symbol kind, called 
SymbolRegionAddress, which represents the unknown address of a 
non-symbolic region. Why do we need that? Consider an example:
   void foo(int *ptr) {
     if (!ptr) {}
   }
   void bar() {
     int a = 1;
     foo(&a);
   }
The summary of the function foo() contains ranges on reg_$0<ptr>. When 
we actualize this range, we figure out that we have no good symbol kind 
to represent the result, to carry the range, etc. So we call the 
resulting symbol "addr_$0<a>". With this, by looking at the kind of the 
symbol we receive, we can discard the "ptr == 0" branch of the summary, 
even though we never obtain "&a" as the value of the condition (only 
symbols are stored in RangeConstraintManager). Additionally, such a 
symbol is extremely convenient when we think of actualization as a 
recursive procedure: if we want to actualize &SymRegion{reg_$0<ptr>} to 
&a, which is the right thing to do, then what do we recursively 
actualize reg_$0<ptr> to? We also implemented the same hack for label 
addresses, creating SymbolLabelAddress. Well, this funny implementation 
detail is probably worth another discussion and a separate commit; if I 
think about it more deeply, I can probably even find inlining-related 
test cases to support this feature.
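The way such an address symbol lets us discard the "ptr == 0" branch can 
be modeled with a small feasibility check (hypothetical names; in the 
real prototype the decision falls out of range actualization):

```cpp
#include <cassert>
#include <string>

// Hypothetical model: a summary symbol either stays a plain region-value
// symbol after actualization, or actualizes into the address of a
// concrete region (the SymbolRegionAddress case, e.g. "addr_$0<a>").
enum class SymKind { RegionValue, RegionAddress };
struct Sym {
  SymKind Kind;
  std::string Name;
};

// A branch whose pre-condition requires the symbol to be null is
// infeasible once the symbol actualizes into a region address: the
// address of a local like 'a' is known to be non-null.
bool branchFeasible(const Sym &Actualized, bool PrecondRequiresNull) {
  if (PrecondRequiresNull && Actualized.Kind == SymKind::RegionAddress)
    return false; // discard e.g. the "ptr == 0" branch
  return true;
}
```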


