[cfe-dev] [analyzer] Summary IPA thoughts

Anna Zaks via cfe-dev cfe-dev at lists.llvm.org
Wed Nov 4 16:26:05 PST 2015


Hi Artem and Alexey,

I have not looked at the implementation yet but here are some higher-level comments about the approach, based on a discussion I had with Ted and Devin.

We do believe that summary-based IPA is required in order to scale the analyzer to work across translation-unit boundaries (whole-project analysis), so it is great to see progress in this direction.

Here are the strong goals that we believe in:
- Suitable for XTU analysis.
- Checkers should be very easy to write. There are many more checker developers (than analyzer core developers), so we should make checker development as easy as possible.

-----------
There are two mainstream approaches to summary-based analysis.
1) The general summary approach makes no assumptions about context (the precondition is `true'). We generate summaries by analyzing each function at top level and then re-analyze the code using the generated summaries. You can either stop after the second iteration or iterate until a fixed point is reached.
  - This is the summary approach you are currently using.
  - Extending this to XTU analysis is simpler than the other approaches. For example, we could get away with performing just 2 iterations through all translation units. 
 - We can use a bottom-up processing order for visiting functions.
 - Modeling the heap is tricky because there is no information about the heap when the summary is generated.
 - The main limitation is how it deals with reasoning about the "Unknowns" in the checkers. The special handling of unknowns is prevalent throughout the checkers (see the sketch below). If we just run the analyzer on the function to generate a summary, the summary would perform a very conservative modeling of the Unknowns, which is likely to result in false negatives. We would need to either:
   - Split the state on all Unknowns, which is horrible for performance.
   - Replay the decisions that occur on Unknowns, which complicates the checker API. Also, how would we deal with interdependent checkers and symbol invalidation? There is information loss on replay after the fact: it is hard to ensure that the result of the replay is exactly the same as evaluating the function in the given context. Replay would also incur some runtime overhead.
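
For concreteness, here is the kind of Unknown-handling pattern that is pervasive in checkers today; this is a hypothetical sketch, not code from any particular checker, and it assumes the usual checker boilerplate:

  // A type-state checker deciding what to do with an argument value.
  void MyChecker::checkPreCall(const CallEvent &Call,
                               CheckerContext &C) const {
    SVal Arg = Call.getArgSVal(0);
    if (Arg.isUnknown())
      return; // conservative: no symbol to track, so stay silent
    if (SymbolRef Sym = Arg.getAsSymbol()) {
      // Known symbol: consult/update the type state, possibly warn.
    }
  }

A context-free summary generated at top level bakes in the conservative branch, which is where the false negatives come from.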
2) The context-sensitive summary approach based on assumption sets generates summaries lazily. When we evaluate a call site, we check whether we already have a summary that is suitable for use there (a summary with a matching pre-condition) and generate one if it is not available.
 - This approach solves the “Unknowns” issue by design.
 - Evaluation order could be similar to inlining, but there is no clear/guaranteed two-phase approach.
 - How do we decide whether a summary is suitable/a precondition is matching? It's an open question, but here are some ideas:
   - Using the entire entry state as a precondition will result in too many summaries.
   - We could record the conditions assumed along the path and use that as the summary's precondition.
   - Do we use the entry state somehow? We probably want to track anything on the heap that is used.
   - We could throw away constraints on unused symbols.
   - We should also keep the symbols that the checkers touch.
   - State alone is not enough in the context-sensitive approach; you also need to look at the guards that were evaluated, as in this example (a toy model follows the list):

     // pre-condition of foo at the call site: {x == 1}
     foo() {
       if (x > 2) {   // guard evaluated inside the call
         sum1;        // this path is covered by summary 1
       } else {
         sum2;        // this path is covered by summary 2
       }
     }
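
A minimal toy model of such guard-based matching, assuming (naively) that a summary applies only when every recorded guard definitively holds in the caller's state; the names are illustrative, not analyzer API:

  #include <functional>
  #include <vector>

  // A recorded guard is a predicate over the caller's entry state.
  template <typename State>
  struct GuardedSummary {
    std::vector<std::function<bool(const State &)>> Guards;

    // The summary matches iff every guard recorded during its
    // construction also holds at this call site.
    bool matches(const State &Caller) const {
      for (const auto &Guard : Guards)
        if (!Guard(Caller))
          return false; // guard contradicted or not implied: regenerate
      return true;
    }
  };

In reality the check would be three-valued (a guard can hold, fail, or be unconstrained in the caller), which is part of the open question above.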
-----------
How do we deal with the heap in either approach? Do we include the heap in the summary? A challenge is how to make sure the summary is not tied to the specific representation of the heap. How are you dealing with the heap during summary application?

Symbols might get invalidated/escape during a function call. How should we deal with symbol invalidation?
- Note that invalidating a symbol can invalidate the reachable heap.
- In the general summary case, how do we invalidate if we don't know what the heap looks like?
- In the general summary approach, we need to replay invalidation in the right order relative to the other "events".
- In the assumption-sets approach, invalidation is part of alpha renaming.

My intuition is that the assumption-sets-based approach will yield better results and would allow us to keep checker development easy, though it might require more work to extend it to XTU analysis.

Sorry for the delayed response. We can schedule a Skype call if you want to talk about this more face-to-face.

Cheers.
Anna.

> On Oct 22, 2015, at 5:11 AM, Artem Dergachev via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> 
> Wow, that's a lot of stuff!
> Thanks a lot for all your encouragement!
> 
> 
> === Regarding inter-unit and purposes: ===
> 
> 
> Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
> > Is this within a translation unit? Are you planning to make this
> > method usable across translation units?
> 
> Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> > This sounds interesting! If I understand correctly, your goal was to
> > improve analyzer performance by avoiding inlining? Could we also use
> > this high-level approach to enable cross-translation unit analysis?
> 
> Ted Kremenek via cfe-dev cfe-dev at lists.llvm.org wrote:
> > Is the participation needed for serializing out state if we want to
> > make these summaries persistent?
> 
> 
> We do eventually aim for cross-translation-unit analysis as well; however, it's quite a bit of a separate story. Oh, I guess Alexey has already covered that. Apart from this, of course, we also want to improve the time performance of path-sensitive checks inside a single translation unit, because that matters when the analyzer is integrated into the build systems of large projects; we wish to reduce analysis overhead in such use cases.
> 
> We have not yet tried any sort of serialization of summaries. Once we do, we'd most likely need extra checker-side support to serialize their data.
> 
> 
> === Regarding the checker-side support: ===
> 
> 
> Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
> > What makes the help from the checkers inevitable here?
> 
> Ted Kremenek via cfe-dev cfe-dev at lists.llvm.org wrote:
> > What kind of participation is needed from a checker author to
> > participate in this design? Since it is based on the ExplodedGraph,
> > it would seem their participation would be automatic.
> 
> Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> > Can you tell us more about these delayed checks? I’m particularly
> > wondering about type-state-like checks that don’t issue warnings on
> > unknown values.
> >
> > For example, suppose you had a checker like SimpleStreamChecker that
> > issues a warning on a double close. If a symbol ref for a file
> > descriptor is marked as closed in its state then the checker issues a
> > warning, but if the symbol ref is unknown, it just transitions it to
> > the closed type state.
> >
> > So if you had:
> >
> > FILE *f = ...
> > fclose(f);
> > myFunction(f);
> >
> > where
> >
> > void myFunction(FILE *p) {
> >   fclose(p);
> > }
> >
> > Then with inlining you would get a diagnostic about a double close at
> > fclose(p). But with a summary-based approach like you suggest the
> > unmodified SimpleStreamChecker wouldn’t issue a warning. How would
> > you recommend changing such a checker to handle summaries? Are there
> > common patterns here that we could expose to checker writers?
> 
> 
> Precisely: the code sample provided by Devin Coughlin, together with the "opposite" sample
>  FILE *f = ...;
>  myFunction(f);
>  fclose(f);
> is the simplest motivational example showing that checker-side support is inevitable. There's no defect in myFunction() itself; however, it changes the GDM (which is visible only to checkers) in a way that causes the warning.
> 
> We did plan to make some "How to add summary IPA support to your checker in 24 hours" how-to :)
> 
> The modification we need/propose to let SimpleStreamChecker conduct its double-close checks with summaries is as follows (a sketch of step 1 follows the list):
> 
>  1. Store a list of all open/close operations in the program state as a separate list-type trait, with node pointers, pointer symbols, and operation types.
>  2. In the summary-apply callback, iterate over this list found in the summary leaf, actualize the symbols, and re-model the operations on top of the caller state:
>    2a. Check whether the operation is valid, and report a double close if it is not; supply the stored node to the bug reporter so that it can glue the report path together in Devin Coughlin's example and throw the report against the correct node. Storing nodes is also useful for implementing bug visitors.
>    2b. Mark the relevant symbols as open or closed in the caller context. This handles the "opposite" example, which would throw the report in the caller context after the callee's changes to the GDM.
>    2c. Repeat these changes in the caller's own operation list, so that deeper call stacks are supported. If a bug report is constructed through a deep call stack, the checker needs to provide the BugReporter with a stack of nodes.
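> 
> A minimal sketch of step 1 follows; the names (StreamOp, StreamOps, FileDesc) are illustrative rather than taken from our patch, and the usual checker includes and registration boilerplate are assumed:
> 
>  namespace {
>  // One recorded open/close operation, kept in the program state so that
>  // a summary-apply callback can later replay it in the caller's context.
>  struct StreamOp {
>    const ExplodedNode *Node; // node to hand to the bug reporter on replay
>    SymbolRef Stream;         // the file-descriptor symbol
>    bool IsClose;             // operation type: open or close
> 
>    bool operator==(const StreamOp &O) const {
>      return Node == O.Node && Stream == O.Stream && IsClose == O.IsClose;
>    }
>    void Profile(llvm::FoldingSetNodeID &ID) const {
>      ID.AddPointer(Node);
>      ID.AddPointer(Stream);
>      ID.AddBoolean(IsClose);
>    }
>  };
>  } // namespace
> 
>  // The separate list-type trait holding all operations modeled on this path.
>  REGISTER_LIST_WITH_PROGRAMSTATE(StreamOps, StreamOp)
> 
>  // In the checker's existing close-modeling code, additionally record the
>  // operation (FileDesc being the closed stream's symbol):
>  //   ProgramStateRef State = C.getState();
>  //   State = State->add<StreamOps>(
>  //       StreamOp{C.getPredecessor(), FileDesc, /*IsClose=*/true});
>  //   C.addTransition(State);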
> 
> Well, roughly, those are the required changes. Most checkers that work with the GDM in a similar manner would require very similar facilities; in fact, this approach worked for the ~6 checkers we tried it on, some of which were pretty weird (e.g. statistical analysis of the whole graph). We measured productivity on only two of them, because the other checkers are too quiet.
> 
> It's not as scary as it sounds: probably ~20 lines of code per checker with a small number of GDM entries.
> 
> On how to avoid modifying checkers: it might be possible to simplify this procedure significantly, but only by throwing away the idea of the opaque *Generic* data map in the state and replacing it with a limited set of structures, so that the analyzer engine could handle most of the actualization itself. This approach seems worse, as it greatly decreases flexibility, limiting the number of checkers we can implement without adding stuff into the core, so we quickly realized it's not really worth trying.
> 
> So I think, yeah, modifying checkers is inevitable. They need to have their own summary of the branch and need to know how to apply it.
> 
> Todo for myself: we haven't yet had a chance to think deeply about leak-related checks, e.g. the lifetime of symbols created during actualization, how symbols leak through/inside summaries, or whether to avoid actualizing symbols that would die instantly.
> 
> 
> === Regarding the ranges and reachability: ===
> 
> 
> Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
> > I think reachability is a very important question. How do you
> > associate the symbols, the constraints and the branches?
> 
> Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
> > In general, it is great, to have a summary based approach to
> > interprocedural analysis, but could you list the high level goals of
> > your approach and what improvements are you expecting?
> 
> Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> > One consequence of analyzing the function at the top level is that
> > the summary produced makes no assumptions about context. This means
> > the summary is always applicable (i.e., the precondition holds) but
> > also that applying it may yield many more paths than are actually
> > feasible.
> 
> Yeah, so here is what we do: we consider each non-sink[1] leaf in the ExplodedGraph of the summary separately. In this leaf, we take all info stored by RangeConstraintManager - ranges on all symbols, including ranges on dead symbols. These are *the* pre-conditions[2] for the branch. Unless we have sinks, the disjunction of these pre-conditions over all branches is `true'[3].
> 
> For every branch, since RangeConstraintManager maps symbols to ranges, we actualize these symbols according to the context. After this, we search for ranges on the same symbols in the caller context. If any of the summary's ranges contradict (do not overlap) the caller's ranges, we throw away the whole branch. Hence, we avoid applying impossible branches. This is done before any other work on the branch (e.g. checker or GDM processing), so it saves time; see the toy model below.
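> 
> A toy model of the pruning step, with illustrative stand-ins for analyzer entities (a symbol is reduced to an identifier and a constraint to a single closed interval); this is a sketch of the idea, not our implementation:
> 
>  #include <algorithm>
>  #include <cstdint>
>  #include <map>
>  #include <optional>
> 
>  using Symbol = unsigned;          // an already-actualized symbol id
>  struct Range { int64_t Lo, Hi; }; // one closed interval
> 
>  static std::optional<Range> intersect(Range A, Range B) {
>    Range R{std::max(A.Lo, B.Lo), std::min(A.Hi, B.Hi)};
>    if (R.Lo > R.Hi)
>      return std::nullopt;          // no overlap: empty intersection
>    return R;
>  }
> 
>  // A branch is applicable only if every actualized summary constraint
>  // overlaps the caller's constraint on the same symbol.
>  static bool branchIsFeasible(const std::map<Symbol, Range> &SummaryRanges,
>                               const std::map<Symbol, Range> &CallerRanges) {
>    for (const auto &[Sym, SumR] : SummaryRanges) {
>      auto It = CallerRanges.find(Sym);
>      if (It != CallerRanges.end() && !intersect(SumR, It->second))
>        return false;               // contradiction: drop the whole branch
>    }
>    return true;
>  }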
> 
> It should be possible to optimize this procedure by joining branches that carry completely identical range information[4], though we haven't coded this yet.
> 
> This procedure interacts with delayed checks in a fancy manner; namely, it opens up warnings like:
>  void bar(int *b) {
>    foo(b); // [1] calling foo()
>  }
>  void foo(int *a) {
>    *a = 0;
>    if (a == 0) {
>      // [2] warn: null-dereference
>    }
>  }
> during summary-apply, because constraints on 'a' would be present in the summary's input constraints. It may or may not be worth suppressing such warnings, because they can be really unobvious; alternatively, we could modify the bug reporter to make them more obvious by drawing later paths through the callee.
> 
> /* [1] Just realized that omitting sink branches is not really a sound decision. Other checkers may have delayed checks inside these branches, which would cause a warning before the sink, and we lose that warning with our approach. */
> 
> /* [2] It is possible that checkers add their own checker-specific pre-conditions, but we haven't yet encountered any sensible practical use-cases. */
> 
> /* [3] Note that if the summary is incomplete (e.g. we exceeded max-nodes or block-count for the summary), then we fall back to conservative evaluation, so all branches are either complete or sinks by the time we actually apply any summary. */
> 
> /* [4] This would most likely (though not always) mean that they only differ in their GDM information, e.g. the result of a state split done by checkers. */
> 
> 
> === Regarding performance: ===
> 
> 
> Anna Zaks via cfe-dev cfe-dev at lists.llvm.org wrote:
> > You can also measure the number of basic blocks visited if you want
> > to compare coverage. Take a look at debug.Stats checker and
> > -analyzer-stats option. These might give you a better understanding
> > of what causes the slowdown.
> 
> That's a great idea; we haven't tried that yet. We tried to estimate the node count including the "summarized" nodes, and got some funny numbers (e.g. the summary-based analysis goes through ~20 times more nodes per second).
> 
> 
> Anna Zaks via cfe-dev cfe-dev at lists.llvm.org wrote:
> > We use various heuristics for balancing precision and performance for
> > inlining. We do not actually inline all functions. This is why you
> > see more reports with your summaries. Theoretically, inlining should
> > always be more precise than any summary. There are several parameters
> > controlling this apart from max-nodes:
> > - We do not inline functions that are too large -analyzer-config
> > max-inlinable-size (50 basic blocks).
> > - We do not inline large functions (min-cfg-size-treat-functions-
> > as-large) too many times (max-times-inline-large).
> 
> We hacked directly into inlineCall(), so we should be ipa-calling the same set of functions, I guess. However, we have not yet worked on the BifurcateCall mechanism.
> 
> 
> Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> > You mentioned in a follow-up email that you see a ~5x increase in
> > memory usage with your prototype. Do you have a sense of where this cost
> > comes from? Is it keeping the summaries themselves?
> 
> Partially from storing summaries, and partially from the fact that we actually do a lot more analysis: with the same max-nodes, we go through many more paths and create heavier nodes with more stuff in their states. So it's not really fair to compare analysis runs for summary and inlining with the same max-nodes: by design, the summary run would be slower and heavier in that case, while finding deeper defects.
> 
> 
> Ted Kremenek via cfe-dev cfe-dev at lists.llvm.org wrote:
> > You can possibly trim the summary down substantially by doing a BFS
> > from the exit/error nodes of the graph to the entrance nodes. We use
> > this trick when displaying trimmed graphs in GraphViz to show a path
> > to an error node. That could result in a massive savings.
> 
> Yeah, that's one more possible memory-usage optimization, thanks for the idea! Other things we have tried include removing some summary-irrelevant program state traits in checkEndAnalysis, reclaiming exploded graph nodes more aggressively than usual once summary construction is over, and re-using memory allocators across analyses (most importantly the exploded-node allocator; otherwise freeing nodes doesn't save any memory).
> 
> 
> === Miscellaneous: ===
> 
> 
> Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
> > > Step (c.ii): Actualize constraint keys, intersect "new" and "old"
> > > ranges, throw away the branch if intersection is empty, otherwise
> > > constrain the actual symbol with the intersected range. As we
> > > mentioned on step (a), this requires a separate pass through reaped
> > > symbols which are stored separately.
> > Does this modify the summary? Or do you make a copy?
> 
> We create a new ProgramState that is based on the original caller-context state, with new range constraints for the symbols: we either add new (symbol, range) pairs, if the symbol was not yet constrained, or replace the range with the computed intersection.
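> 
> Continuing the toy model from the pruning sketch above, the state update described here would look roughly like this (again illustrative, not our implementation; intersect() is the helper from that sketch, and feasibility has already been checked, so the intersections are non-empty):
> 
>  // Build the caller's new constraint map: add (symbol, range) pairs for
>  // previously unconstrained symbols, and replace existing ranges with
>  // the computed intersections.
>  static std::map<Symbol, Range>
>  applyConstraints(std::map<Symbol, Range> CallerRanges,
>                   const std::map<Symbol, Range> &SummaryRanges) {
>    for (const auto &[Sym, SumR] : SummaryRanges) {
>      auto It = CallerRanges.find(Sym);
>      if (It == CallerRanges.end())
>        CallerRanges.emplace(Sym, SumR);           // symbol was unconstrained
>      else
>        It->second = *intersect(SumR, It->second); // non-empty by assumption
>    }
>    return CallerRanges;
>  }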
> 
> 
> Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> > > Step (c.iv): Path-sensitive bug reports are constructed by
> > > frankensteining the report path from different pieces of different
> > > exploded graphs, be it a report that originates from the callee (and
> > > hence thrown on step (c.iii) from eval::SummaryApply), or an
> > > intermediate call that exited before the bug (which may be
> > > interesting for the bug reporter visitors). The only trouble we
> > > currently have is the "Assuming..." pieces, which may be redundant,
> > > as no assumption was made, but rather the context determined the
> > > branch.
> > In some cases it seems like it ought to be possible to query the
> > state at “assuming…” to determine whether the context always implies
> > that the assumption holds and elide the piece if not needed. Does the
> > context always determine that the branch? Similarly, can there be
> > cases where a path is dead (the branch could not be taken based on
> > the actual) but the actualized post-state does not have enough
> > information about to yield a contradiction? If I understand
> > correctly, this is why you keep the ranges of dead symbols around?
> > Would you also need need to keep checker state for these symbols
> > around so they can detect contradictions when applying the summary?
> 
> Yeah, this problem can be solved that way! We just haven't taken up this task yet. However, storing ranges for dead symbols is inevitable anyway, as there may be a lot of significant pre-conditions (e.g. the function relies on the value of a global variable and then overwrites it; the value symbol of this variable dies, yet we still need to discard branches if we know something about this symbol in the caller context).
> 
> 
> Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> > > - Only default implementations of Store Manager and Constraint
> > > Manager are supported;
> > By this you mean the RegionStoreManager and RangeConstraintManager,
> > right?
> 
> Yes, right; we haven't yet thought deeply about how exactly to abstract away from the fact that RangeConstraintManager uses symbol-range pairs and RegionStoreManager uses base-offset-value triples, i.e. what sort of virtual functions should be defined in the abstract ConstraintManager and StoreManager classes in order to stay as flexible as possible.
> 
> 
> Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> > > Step (c.v): Actualize the binding keys and values, overwrite
> > > existing values. Invalidation, as in writing some SymbolConjured to
> > > some memory space, is handled similarly, before all bindings. We do
> > > this last, because it would affect actualization (eg.
> > > SymbolRegionValue should actualize itself to the value of the
> > > region before the call, not after the call; quite a sensitive point
> > > here, but seems to work this way).
> > Given the region store’s differing representations of concrete and
> > symbolic offsets, this sounds like it could be quite tricky. For
> > example, does this work for parameter-dependent offsets in the
> > summary post-state where in the summary the parameter is symbolic but
> > at the call site the parameter is concrete? More generally, I’d love
> > to hear more about how you apply summaries to the region store, how
> > you identify the frame and the footprint, how you unify symbolic
> > variables, etc.
> 
> Unfortunately, we have not yet handled symbolic offsets in the store properly, and some parts of that actually pose a nice problem (it's also surprising that we haven't run into many issues because of this yet).
> 
> For concrete offsets, we had to extend the StoreManager::BindingsHandler to provide complete information about the binding keys, not just the base. Then we actualize the base of the binding. Then we split the actualized binding base into its own base and offset, because a base region may actualize into a non-base region. Then we *add* the offset of the actualized base to the original offset of the binding, obtaining the complete offset of the actualized binding. Finally, we actualize the store value and bind it to the actualized binding key, possibly replacing an existing binding; see the toy model below.
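> 
> To make the "add the offsets" step concrete, here is a toy model of the key arithmetic (hypothetical names, not the analyzer's real BindingKey machinery): a region is reduced to a base identifier plus a concrete byte offset, and the actualization table maps each summary base region to the caller-side region it stands for.
> 
>  #include <cstdint>
>  #include <map>
> 
>  struct Region { unsigned Base; int64_t Offset; }; // toy region model
> 
>  // Maps a summary base region to the caller region it actualizes into
>  // (which need not itself be a base region).
>  using ActualizationMap = std::map<unsigned, Region>;
> 
>  // Rebind a summary binding key into the caller's store: actualize the
>  // base, split the result into its own base and offset, and *add* the
>  // offsets to obtain the complete caller-side key.
>  static Region actualizeKey(Region SummaryKey, const ActualizationMap &M) {
>    Region Actual = M.at(SummaryKey.Base);
>    return Region{Actual.Base, Actual.Offset + SummaryKey.Offset};
>  }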
> 
> We also skip bindings to stack locals, and discriminate between direct and default bindings. Another important thing to notice is that when we actualize all bindings, we use the original store (without any new bindings) as the reference.
> 
> For symbolic offsets in the summary, I guess we should do a similar thing. Take the binding key, which is a base region and its sub-region representing a symbolic offset. Then discard the base region, take only the offset region, and actualize it. Then split the actualized region into a base and an offset, which can be either concrete or symbolic. The actualization procedure would automatically replace a symbolic index inside any element region with a concrete index if the symbolic index is constrained to a constant or actualized into a constant. Additionally, if we have (a+i)+j, where 'a' is an array, '+i' is done in the caller context, and '+j' is done in the callee context, then the actualization procedure would convert it automatically into 'a+(i+j)' (it detects such cases when it tries to actualize the super-region of an ElementRegion and obtains another ElementRegion as an answer), which makes it easy to add symbolic offsets where necessary. Then actualize the value and bind it to the actual key.
> 
> Well, the real problem arises when we have a concrete offset in the summary, but the actualized base has a symbolic offset. Then we somehow need to add an integer to a region, and I've no idea how to do that without adding some new kind of key to the region store.
> 
> At one point we tried to solve the actualization of store keys by restoring the original sub-region from the known base and offset. This almost worked, but was extremely hacky and inevitably unreliable, so we discarded that approach and went for simpler conversions of store keys.
> 
> So this is still a point for discussion.
> 
> 
> === Other stuff I forgot to mention earlier: ===
> 
> 
> We had to modify the SVal hierarchy in order to make actualization work. Namely, we introduced an auxiliary atomic symbol kind, called SymbolRegionAddress, which represents the unknown address of a non-symbolic region. Why do we need it? Consider an example:
>  void foo(int *ptr) {
>    if (!ptr) {}
>  }
>  void bar() {
>    int a = 1;
>    foo(&a);
>  }
> The summary of the function foo() contains ranges on reg_$0<ptr>. When we actualize this range, we find that we have no good symbol kind to represent the result, carry the range, etc. So we call the resulting symbol "addr_$0<a>". With this, by looking at the kind of the symbol we receive, we can discard the "ptr == 0" branch of the summary, even though we never obtain "&a" as the value of the condition (only symbols are stored in RangeConstraintManager). Additionally, such a symbol is extremely convenient when we think of actualization as a recursive procedure: if we want to actualize &SymRegion{reg_$0<ptr>} to &a, which is the right thing to do, then to what do we recursively actualize reg_$0<ptr>? We implemented the same hack for label addresses, creating SymbolLabelAddress. Well, this funny implementation detail is probably worth another discussion and a separate commit; maybe if I think deeper I can even find inlining-related test cases to support this feature.
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
