[cfe-dev] [analyzer] Summary IPA thoughts
Artem Dergachev via cfe-dev
cfe-dev at lists.llvm.org
Thu Oct 22 05:11:01 PDT 2015
Wow, that's a lot of stuff!
Thanks a lot for all your encouragement!
=== Regarding inter-unit analysis and purposes: ===
Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
> Is this within a translation unit? Are you planning to make this
> method usable across translation units?
Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> This sounds interesting! If I understand correctly, your goal was to
> improve analyzer performance by avoiding inlining? Could we also use
> this high-level approach to enable cross-translation unit analysis?
Ted Kremenek via cfe-dev cfe-dev at lists.llvm.org wrote:
> Is the participation needed for serializing out state if we want to
> make these summaries persistent?
We do eventually aim for cross-translation-unit analysis as well;
however, that's quite a bit of a separate story. Oh, I guess Alexey has
already covered that. Apart from this, of course, we also want to
improve the time-performance of path-sensitive checks inside a single
translation unit: that matters when the analyzer is integrated into the
build systems of large projects, where we wish to reduce analysis
overhead.
We did not yet try out any sort of serialization of summaries. Once we
do, then, yeah, we would most likely need extra checker-side support to
serialize checker data.
=== Regarding the checker-side support: ===
Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
> What makes the help from the checkers inevitable here?
Ted Kremenek via cfe-dev cfe-dev at lists.llvm.org wrote:
> What kind of participation is needed from a checker author to
> participate in this design? Since it is based on the ExplodedGraph,
> it would seem their participation would be automatic.
Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> Can you tell us more about these delayed checks? I’m particularly
> wondering about type-state-like checks that don’t issue warnings on
> unknown values.
>
> For example, suppose you had a checker like SimpleStreamChecker that
> issues a warning on a double close. If a symbol ref for a file
> descriptor is marked as closed in its state then the checker issues a
> warning, but if the symbol ref is unknown, it just transitions it to
> the closed type state.
>
> So if you had:
>
> FILE *f = ...;
> fclose(f);
> myFunction(f);
>
> where
>
> void myFunction(FILE *p) {
>   fclose(p);
> }
>
> Then with inlining you would get a diagnostic about a double close at
> fclose(p). But with a summary-based approach like you suggest the
> unmodified SimpleStreamChecker wouldn’t issue a warning. How would
> you recommend changing such a checker to handle summaries? Are there
> common patterns here that we could expose to checker writers?
Precisely: the code sample provided by Devin Coughlin, together with the
"opposite" sample:
FILE *f = ...;
myFunction(f);
fclose(f);
is the simplest motivational example showing that checker-side support
is inevitable. There's no defect in myFunction() itself; however, it
changes the GDM (which is visible to checkers only) in a way that causes
the warning.
We did plan to write a "How to add summary IPA support to your checker
in 24 hours" how-to :)
The modification our approach needs/proposes to let SimpleStreamChecker
conduct its double-close checks with summaries is as follows:
1. Store a list of all open/close operations in the program state as
a separate list-type trait, with node pointers, pointer symbols, and
operation types.
2. In the summary-apply callback, iterate over this list found in the
summary leaf, actualize the symbols, and re-model the operations on top
of the caller state:
2a. See if the operation is valid, and report a double close if not;
supply the stored node to the bug reporter so that it can glue the
report path together in Devin Coughlin's example and throw the report
against the correct node; storing nodes is also useful for implementing
bug visitors.
2b. Mark the relevant symbols as open or closed in the caller context.
This lets us handle the "opposite" example, which would throw the
report in the caller context after changes in the GDM caused by the
callee.
2c. Repeat these changes in the caller's operation list, so that deeper
call stacks are supported. If a bug report is constructed through a
deep call stack, the checker would need to provide the BugReporter with
a stack of nodes.
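To make the recipe above concrete, here is a schematic, self-contained sketch of the replay step. All names (StreamOp, applySummary, the integer symbol ids, the plain maps standing in for actualization and program state) are invented for illustration; this is not the real analyzer API.

```cpp
#include <cassert>
#include <map>
#include <vector>

enum class StreamOpKind { Open, Close };

struct StreamOp {
  int Sym;           // stands in for the callee-side SymbolRef
  StreamOpKind Kind;
  // The real checker would also carry the ExplodedNode pointer here,
  // so the bug reporter can glue the report path together.
};

enum class StreamState { Opened, Closed };

// Step 2: replay the callee's recorded operations on top of the caller
// state. 'Actualize' maps callee symbols to caller symbols; here it is
// a plain map standing in for the real actualization procedure.
// Returns the symbol on which a double close was detected, or -1.
int applySummary(const std::vector<StreamOp> &SummaryOps,
                 const std::map<int, int> &Actualize,
                 std::map<int, StreamState> &CallerState,
                 std::vector<StreamOp> &CallerOps) {
  for (const StreamOp &Op : SummaryOps) {
    int CallerSym = Actualize.at(Op.Sym);
    if (Op.Kind == StreamOpKind::Close) {
      auto It = CallerState.find(CallerSym);
      // Step 2a: the operation is invalid if already closed.
      if (It != CallerState.end() && It->second == StreamState::Closed)
        return CallerSym;
      // Step 2b: mark the symbol closed in the caller context.
      CallerState[CallerSym] = StreamState::Closed;
    } else {
      CallerState[CallerSym] = StreamState::Opened;
    }
    // Step 2c: repeat the operation in the caller's own list, so that
    // deeper call stacks keep working.
    CallerOps.push_back({CallerSym, Op.Kind});
  }
  return -1;
}
```

In Devin's example the caller has already closed the stream, so replaying the callee's close reports a double close; in the "opposite" example the replay marks the symbol closed so the caller's later fclose() reports it.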
Well, roughly, these are the things. Most checkers that work with the
GDM in a similar manner would require very similar facilities; in fact,
this approach worked for ~6 checkers we tried it upon, some of which
were pretty weird (e.g. statistical analysis of the whole graph). We
measure productivity on only two of them, because the other checkers
are too quiet.
It's not as scary as it sounds: probably ~20 lines of code per checker
with a small number of GDM entries.
On how to avoid modifying checkers: it might be possible to simplify
this procedure significantly, but only by throwing away the idea of the
opaque *Generic* data map in the state, replacing it with a limited set
of structures so that most of the actualization could be handled by the
analyzer engine. This approach seems worse, as it greatly decreases
flexibility, limiting the number of checkers we can implement without
adding stuff into the core, so we quickly realized it's not really
worth trying.
So I think, yeah, modifying checkers is inevitable. They need to have
their own summary of the branch, and need to know how to apply it.
To-do for myself: we didn't yet have a chance to think deeply about
leak-related checks, e.g. the lifetime of symbols created during
actualization, how symbols leak through/inside summaries, or perhaps
avoiding actualization of symbols that would die instantly.
=== Regarding the ranges and reachability: ===
Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
> I think reachability is a very important question. How do you
> associate the symbols, the constraints and the branches?
Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
> In general, it is great, to have a summary based approach to
> interprocedural analysis, but could you list the high level goals of
> your approach and what improvements are you expecting?
Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> One consequence of analyzing the function at the top level is that
> the summary produced makes no assumptions about context. This means
> the summary is always applicable (i.e., the precondition holds) but
> also that applying it may yield many more paths than are actually
> feasible.
Yeah, so here is what we do: we consider each non-sink[1] leaf in the
ExplodedGraph of the summary separately. In this leaf, we take all info
stored by RangeConstraintManager: ranges on all symbols, including
ranges on dead symbols. These are *the* pre-conditions[2] for the
branch. Unless we have sinks, the disjunction of these pre-conditions
over all branches is `true'[3].
For every branch, since RangeConstraintManager maps symbols to ranges,
we actualize these symbols according to the context. After this, we
search for ranges on the same symbols in the caller context. If any of
the caller's ranges contradict (do not overlap) the summary's ranges,
we throw away the whole branch. Hence, we avoid applying impossible
branches. This is done before any other work on the branch (e.g.
checker or GDM calls), so it saves time.
It should be possible to optimize this procedure by joining branches
with completely equal range information[4], though we haven't coded
this yet.
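As a rough illustration of the pruning step above (not the actual RangeConstraintManager code: a single [lo, hi] interval per integer symbol id stands in for its range sets, and applyBranchConstraints is an invented name):

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <utility>

using Range = std::pair<long, long>;   // inclusive [lo, hi]
using RangeMap = std::map<int, Range>; // symbol id -> range

// Intersect the summary branch's (already actualized) pre-condition
// ranges with the caller's ranges. Returns false if the branch is
// infeasible in the caller context; otherwise leaves the intersected
// constraints in 'Caller', which models building the new ProgramState.
bool applyBranchConstraints(const RangeMap &SummaryPre, RangeMap &Caller) {
  for (const auto &KV : SummaryPre) {
    auto It = Caller.find(KV.first);
    if (It == Caller.end()) {
      // Symbol unconstrained in the caller: adopt the summary's range.
      Caller.insert(KV);
      continue;
    }
    long Lo = std::max(KV.second.first, It->second.first);
    long Hi = std::min(KV.second.second, It->second.second);
    if (Lo > Hi)
      return false;          // contradiction: throw away the whole branch
    It->second = {Lo, Hi};   // replace with the intersection
  }
  return true;
}
```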
This procedure interacts with delayed checks in a fancy manner, namely
it opens up warnings like:
void bar(int *b) {
  foo(b); // [1] calling foo()
}
void foo(int *a) {
  *a = 0;
  if (a == 0) {
    // [2] warn: null-dereference
  }
}
during summary-apply, because constraints on 'a' would be present in
the summary's input constraints. It may or may not be worth suppressing
such warnings, because they may be really unobvious; alternatively, we
could modify the bug reporter to make them more obvious by drawing
later path pieces through the callee.
/* [1] Just realized that actually omitting sink-branches is not really
a sound decision. Other checkers may have delayed checks inside these
branches, which would cause a warning before the sink, and we lose such
warnings with our approach. */
/* [2] It is possible that checkers add their own checker-specific
pre-conditions, but we haven't yet encountered any sensible practical
use-cases. */
/* [3] Note that if the summary is incomplete (eg. we exceeded max-nodes
or block-count for the summary), then we go for conservative evaluation,
so all branches are either complete or sinks when we actually apply any
summary. */
/* [4] This would most likely (though not always) mean that they only
differ in their GDM information, e.g. the result of a state split done
by checkers. */
=== Regarding performance: ===
Anna Zaks via cfe-dev cfe-dev at lists.llvm.org wrote:
> You can also measure the number of basic blocks visited if you want
> to compare coverage. Take a look at debug.Stats checker and
> -analyzer-stats option. These might give you a better understanding
> of what causes the slowdown.
That's a great idea; we didn't try that yet. We tried to estimate node
count including the "summarized" nodes, and got some funny numbers
(e.g. the summary run analyzes ~20 times more nodes per second).
Anna Zaks via cfe-dev cfe-dev at lists.llvm.org wrote:
> We use various heuristics for balancing precision and performance for
> inlining. We do not actually inline all functions. This is why you
> see more reports with your summaries. Theoretically, inlining should
> always be more precise than any summary. There are several parameters
> controlling this apart from max-nodes:
> - We do not inline functions that are too large: -analyzer-config
> max-inlinable-size (50 basic blocks).
> - We do not inline large functions (min-cfg-size-treat-functions-
> as-large) too many times (max-times-inline-large).
We hacked directly into inlineCall(), so we should be IPA-calling the
same set of functions, I guess. However, we didn't yet work on the
BifurcateCall mechanism.
Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> You mentioned in a follow-up email that you see a ~5x increase in
> memory usage with your prototype. Do you have a sense of where this
> cost comes from? Is it keeping the summaries themselves?
Partially from storing summaries, and partially from the fact that we
actually do a lot more analysis: with the same max-nodes, we go through
a lot more paths and create heavier nodes, with more stuff in their
states and such. So it's not really correct to compare analysis runs
for summary and inlining with the same max-nodes: by design, summary
would be slower and heavier in that case, while finding deeper defects.
Ted Kremenek via cfe-dev cfe-dev at lists.llvm.org wrote:
> You can possibly trim the summary down substantially by doing a BFS
> from the exit/error nodes of the graph to the entrance nodes. We use
> this trick when displaying trimmed graphs in GraphViz to show a path
> to an error node. That could result in a massive savings.
Yeah, that's one more possible memory usage optimization, thanks for
the idea! Other things that we tried include removing some
summary-irrelevant program state traits in checkEndAnalysis, reclaiming
exploded graph nodes more aggressively than usual once summary
construction is over, and re-using memory allocators across analyses
(most importantly the exploded node allocator; otherwise freeing nodes
doesn't save any memory).
=== Miscellaneous: ===
Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org wrote:
> > Step (c.ii): Actualize constraint keys, intersect "new" and "old"
> > ranges, throw away the branch if intersection is empty, otherwise
> > constrain the actual symbol with the intersected range. As we
> > mentioned on step (a), this requires a separate pass through reaped
> > symbols which are stored separately.
> Does this modify the summary? Or do you make a copy?
We create a new ProgramState based on the original caller-context
state, with new range constraints for the symbols: we either add new
(symbol, range) pairs, if the symbol was not yet constrained, or
replace the range with the computed intersection.
Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> > Step (c.iv): Path-sensitive bug reports are constructed by
> > frankensteining the report path from different pieces of different
> > exploded graphs, be it a report that originates from the callee (and
> > hence thrown on step (c.iii) from eval::SummaryApply), or an
> > intermediate call that exited before the bug (which may be
> > interesting for the bug reporter visitors). The only trouble we
> > currently have is the "Assuming..." pieces, which may be redundant,
> > as no assumption was made, but rather the context determined the
> > branch.
> In some cases it seems like it ought to be possible to query the
> state at “assuming…” to determine whether the context always implies
> that the assumption holds, and elide the piece if it is not needed.
> Does the context always determine the branch? Similarly, can there be
> cases where a path is dead (the branch could not be taken based on
> the actual) but the actualized post-state does not have enough
> information to yield a contradiction? If I understand
> correctly, this is why you keep the ranges of dead symbols around?
> Would you also need to keep checker state for these symbols
> around so they can detect contradictions when applying the summary?
Yeah, this problem can be solved that way! We just didn't yet take up
this task. However, storing ranges for dead symbols is inevitable
anyway, as there may be a lot of significant pre-conditions (e.g. the
function relies on the value of a global variable and then overwrites
it; the value symbol of this variable dies, but we still need to
discard branches if we know something about this symbol in the caller
context).
Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> > - Only default implementations of Store Manager and Constraint
> > Manager are supported;
> By this you mean the RegionStoreManager and RangeConstraintManager,
> right?
Yes, right. We didn't yet think deeply about how exactly to abstract
away from the fact that RangeConstraintManager uses symbol-range pairs
and RegionStoreManager uses base-offset-value triples, i.e. what sort
of virtual functions should be defined in the abstract
ConstraintManager and StoreManager classes in order to stay as flexible
as possible.
Devin Coughlin via cfe-dev cfe-dev at lists.llvm.org wrote:
> > Step (c.v): Actualize the binding keys and values, overwrite
> > existing values. Invalidation, as in writing some SymbolConjured to
> > some memory space, is handled similarly, before all bindings. We do
> > this last, because it would affect actualization (eg.
> > SymbolRegionValue should actualize itself to the value of the
> > region before the call, not after the call; quite a sensitive point
> > here, but seems to work this way).
> Given the region store’s differing representations of concrete and
> symbolic offsets, this sounds like it could be quite tricky. For
> example, does this work for parameter-dependent offsets in the
> summary post-state where in the summary the parameter is symbolic but
> at the call site the parameter is concrete? More generally, I’d love
> to hear more about how you apply summaries to the region store, how
> you identify the frame and the footprint, how you unify symbolic
> variables, etc.
Unfortunately, we did not yet handle symbolic offsets in the store
properly, and some parts of that make a nice problem (it's also
surprising that we didn't yet run into too many issues because of this
stuff).
For concrete offsets, we had to extend the StoreManager::BindingsHandler
to provide complete information about the binding keys, not just the
base. Then we actualize the base of the binding. Then we split the
actualized binding base into its own base and offset, because a
base-region may actualize into a non-base-region. Then we *add* the
offset of the actualized base to the original offset of the binding, and
obtain the complete offset of the actualized binding. Then we actualize
the store value and bind it to the actualized binding key, possibly
replacing an existing binding.
We also skip bindings to stack locals, and discriminate between direct
and default bindings. Another important thing to notice is that when we
actualize all bindings, we use the original store (without any new
bindings) as the reference.
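A minimal sketch of the concrete-offset computation described above, with regions modeled as (base id, byte offset) pairs. RegionKey and actualizeKey are invented names for illustration, not RegionStore's real representation:

```cpp
#include <cassert>
#include <functional>

struct RegionKey {
  int Base;    // id of the base region
  long Offset; // concrete byte offset from the base
};

// 'ActualizeBase' maps a callee base region to a caller region, which
// may itself be a non-base region (i.e. carry a nonzero offset).
// Split the actualized base into its own base and offset, then *add*
// that offset to the binding's original offset to obtain the complete
// offset of the actualized binding.
RegionKey actualizeKey(const RegionKey &CalleeKey,
                       const std::function<RegionKey(int)> &ActualizeBase) {
  RegionKey Act = ActualizeBase(CalleeKey.Base);
  return {Act.Base, Act.Offset + CalleeKey.Offset};
}
```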
For symbolic offsets in the summary, I guess we should have done a
similar thing. Take the binding key, which is a base region and its
sub-region representing a symbolic offset. Then discard the base
region, take only the offset region, and actualize it. Then split the
actualized region into base and offset, which can be either concrete or
symbolic. The actualization procedure automatically replaces a symbolic
index inside any element region with a concrete index if the symbolic
index is constrained to a constant or actualized into a constant.
Additionally, if we have (a+i)+j, where 'a' is an array, '+i' is done
in the caller context, and '+j' is done in the callee context, then the
actualization procedure converts it automatically into 'a+(i+j)' (it
detects such cases when it tries to actualize the super-region of an
ElementRegion and obtains another ElementRegion as an answer), which
allows easily adding symbolic offsets where necessary. Then actualize
the value and bind it to the actual key.
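The (a+i)+j ==> a+(i+j) folding can be sketched like this: a toy model with concrete indices only, where ElemRegion and actualizeElement are invented stand-ins for the analyzer's ElementRegion handling:

```cpp
#include <cassert>

struct ElemRegion {
  int Array;  // id of the underlying array region 'a'
  long Index; // element index relative to Array
};

// When actualizing the super-region of the callee's element region
// 'base[CalleeIdx]' yields another element region (the caller's
// 'a + i'), fold the two indices into one key: (a + i) + j => a + (i + j).
ElemRegion actualizeElement(long CalleeIdx, const ElemRegion &ActualSuper) {
  return {ActualSuper.Array, ActualSuper.Index + CalleeIdx};
}
```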
Well, the real problem arises when we have a concrete offset in the
summary, but the actualized base has a symbolic offset. Then we somehow
need to add an integer to a region, and I've no idea how to do that
without adding some new kind of key into the region store.
Once we tried to solve the actualization of store keys by restoring the
original sub-region from the known base and offset. This almost worked,
but was extremely hacky and inevitably still unreliable, so we
discarded that approach and went for simpler conversions of store keys.
So this is still a point for discussion.
=== Other stuff i forgot to mention earlier: ===
We had to make a modification of the SVal hierarchy in order to make
actualization work. Namely, we introduced an auxiliary atomic symbol
kind, called SymbolRegionAddress, which represents the unknown address
of a non-symbolic region. Why do we need that? Consider an example:
void foo(int *ptr) {
  if (!ptr) {}
}
void bar() {
  int a = 1;
  foo(&a);
}
The summary of the function foo() contains ranges on reg_$0<ptr>. When
we actualize this range, we figure out that we have no good symbol kind
to represent the result, to carry the range, etc. So we call the
resulting symbol "addr_$0<a>". With this, by looking at the kind of the
symbol we receive, we can discard the "ptr == 0" branch of the summary,
even though we never obtain the "&a" as the value of the condition (only
symbols are stored in RangeConstraintManager). Additionally, such a
symbol is extremely convenient when we think of actualization as a
recursive procedure: if we want to actualize &SymRegion{reg_$0<ptr>} to
&a, which is the right thing to do, then to what do we recursively
actualize reg_$0<ptr>? We also implemented the same hack for label
addresses, creating SymbolLabelAddress. Well, this funny implementation
detail is probably worth another discussion and a separate commit;
maybe if I think deeper I can even find inlining-related test cases to
support this feature.
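For illustration, here is a toy model of the actualization fallback just described. SymVal, actualizeSymbol, and branchFeasible are invented names, and string region names stand in for MemRegions; the real implementation works on the analyzer's SVal/SymExpr hierarchy:

```cpp
#include <cassert>
#include <map>
#include <string>

enum class SymKind { RegionValue, RegionAddress };

struct SymVal {
  SymKind Kind;
  std::string Region; // the region the symbol refers to
};

// Maps a callee symbol (e.g. reg_$0<ptr>) to the caller's view of its
// region. If the region actualizes to a concrete, non-symbolic region
// (e.g. the local 'a'), there is no good existing symbol kind to carry
// the ranges, so fall back to an address symbol (addr_$0<a>).
SymVal actualizeSymbol(const SymVal &S,
                       const std::map<std::string, std::string> &RegionMap,
                       bool TargetIsConcrete) {
  std::string ActRegion = RegionMap.at(S.Region);
  if (TargetIsConcrete)
    return {SymKind::RegionAddress, ActRegion};
  return {S.Kind, ActRegion};
}

// The address of a concrete region (&a) is known to be non-null, so a
// summary branch that constrains it to 0 can be discarded outright.
bool branchFeasible(const SymVal &S, bool BranchRequiresNull) {
  if (S.Kind == SymKind::RegionAddress && BranchRequiresNull)
    return false;
  return true;
}
```

This mirrors how, in the foo()/bar() example, looking at the kind of the resulting symbol lets us discard the "ptr == 0" branch of the summary without ever materializing "&a" as the condition value.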