[cfe-dev] [analyzer] Summary IPA thoughts
Artem Dergachev via cfe-dev
cfe-dev at lists.llvm.org
Tue Nov 10 08:50:55 PST 2015
Hello Anna,
Thanks a lot for another detailed reply!
It took a while for us to think through the idea of "common-law"
summaries, as opposed to our current "civil-law" summaries, and it still
sounds as an approach probably promising but clearly very complicated,
for which we do not currently have enough thought put into.
Yeah, the problem of understanding what is essential in the caller
state, and what parts of it can be simplified out as irrelevant, seems
the most complicated so far. It is true that using the whole state as a
pre-condition would result in too many summaries. The very nature of
symbolic execution assumes that there are only few classes of
interesting program states.
It seems complicated to use a branch of inlined code as a summary for
future reference, because alpha-renaming becomes much more complex (it
would need to understand the context-less meaning of an in-context
symbolic value, and only then transfer it into another context).
Instead, we'd rather consider creating an auxiliary ExprEngine for
summary collection, as we do now, and put the current caller state into
it "for the reference": before taking a branch, ensure that we do not
contradict the "reference state" after renaming.
Then, yeah, we try to figure out how to handle the differences in the
store, eg. between
(x,0,direct): 0 S32b
...
(x,0,direct): 2015 S32b
...
without creating separate summary branches for each case. Even if the
execution path seems to be determined only by the fact that, say,
reg_$0<x> is in range [0, 10], we cannot be sure that it is absolutely
irrelevant for the checker whether reg_$0<x> is 2 S32b or 3 S32b, until
we ask the checker (for instance, reg_$0<x> may be some sort of enum or
bitfield passed into an API call modelled by this checker inside the
summary graph, and the checker leaves its internal mark in the GDM
regarding its value, so that to take a completely different path later,
but the engine may accidentally consider the difference between 2 S32b
and 3 S32b irrelevant, because all the if's took the same branch so far,
and hence ignore important information and cause false positives.
Finally, i do not instantly see how the "problem of unknowns" is solved
with this approach (as far as i understand, the point was about
non-concrete values rather than about UnknownVal specifically). In a
particular "precedent" context, for which the summary is constructed,
even if it is not the "True" state, some values may still be unknown.
And we'd still need to record and replay our decisions based on those
values when we apply the summary branch. Alternatively, we need make the
"not knowing" some things a part of the pre-condition, but then the
checkers still need to explain which particular things they should "not
know". That is, contexts may not only be contradictory, but also they
may be more specific or less specific, and this also needs to be taken
into account, as in these cases they are significantly different even if
comparable (nested).
=========
Note that we are actually talking about two kinds of support from
checkers: First is about handling alpha-renaming inside their internal
structures in the GDM, second is saving-replaying the
unknown-value-based delayed checks.
However the support for alpha-renaming in checkers doesn't have anything
to do with the summary. In general, we cannot even support the following
IPA-free test without making checker API more complicated:
void foo(FILE *f1, FILE *f2) {
if (f1 == f2) {
fclose(f1);
fclose(f2); // expected-warning{{double close}}
}
}
We can think of alpha-renaming as a purely constraint-solver-side task,
and in this case the checker would need to consult the constraint
manager every time it sees a symbol, to see if it is equal to one of the
symbols it already possesses (hey reg_$0<f2>, are you sure you are not
equal to reg_$1<f1>?), which results in a manual O(N) lookup instead of
regular O(log N) lookup in the GDM maps.
Instead, we can also think of alpha-renaming as a one-time process of
completely eliminating the symbol that was renamed into another symbol
from all parts of the program state, including GDM, and making sure it
doesn't ever reincarnate. Eg., ensure that SValBuilder would not
construct the old symbolic value again, even if asked directly to do so,
but return the new symbolic value instead. This approach seems
preferable to us, because it doesn't require O(N) lookups described
above, and all the necessary code is put in the single callback.
Note that while an "evalRenaming()" callback is necessary to support
alpha-renaming, it is not really scary if this callback is not instantly
implemented in the checker; the checker would just fail in these
situations as it always did. For summary callbacks, this is not the case.
=========
Regarding the heap and invalidation problem. Our current placeholder
approach is that whenever the function does at least any global
invalidation, when applying the summary we run invalidation as if the
whole function was evaluated conservatively, and then write over the
store contents that remain in the summary. This approach is overly
conservative for many reasons. Additionally, it affects alpha-renaming
of SymbolDerived. In our current approach, SymbolDerived always gets
renamed into a SymbolDerived with renamed parent symbol and parent
region, which is also conservative.
I can come up with a much less conservative approach, which assumes
recording and replaying invalidations. For instance, if in a summary we
invalidate the whole heap through an unknown pointer, in the caller
context we may know that this pointer points to a concrete region, and
then SymbolDerived, which appeared during this invalidation, would
resolve into a non-derived symbol. It is still more lightweight than
replaying all stores:
- First, we take the store before the first invalidation, and merge
it;
- Then we take the first invalidation and perform it according to
the context;
- Then we take the store before the second invalidation, and merge it;
- Then we perfor the second invalidation;
- Etc, until we perform all invalidations, and finally take the
final store and merge it.
Then this whole "cascading invalidation" is only good for two things:
(1) Modelling the final store after the call and (2) Properly renaming
SymbolDerived - all other symbols are completely unaffected! We need to
construct the whole invalidation cascade before anything, even before
ranges, so that ranges on SymbolDerived were proper.
The easiest way i see to explain why SymbolDerived is special is to say
that symbols are constant in time, yet often tied to particular moments
of time. Regions are completely "space-like": meaning of a region
doesn't change in time, though we may know more about the region during
the analysis (eg. rename a symbolic region into a concrete region or
into a heap symbolic region). Symbols are, instead, born in a certain
moment of time, and:
--= unlike the regular alpha-renaming, the context switch for summary
(which we earlier referred to as "actualization") can be described as a
time shift, in term of symbol birth times, affecting initially
unpredictable sets of atomic symbols =--
For example, SymbolRegionValue is the value in the store at the
"beginning of analysis" time moment, and when we shift this time moment
to moment of applying the summary, it gets renamed into the value of the
store at this moment, but it is irrelevant if some sort of invalidation
had or had not happened after the call-enter event to which it is now
tied. SymbolConjured is tied to the moment of conjuring, which is
recorded in the form of block count. Actualization assumes time-shifting
this block count.
Similarly, SymbolDerived is tied to the value of the region at the
moment of time in which invalidation occured, and this moment also needs
to be shifted to the relevant moment in the invalidation cascade.
We cannot yet predict how resource-heavy the cascading invalidation
approach would be, and it more or less works with our current dummy
approach, so we can implement the cascading invalidation, or delay it
for later patches.
=========
Regarding interaction between inter-dependent checkers. While we didn't
code this yet, it might be partially naively solved by introducing
dependency graphs for checkers. Eg. a checker that relies on taint
propagation may pull the GenericTaintChecker as its dependency, and then
our checker gets summary-applied after GenericTaintChecker, when the
taint information is already available. Another minor use case for this
feature is that if you explicitly run only one path-sensitive checker
that only registers itself on C++ files, then you can skip pulling core
checkers on C and ObjC files and run only AST checkers for them, saving
analysis time.
While we may think of a single pipeline for all summary events (this
checker marks the file descriptor as open, then that checker marks
malloc as leaked, then this checker marks this symbol as tainted, then
the first checker marks the file descriptor as closed, ...), i think it
is unnecessary for the very same reason: meaning of symbols does not
change through time, so sounds to me as if it's against the very idea of
the symbol to code it this way. Though i may be missing something, it's
an open choice for us.
=========
Below is a roadmap for us now, as we see it. In some cases we need some
help with choosing between a quick and a full-featured solution.
=1= We're done with a full-featured SValVisitor, similar to StmtVisitor,
which would be helpful for alpha-renaming, and probably for many other
things, maybe some checkers would be interested. In order to make sure
it compiles, i made a simple "SVal explainer":
int glob;
void test(int param, void *ptr) {
clang_analyzer_explain(&glob); // expected-warning{{pointer to
variable 'glob'}}
clang_analyzer_explain(param); // expected-warning{{initial value
of variable 'param'}}
clang_analyzer_explain(ptr); // expected-warning{{pointee of
initial value of variable 'ptr'}}
if (param == 42)
clang_analyzer_explain(param); // expected-warning{{signed 32-bit
integer '42'}}
}
This also may be useful if the checker wants to refer to a value in the
warning message in a user-friendly manner. If anybody likes something
like this as a separate commit, i can open a review.
=2= Then we start with alpha-renaming, and we've got the following
problems to deal with:
- Decide if we want to keep renaming inside the constrant manager
or let it work through destroying symbols, as described in this letter;
- Decide how to work with renaming of concrete offsets in store
keys into symbolic+concrete offsets, as described in the previous letter
http://lists.llvm.org/pipermail/cfe-dev/2015-October/045690.html (the
last part of the miscellaneous section). Possible solutions include
making store keys heavier or reconstructing original regions by offsets
(tried this, very ugly and unreliable, imho; the former solution would
also solve the issue pointed out in
http://lists.llvm.org/pipermail/cfe-dev/2015-October/045704.html).
Solving this by recording-and-replaying all stores is something we'd try
to avoid as best as we can, i guess.
=3= Decide if we want context-sensitive or context-insensitive
summaries, as discussed above in this letter.
=4= If context-insensitive approach is chosen, implement summaries for
checkers, solve interdependency problems, as discussed above in this
letter.
=5= Implement store actualization and invalidation, either through
simple-conservative approach, or via "cascading" invalidation, as
described above in this letter.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20151110/a404f78f/attachment.html>
More information about the cfe-dev
mailing list