<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-text-flowed" style="font-size: 16px;"
lang="x-unicode">Hello Anna,<br>
<br>
Thanks a lot for another detailed reply!<br>
<br>
It took a while for us to think through the idea of "common-law"
summaries, as opposed to our current "civil-law" summaries, and it
still sounds as an approach probably promising but clearly very
complicated, for which we do not currently have enough thought put
into.
<br>
<br>
Yeah, the problem of understanding what is essential in the caller
state, and what parts of it can be simplified out as irrelevant,
seems the most complicated so far. It is true that using the whole
state as a pre-condition would result in too many summaries. The
very nature of symbolic execution assumes that there are only few
classes of interesting program states.
<br>
<br>
It seems complicated to use a branch of inlined code as a summary
for future reference, because alpha-renaming becomes much more
complex (it would need to understand the context-less meaning of
an in-context symbolic value, and only then transfer it into
another context).
<br>
<br>
Instead, we'd rather consider creating an auxiliary ExprEngine for
summary collection, as we do now, and put the current caller state
into it "for the reference": before taking a branch, ensure that
we do not contradict the "reference state" after renaming.
<br>
<br>
Then, yeah, we try to figure out how to handle the differences in
the store, eg. between
<br>
(x,0,direct): 0 S32b
<br>
...
<br>
(x,0,direct): 2015 S32b
<br>
...
<br>
without creating separate summary branches for each case. Even if
the execution path seems to be determined only by the fact that,
say, reg_$0<x> is in range [0, 10], we cannot be sure that
it is absolutely irrelevant for the checker whether
reg_$0<x> is 2 S32b or 3 S32b, until we ask the checker (for
instance, reg_$0<x> may be some sort of enum or bitfield
passed into an API call modelled by this checker inside the
summary graph, and the checker leaves its internal mark in the GDM
regarding its value, so that to take a completely different path
later, but the engine may accidentally consider the difference
between 2 S32b and 3 S32b irrelevant, because all the if's took
the same branch so far, and hence ignore important information and
cause false positives.
<br>
<br>
Finally, i do not instantly see how the "problem of unknowns" is
solved with this approach (as far as i understand, the point was
about non-concrete values rather than about UnknownVal
specifically). In a particular "precedent" context, for which the
summary is constructed, even if it is not the "True" state, some
values may still be unknown. And we'd still need to record and
replay our decisions based on those values when we apply the
summary branch. Alternatively, we need make the "not knowing" some
things a part of the pre-condition, but then the checkers still
need to explain which particular things they should "not know".
That is, contexts may not only be contradictory, but also they may
be more specific or less specific, and this also needs to be taken
into account, as in these cases they are significantly different
even if comparable (nested).<br>
<br>
=========
<br>
<br>
Note that we are actually talking about two kinds of support from
checkers: First is about handling alpha-renaming inside their
internal structures in the GDM, second is saving-replaying the
unknown-value-based delayed checks.
<br>
<br>
However the support for alpha-renaming in checkers doesn't have
anything to do with the summary. In general, we cannot even
support the following IPA-free test without making checker API
more complicated:
<br>
<br>
void foo(FILE *f1, FILE *f2) {
<br>
if (f1 == f2) {
<br>
fclose(f1);
<br>
fclose(f2); // expected-warning{{double close}}
<br>
}
<br>
}
<br>
<br>
We can think of alpha-renaming as a purely constraint-solver-side
task, and in this case the checker would need to consult the
constraint manager every time it sees a symbol, to see if it is
equal to one of the symbols it already possesses (hey
reg_$0<f2>, are you sure you are not equal to
reg_$1<f1>?), which results in a manual O(N) lookup instead
of regular O(log N) lookup in the GDM maps.
<br>
<br>
Instead, we can also think of alpha-renaming as a one-time process
of completely eliminating the symbol that was renamed into another
symbol from all parts of the program state, including GDM, and
making sure it doesn't ever reincarnate. Eg., ensure that
SValBuilder would not construct the old symbolic value again, even
if asked directly to do so, but return the new symbolic value
instead. This approach seems preferable to us, because it doesn't
require O(N) lookups described above, and all the necessary code
is put in the single callback.
<br>
<br>
Note that while an "evalRenaming()" callback is necessary to
support alpha-renaming, it is not really scary if this callback is
not instantly implemented in the checker; the checker would just
fail in these situations as it always did. For summary callbacks,
this is not the case.
<br>
<br>
=========
<br>
<br>
Regarding the heap and invalidation problem. Our current
placeholder approach is that whenever the function does at least
any global invalidation, when applying the summary we run
invalidation as if the whole function was evaluated
conservatively, and then write over the store contents that remain
in the summary. This approach is overly conservative for many
reasons. Additionally, it affects alpha-renaming of SymbolDerived.
In our current approach, SymbolDerived always gets renamed into a
SymbolDerived with renamed parent symbol and parent region, which
is also conservative.
<br>
<br>
I can come up with a much less conservative approach, which
assumes recording and replaying invalidations. For instance, if in
a summary we invalidate the whole heap through an unknown pointer,
in the caller context we may know that this pointer points to a
concrete region, and then SymbolDerived, which appeared during
this invalidation, would resolve into a non-derived symbol. It is
still more lightweight than replaying all stores:
<br>
- First, we take the store before the first invalidation, and
merge it;
<br>
- Then we take the first invalidation and perform it according
to the context;
<br>
- Then we take the store before the second invalidation, and
merge it;
<br>
- Then we perfor the second invalidation;
<br>
- Etc, until we perform all invalidations, and finally take
the final store and merge it.
<br>
<br>
Then this whole "cascading invalidation" is only good for two
things: (1) Modelling the final store after the call and (2)
Properly renaming SymbolDerived - all other symbols are completely
unaffected! We need to construct the whole invalidation cascade
before anything, even before ranges, so that ranges on
SymbolDerived were proper.
<br>
<br>
The easiest way i see to explain why SymbolDerived is special is
to say that symbols are constant in time, yet often tied to
particular moments of time. Regions are completely "space-like":
meaning of a region doesn't change in time, though we may know
more about the region during the analysis (eg. rename a symbolic
region into a concrete region or into a heap symbolic region).
Symbols are, instead, born in a certain moment of time, and:
<br>
<br>
--= unlike the regular alpha-renaming, the context switch for
summary (which we earlier referred to as "actualization") can be
described as a time shift, in term of symbol birth times,
affecting initially unpredictable sets of atomic symbols =--
<br>
<br>
For example, SymbolRegionValue is the value in the store at the
"beginning of analysis" time moment, and when we shift this time
moment to moment of applying the summary, it gets renamed into the
value of the store at this moment, but it is irrelevant if some
sort of invalidation had or had not happened after the call-enter
event to which it is now tied. SymbolConjured is tied to the
moment of conjuring, which is recorded in the form of block count.
Actualization assumes time-shifting this block count.
<br>
<br>
Similarly, SymbolDerived is tied to the value of the region at the
moment of time in which invalidation occured, and this moment also
needs to be shifted to the relevant moment in the invalidation
cascade.
<br>
<br>
We cannot yet predict how resource-heavy the cascading
invalidation approach would be, and it more or less works with our
current dummy approach, so we can implement the cascading
invalidation, or delay it for later patches.
<br>
<br>
=========
<br>
<br>
Regarding interaction between inter-dependent checkers. While we
didn't code this yet, it might be partially naively solved by
introducing dependency graphs for checkers. Eg. a checker that
relies on taint propagation may pull the GenericTaintChecker as
its dependency, and then our checker gets summary-applied after
GenericTaintChecker, when the taint information is already
available. Another minor use case for this feature
is that if you explicitly run only one path-sensitive checker that
only registers itself on C++ files, then you can skip pulling core
checkers on C and ObjC files and run only AST checkers for them,
saving analysis time.
<br>
<br>
While we may think of a single pipeline for all summary events
(this checker marks the file descriptor as open, then that checker
marks malloc as leaked, then this checker marks this symbol as
tainted, then the first checker marks the file descriptor as
closed, ...), i think it is unnecessary for the very same reason:
meaning of symbols does not change through time, so sounds to me
as if it's against the very idea of the symbol to code it this
way. Though i may be missing something, it's an open choice for
us.
<br>
<br>
=========
<br>
<br>
Below is a roadmap for us now, as we see it. In some cases we need
some help with choosing between a quick and a full-featured
solution.<br>
<br>
=1= We're done with a full-featured SValVisitor, similar to
StmtVisitor, which would be helpful for alpha-renaming, and
probably for many other things, maybe some checkers would be
interested. In order to make sure it compiles, i made a simple
"SVal explainer":
<br>
<br>
int glob;
<br>
void test(int param, void *ptr) {
<br>
clang_analyzer_explain(&glob); //
expected-warning{{pointer to variable 'glob'}}
<br>
clang_analyzer_explain(param); // expected-warning{{initial
value of variable 'param'}}
<br>
clang_analyzer_explain(ptr); // expected-warning{{pointee of
initial value of variable 'ptr'}}
<br>
if (param == 42)
<br>
clang_analyzer_explain(param); // expected-warning{{signed
32-bit integer '42'}}
<br>
}
<br>
<br>
This also may be useful if the checker wants to refer to a value
in the warning message in a user-friendly manner. If anybody likes
something like this as a separate commit, i can open a review.<br>
<br>
=2= Then we start with alpha-renaming, and we've got the following
problems to deal with:
<br>
- Decide if we want to keep renaming inside the constrant
manager or let it work through destroying symbols, as described in
this letter;
<br>
- Decide how to work with renaming of concrete offsets in
store keys into symbolic+concrete offsets, as described in the
previous letter <a class="moz-txt-link-freetext"
href="http://lists.llvm.org/pipermail/cfe-dev/2015-October/045690.html">http://lists.llvm.org/pipermail/cfe-dev/2015-October/045690.html</a>
(the last part of the miscellaneous section). Possible solutions
include making store keys heavier or reconstructing original
regions by offsets (tried this, very ugly and unreliable, imho;
the former solution would also solve the issue pointed out in <a
class="moz-txt-link-freetext"
href="http://lists.llvm.org/pipermail/cfe-dev/2015-October/045704.html"><a class="moz-txt-link-freetext" href="http://lists.llvm.org/pipermail/cfe-dev/2015-October/045704.html">http://lists.llvm.org/pipermail/cfe-dev/2015-October/045704.html</a></a>).
Solving
this by recording-and-replaying all stores is something we'd try
to avoid as best as we can, i guess.<br>
<br>
=3= Decide if we want context-sensitive or context-insensitive
summaries, as discussed above in this letter.
<br>
<br>
=4= If context-insensitive approach is chosen, implement summaries
for checkers, solve interdependency problems, as discussed above
in this letter.
<br>
<br>
=5= Implement store actualization and invalidation, either through
simple-conservative approach, or via "cascading" invalidation, as
described above in this letter.
<br>
<br>
</div>
</body>
</html>