[cfe-dev] [analyzer] Summary IPA thoughts

Tue Nov 10 08:50:55 PST 2015

Hello Anna,

Thanks a lot for another detailed reply!

It took a while for us to think through the idea of "common-law" 
summaries, as opposed to our current "civil-law" summaries, and it still 
sounds as an approach probably promising but clearly very complicated, 
for which we do not currently have enough thought put into.

Yeah, the problem of understanding what is essential in the caller 
state, and what parts of it can be simplified out as irrelevant, seems 
the most complicated so far. It is true that using the whole state as a 
pre-condition would result in too many summaries. The very nature of 
symbolic execution assumes that there are only few classes of 
interesting program states.

It seems complicated to use a branch of inlined code as a summary for 
future reference, because alpha-renaming becomes much more complex (it 
would need to understand the context-less meaning of an in-context 
symbolic value, and only then transfer it into another context).

Instead, we'd rather consider creating an auxiliary ExprEngine for 
summary collection, as we do now, and put the current caller state into 
it "for the reference": before taking a branch, ensure that we do not 
contradict the "reference state" after renaming.

Then, yeah, we try to figure out how to handle the differences in the 
store, eg. between
     (x,0,direct): 0 S32b
     ...
     (x,0,direct): 2015 S32b
     ...
without creating separate summary branches for each case. Even if the 
execution path seems to be determined only by the fact that, say, 
reg_$0<x> is in range [0, 10], we cannot be sure that it is absolutely 
irrelevant for the checker whether reg_$0<x> is 2 S32b or 3 S32b, until 
we ask the checker (for instance, reg_$0<x> may be some sort of enum or 
bitfield passed into an API call modelled by this checker inside the 
summary graph, and the checker leaves its internal mark in the GDM 
regarding its value, so that to take a completely different path later, 
but the engine may accidentally consider the difference between 2 S32b 
and 3 S32b irrelevant, because all the if's took the same branch so far, 
and hence ignore important information and cause false positives.

Finally, i do not instantly see how the "problem of unknowns" is solved 
with this approach (as far as i understand, the point was about 
non-concrete values rather than about UnknownVal specifically). In a 
particular "precedent" context, for which the summary is constructed, 
even if it is not the "True" state, some values may still be unknown. 
And we'd still need to record and replay our decisions based on those 
values when we apply the summary branch. Alternatively, we need make the 
"not knowing" some things a part of the pre-condition, but then the 
checkers still need to explain which particular things they should "not 
know". That is, contexts may not only be contradictory, but also they 
may be more specific or less specific, and this also needs to be taken 
into account, as in these cases they are significantly different even if 
comparable (nested).

=========

Note that we are actually talking about two kinds of support from 
checkers: First is about handling alpha-renaming inside their internal 
structures in the GDM, second is saving-replaying the 
unknown-value-based delayed checks.

However the support for alpha-renaming in checkers doesn't have anything 
to do with the summary. In general, we cannot even support the following 
IPA-free test without making checker API more complicated:

   void foo(FILE *f1, FILE *f2) {
     if (f1 == f2) {
       fclose(f1);
       fclose(f2); // expected-warning{{double close}}
     }
   }

We can think of alpha-renaming as a purely constraint-solver-side task, 
and in this case the checker would need to consult the constraint 
manager every time it sees a symbol, to see if it is equal to one of the 
symbols it already possesses (hey reg_$0<f2>, are you sure you are not 
equal to reg_$1<f1>?), which results in a manual O(N) lookup instead of 
regular O(log N) lookup in the GDM maps.

Instead, we can also think of alpha-renaming as a one-time process of 
completely eliminating the symbol that was renamed into another symbol 
from all parts of the program state, including GDM, and making sure it 
doesn't ever reincarnate. Eg., ensure that SValBuilder would not 
construct the old symbolic value again, even if asked directly to do so, 
but return the new symbolic value instead. This approach seems 
preferable to us, because it doesn't require O(N) lookups described 
above, and all the necessary code is put in the single callback.

Note that while an "evalRenaming()" callback is necessary to support 
alpha-renaming, it is not really scary if this callback is not instantly 
implemented in the checker; the checker would just fail in these 
situations as it always did. For summary callbacks, this is not the case.

=========

Regarding the heap and invalidation problem. Our current placeholder 
approach is that whenever the function does at least any global 
invalidation, when applying the summary we run invalidation as if the 
whole function was evaluated conservatively, and then write over the 
store contents that remain in the summary. This approach is overly 
conservative for many reasons. Additionally, it affects alpha-renaming 
of SymbolDerived. In our current approach, SymbolDerived always gets 
renamed into a SymbolDerived with renamed parent symbol and parent 
region, which is also conservative.

I can come up with a much less conservative approach, which assumes 
recording and replaying invalidations. For instance, if in a summary we 
invalidate the whole heap through an unknown pointer, in the caller 
context we may know that this pointer points to a concrete region, and 
then SymbolDerived, which appeared during this invalidation, would 
resolve into a non-derived symbol. It is still more lightweight than 
replaying all stores:
     - First, we take the store before the first invalidation, and merge 
it;
     - Then we take the first invalidation and perform it according to 
the context;
     - Then we take the store before the second invalidation, and merge it;
     - Then we perfor the second invalidation;
     - Etc, until we perform all invalidations, and finally take the 
final store and merge it.

Then this whole "cascading invalidation" is only good for two things: 
(1) Modelling the final store after the call and (2) Properly renaming 
SymbolDerived - all other symbols are completely unaffected! We need to 
construct the whole invalidation cascade before anything, even before 
ranges, so that ranges on SymbolDerived were proper.

The easiest way i see to explain why SymbolDerived is special is to say 
that symbols are constant in time, yet often tied to particular moments 
of time. Regions are completely "space-like": meaning of a region 
doesn't change in time, though we may know more about the region during 
the analysis (eg. rename a symbolic region into a concrete region or 
into a heap symbolic region). Symbols are, instead, born in a certain 
moment of time, and:

--= unlike the regular alpha-renaming, the context switch for summary 
(which we earlier referred to as "actualization") can be described as a 
time shift, in term of symbol birth times, affecting initially 
unpredictable sets of atomic symbols =--

For example, SymbolRegionValue is the value in the store at the 
"beginning of analysis" time moment, and when we shift this time moment 
to moment of applying the summary, it gets renamed into the value of the 
store at this moment, but it is irrelevant if some sort of invalidation 
had or had not happened after the call-enter event to which it is now 
tied. SymbolConjured is tied to the moment of conjuring, which is 
recorded in the form of block count. Actualization assumes time-shifting 
this block count.

Similarly, SymbolDerived is tied to the value of the region at the 
moment of time in which invalidation occured, and this moment also needs 
to be shifted to the relevant moment in the invalidation cascade.

We cannot yet predict how resource-heavy the cascading invalidation 
approach would be, and it more or less works with our current dummy 
approach, so we can implement the cascading invalidation, or delay it 
for later patches.

=========

Regarding interaction between inter-dependent checkers. While we didn't 
code this yet, it might be partially naively solved by introducing 
dependency graphs for checkers. Eg. a checker that relies on taint 
propagation may pull the GenericTaintChecker as its dependency, and then 
our checker gets summary-applied after GenericTaintChecker, when the 
taint information is already available. Another minor use case for this 
feature is that if you explicitly run only one path-sensitive checker 
that only registers itself on C++ files, then you can skip pulling core 
checkers on C and ObjC files and run only AST checkers for them, saving 
analysis time.

While we may think of a single pipeline for all summary events (this 
checker marks the file descriptor as open, then that checker marks 
malloc as leaked, then this checker marks this symbol as tainted, then 
the first checker marks the file descriptor as closed, ...), i think it 
is unnecessary for the very same reason: meaning of symbols does not 
change through time, so sounds to me as if it's against the very idea of 
the symbol to code it this way. Though i may be missing something, it's 
an open choice for us.

=========

Below is a roadmap for us now, as we see it. In some cases we need some 
help with choosing between a quick and a full-featured solution.

=1= We're done with a full-featured SValVisitor, similar to StmtVisitor, 
which would be helpful for alpha-renaming, and probably for many other 
things, maybe some checkers would be interested. In order to make sure 
it compiles, i made a simple "SVal explainer":

   int glob;
   void test(int param, void *ptr) {
     clang_analyzer_explain(&glob); // expected-warning{{pointer to 
variable 'glob'}}
     clang_analyzer_explain(param); // expected-warning{{initial value 
of variable 'param'}}
     clang_analyzer_explain(ptr); // expected-warning{{pointee of 
initial value of variable 'ptr'}}
     if (param == 42)
       clang_analyzer_explain(param); // expected-warning{{signed 32-bit 
integer '42'}}
   }

This also may be useful if the checker wants to refer to a value in the 
warning message in a user-friendly manner. If anybody likes something 
like this as a separate commit, i can open a review.

=2= Then we start with alpha-renaming, and we've got the following 
problems to deal with:
     - Decide if we want to keep renaming inside the constrant manager 
or let it work through destroying symbols, as described in this letter;
     - Decide how to work with renaming of concrete offsets in store 
keys into symbolic+concrete offsets, as described in the previous letter 
http://lists.llvm.org/pipermail/cfe-dev/2015-October/045690.html (the 
last part of the miscellaneous section). Possible solutions include 
making store keys heavier or reconstructing original regions by offsets 
(tried this, very ugly and unreliable, imho; the former solution would 
also solve the issue pointed out in 
http://lists.llvm.org/pipermail/cfe-dev/2015-October/045704.html). 
Solving this by recording-and-replaying all stores is something we'd try 
to avoid as best as we can, i guess.

=3= Decide if we want context-sensitive or context-insensitive 
summaries, as discussed above in this letter.

=4= If context-insensitive approach is chosen, implement summaries for 
checkers, solve interdependency problems, as discussed above in this 
letter.

=5= Implement store actualization and invalidation, either through 
simple-conservative approach, or via "cascading" invalidation, as 
described above in this letter.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20151110/a404f78f/attachment.html>