[cfe-dev] [analyzer] Summary IPA thoughts

Thu Apr 7 07:06:24 PDT 2016

Hi!

On 10 November 2015 at 17:50, Artem Dergachev via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> Hello Anna,
>
> Thanks a lot for another detailed reply!
>
> It took a while for us to think through the idea of "common-law"
> summaries, as opposed to our current "civil-law" summaries, and it still
> sounds as an approach probably promising but clearly very complicated, for
> which we do not currently have enough thought put into.
>
> Yeah, the problem of understanding what is essential in the caller state,
> and what parts of it can be simplified out as irrelevant, seems the most
> complicated so far. It is true that using the whole state as a
> pre-condition would result in too many summaries. The very nature of
> symbolic execution assumes that there are only few classes of interesting
> program states.
>
> It seems complicated to use a branch of inlined code as a summary for
> future reference, because alpha-renaming becomes much more complex (it
> would need to understand the context-less meaning of an in-context symbolic
> value, and only then transfer it into another context).
>
> Instead, we'd rather consider creating an auxiliary ExprEngine for summary
> collection, as we do now, and put the current caller state into it "for the
> reference": before taking a branch, ensure that we do not contradict the
> "reference state" after renaming.
>
> Then, yeah, we try to figure out how to handle the differences in the
> store, eg. between
>     (x,0,direct): 0 S32b
>     ...
>     (x,0,direct): 2015 S32b
>     ...
> without creating separate summary branches for each case. Even if the
> execution path seems to be determined only by the fact that, say, reg_$0<x>
> is in range [0, 10], we cannot be sure that it is absolutely irrelevant for
> the checker whether reg_$0<x> is 2 S32b or 3 S32b, until we ask the checker
> (for instance, reg_$0<x> may be some sort of enum or bitfield passed into
> an API call modelled by this checker inside the summary graph, and the
> checker leaves its internal mark in the GDM regarding its value, so that to
> take a completely different path later, but the engine may accidentally
> consider the difference between 2 S32b and 3 S32b irrelevant, because all
> the if's took the same branch so far, and hence ignore important
> information and cause false positives.
>

I think there are ways to mitigate or solve this problem. Some of the
checkers can add assumptions. E.g.: when you divide by a value which is
unknown an assumption such that this value is not zero is added to the
state. Each time a checker does something like this, there is a potential
error case there, so that decision might need to be revised in each new
context. It is also possible to tag states by the checkers. Maybe those
tags could be used to help to decide what to do in these scenarios?

>
> Finally, i do not instantly see how the "problem of unknowns" is solved
> with this approach (as far as i understand, the point was about
> non-concrete values rather than about UnknownVal specifically). In a
> particular "precedent" context, for which the summary is constructed, even
> if it is not the "True" state, some values may still be unknown. And we'd
> still need to record and replay our decisions based on those values when we
> apply the summary branch. Alternatively, we need make the "not knowing"
> some things a part of the pre-condition, but then the checkers still need
> to explain which particular things they should "not know". That is,
> contexts may not only be contradictory, but also they may be more specific
> or less specific, and this also needs to be taken into account, as in these
> cases they are significantly different even if comparable (nested).
>
> =========
>
> Note that we are actually talking about two kinds of support from
> checkers: First is about handling alpha-renaming inside their internal
> structures in the GDM, second is saving-replaying the unknown-value-based
> delayed checks.
>
> However the support for alpha-renaming in checkers doesn't have anything
> to do with the summary. In general, we cannot even support the following
> IPA-free test without making checker API more complicated:
>
>   void foo(FILE *f1, FILE *f2) {
>     if (f1 == f2) {
>       fclose(f1);
>       fclose(f2); // expected-warning{{double close}}
>     }
>   }
>
> We can think of alpha-renaming as a purely constraint-solver-side task,
> and in this case the checker would need to consult the constraint manager
> every time it sees a symbol, to see if it is equal to one of the symbols it
> already possesses (hey reg_$0<f2>, are you sure you are not equal to
> reg_$1<f1>?), which results in a manual O(N) lookup instead of regular
> O(log N) lookup in the GDM maps.
>
> Instead, we can also think of alpha-renaming as a one-time process of
> completely eliminating the symbol that was renamed into another symbol from
> all parts of the program state, including GDM, and making sure it doesn't
> ever reincarnate. Eg., ensure that SValBuilder would not construct the old
> symbolic value again, even if asked directly to do so, but return the new
> symbolic value instead. This approach seems preferable to us, because it
> doesn't require O(N) lookups described above, and all the necessary code is
> put in the single callback.
>
> Note that while an "evalRenaming()" callback is necessary to support
> alpha-renaming, it is not really scary if this callback is not instantly
> implemented in the checker; the checker would just fail in these situations
> as it always did. For summary callbacks, this is not the case.
>

I think this might start to get very hard to track. I would prefer a new
kind of symbol that is going to be the subject of actualization. So we
could assert when a symbol that is not supposed to be actualized is being
renamed, or a such symbol is used after summary application without
actualization.

>
> =========
>
> Regarding the heap and invalidation problem. Our current placeholder
> approach is that whenever the function does at least any global
> invalidation, when applying the summary we run invalidation as if the whole
> function was evaluated conservatively, and then write over the store
> contents that remain in the summary. This approach is overly conservative
> for many reasons. Additionally, it affects alpha-renaming of SymbolDerived.
> In our current approach, SymbolDerived always gets renamed into a
> SymbolDerived with renamed parent symbol and parent region, which is also
> conservative.
>
> I can come up with a much less conservative approach, which assumes
> recording and replaying invalidations. For instance, if in a summary we
> invalidate the whole heap through an unknown pointer, in the caller context
> we may know that this pointer points to a concrete region, and then
> SymbolDerived, which appeared during this invalidation, would resolve into
> a non-derived symbol. It is still more lightweight than replaying all
> stores:
>     - First, we take the store before the first invalidation, and merge
> it;
>     - Then we take the first invalidation and perform it according to the
> context;
>     - Then we take the store before the second invalidation, and merge it;
>     - Then we perfor the second invalidation;
>     - Etc, until we perform all invalidations, and finally take the final
> store and merge it.
>
>

> Then this whole "cascading invalidation" is only good for two things: (1)
> Modelling the final store after the call and (2) Properly renaming
> SymbolDerived - all other symbols are completely unaffected! We need to
> construct the whole invalidation cascade before anything, even before
> ranges, so that ranges on SymbolDerived were proper.
>
> The easiest way i see to explain why SymbolDerived is special is to say
> that symbols are constant in time, yet often tied to particular moments of
> time. Regions are completely "space-like": meaning of a region doesn't
> change in time, though we may know more about the region during the
> analysis (eg. rename a symbolic region into a concrete region or into a
> heap symbolic region). Symbols are, instead, born in a certain moment of
> time, and:
>
> --= unlike the regular alpha-renaming, the context switch for summary
> (which we earlier referred to as "actualization") can be described as a
> time shift, in term of symbol birth times, affecting initially
> unpredictable sets of atomic symbols =--
>
> For example, SymbolRegionValue is the value in the store at the "beginning
> of analysis" time moment, and when we shift this time moment to moment of
> applying the summary, it gets renamed into the value of the store at this
> moment, but it is irrelevant if some sort of invalidation had or had not
> happened after the call-enter event to which it is now tied. SymbolConjured
> is tied to the moment of conjuring, which is recorded in the form of block
> count. Actualization assumes time-shifting this block count.
>
> Similarly, SymbolDerived is tied to the value of the region at the moment
> of time in which invalidation occured, and this moment also needs to be
> shifted to the relevant moment in the invalidation cascade.
>
> We cannot yet predict how resource-heavy the cascading invalidation
> approach would be, and it more or less works with our current dummy
> approach, so we can implement the cascading invalidation, or delay it for
> later patches.
>

Basically the invalidation depends on ordering of events, so it requires
timestamps I guess. Also, when a summary is applied, basically the checker
get a list of exploded nodes and it can replay some of the state changes to
based on the context and the summary. But what happens when the checker
needs to do some invalidation or add some assumption on aliasing (that
affects invalidation)? I have the feeling that each time we want to solve
some of these problems we are getting close and closer to a full replay.

>
> =========
>
> Regarding interaction between inter-dependent checkers. While we didn't
> code this yet, it might be partially naively solved by introducing
> dependency graphs for checkers. Eg. a checker that relies on taint
> propagation may pull the GenericTaintChecker as its dependency, and then
> our checker gets summary-applied after GenericTaintChecker, when the taint
> information is already available. Another minor use case for this feature
> is that if you explicitly run only one path-sensitive checker that only
> registers itself on C++ files, then you can skip pulling core checkers on C
> and ObjC files and run only AST checkers for them, saving analysis time.
>

Dependency graph for checkers would be useful regardless of summary based
IPA. I had to do workarounds several times because the lack of this. But
there are more to these dependencies than checkers depending on each other.
Checkers might also add constraints that the core symbolic execution
depends on.

>
> While we may think of a single pipeline for all summary events (this
> checker marks the file descriptor as open, then that checker marks malloc
> as leaked, then this checker marks this symbol as tainted, then the first
> checker marks the file descriptor as closed, ...), i think it is
> unnecessary for the very same reason: meaning of symbols does not change
> through time, so sounds to me as if it's against the very idea of the
> symbol to code it this way. Though i may be missing something, it's an open
> choice for us.
>
> =========
>
> Below is a roadmap for us now, as we see it. In some cases we need some
> help with choosing between a quick and a full-featured solution.
>
> =1= We're done with a full-featured SValVisitor, similar to StmtVisitor,
> which would be helpful for alpha-renaming, and probably for many other
> things, maybe some checkers would be interested. In order to make sure it
> compiles, i made a simple "SVal explainer":
>
>   int glob;
>   void test(int param, void *ptr) {
>     clang_analyzer_explain(&glob); // expected-warning{{pointer to
> variable 'glob'}}
>     clang_analyzer_explain(param); // expected-warning{{initial value of
> variable 'param'}}
>     clang_analyzer_explain(ptr); // expected-warning{{pointee of initial
> value of variable 'ptr'}}
>     if (param == 42)
>       clang_analyzer_explain(param); // expected-warning{{signed 32-bit
> integer '42'}}
>   }
>
> This also may be useful if the checker wants to refer to a value in the
> warning message in a user-friendly manner. If anybody likes something like
> this as a separate commit, i can open a review.
>

Just an update, as far as I remember this is already committed.

>
> =2= Then we start with alpha-renaming, and we've got the following
> problems to deal with:
>     - Decide if we want to keep renaming inside the constrant manager or
> let it work through destroying symbols, as described in this letter;
>     - Decide how to work with renaming of concrete offsets in store keys
> into symbolic+concrete offsets, as described in the previous letter
> http://lists.llvm.org/pipermail/cfe-dev/2015-October/045690.html (the
> last part of the miscellaneous section). Possible solutions include making
> store keys heavier or reconstructing original regions by offsets (tried
> this, very ugly and unreliable, imho; the former solution would also solve
> the issue pointed out in
> <http://lists.llvm.org/pipermail/cfe-dev/2015-October/045704.html>
> http://lists.llvm.org/pipermail/cfe-dev/2015-October/045704.html).
> Solving this by recording-and-replaying all stores is something we'd try to
> avoid as best as we can, i guess.
>

What is the status of this one? Do you have a patch under review?

>
> =3= Decide if we want context-sensitive or context-insensitive summaries,
> as discussed above in this letter.
>

Is there a consensus on this one?

>
> =4= If context-insensitive approach is chosen, implement summaries for
> checkers, solve interdependency problems, as discussed above in this
> letter.
>

> =5= Implement store actualization and invalidation, either through
> simple-conservative approach, or via "cascading" invalidation, as described
> above in this letter.
>

Lets try to summarize what are the tasks:
- Summary apply callback for checkers
- Solve the interdependency problem
- Update all of the checkers?
- Timestamps to solve some aliasing related problems
- Filter confusing assuming.. stuff from path reconstruction
- Figure out how to do leak related checks properly
- Figure out how to do invalidation properly
- Anything I missed?

>
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20160407/1678a4fa/attachment.html>