[cfe-dev] [analyzer] Summary IPA thoughts

Fri Apr 8 02:39:38 PDT 2016

Hello,

Just before i forget - timestamps-like problems arise not only in 
summary, but also in generic alpha-renaming. Consider:

int x;
void f1(int *y) {
   x = 1;
   *y = 2;
   if (y == &x) {
     // We need to merge the two store bindings:
     //   (x, 0, default): 1 S32b
     //   (SymRegion{reg_$0<y>}, 0, default): 2 S32b;
     // into one, and it'd better be this one:
     //   (x, 0, default): 2 S32b
   }
}

struct S { int x, y; }
void f2(struct S *s, struct S *t) {
   s->x = 1;
   t->y = 2;
   if (s == t) {
     // But here we need to merge the base regions
     // keeping both bindings alive:
     //   (SymRegion{reg_$0<s>}, 0, direct): 1;
     //   (SymRegion{reg_$0<s>}, 32, direct): 2;
     // I guess i'd leave LazyCompoundVal cases
     // as a simple excercise for the reader.
   }
}

void f3(FILE *fp1, FILE *fp2) {
   fclose(fp1);
   fclose(fp2);
   if (fp1 == fp2) {
     // Optionally throw check-after-use double-close?
     // Ignoring wouldn't hurt though.
   }
}

void f4(pthread_mutex_t *m, pthread_mutex_t *n) {
   pthread_mutex_lock(m);
   pthread_mutex_unlock(n);
   pthread_mutex_lock(m);
   pthread_mutex_trylock(n);
   // etc.
   if (m == n) {
     // Should we reconstruct the whole sequence
     // of locks and unlocks by timestamps here?
     // In order to see if there's double-anything
     // or the branch is unreachable or whatever.
     // Data map should be responsible this time.
   }
}

void f5(int *x, int *y) {
   if (*x == 1)
     if (*y == 2)
       if (x == y) {
         // And this time we surely need to sink.
         // Constraint manager should detect it.
       }
}

void f6(char *x, char *y) {
   if (strlen(x) == 3)
     if (strlen(y) == 4)
       if (x == y) {
         // Sink here as well.
         // Data map should detect it.
       }
}

On 07.04.2016 17:06, Gábor Horváth wrote:
>
>     Then, yeah, we try to figure out how to handle the differences in
>     the store, eg. between
>         (x,0,direct): 0 S32b
>         ...
>         (x,0,direct): 2015 S32b
>         ...
>     without creating separate summary branches for each case. Even if
>     the execution path seems to be determined only by the fact that,
>     say, reg_$0<x> is in range [0, 10], we cannot be sure that it is
>     absolutely irrelevant for the checker whether reg_$0<x> is 2 S32b
>     or 3 S32b, until we ask the checker (for instance, reg_$0<x> may
>     be some sort of enum or bitfield passed into an API call modelled
>     by this checker inside the summary graph, and the checker leaves
>     its internal mark in the GDM regarding its value, so that to take
>     a completely different path later, but the engine may accidentally
>     consider the difference between 2 S32b and 3 S32b irrelevant,
>     because all the if's took the same branch so far, and hence ignore
>     important information and cause false positives.
>
>
> I think there are ways to mitigate or solve this problem. Some of the 
> checkers can add assumptions. E.g.: when you divide by a value which 
> is unknown an assumption such that this value is not zero is added to 
> the state. Each time a checker does something like this, there is a 
> potential error case there, so that decision might need to be revised 
> in each new context. It is also possible to tag states by the 
> checkers. Maybe those tags could be used to help to decide what to do 
> in these scenarios?

Yeah, there are a lot of different APIs we can provide, which do not 
explicitly ask the checker to support summary, but instead ask it to 
provide more info to the core on what he's doing. I'd delay this 
decision until we settle with how the data map should look like. Though 
- unlike the examples above - i've still not seeing how exactly it'd 
work and what kind of API we'd need, keeping in mind that it shouldn't 
restrict the fantasy of checker authors too much.

>     Instead, we can also think of alpha-renaming as a one-time process
>     of completely eliminating the symbol that was renamed into another
>     symbol from all parts of the program state, including GDM, and
>     making sure it doesn't ever reincarnate. Eg., ensure that
>     SValBuilder would not construct the old symbolic value again, even
>     if asked directly to do so, but return the new symbolic value
>     instead. This approach seems preferable to us, because it doesn't
>     require O(N) lookups described above, and all the necessary code
>     is put in the single callback.
>
>     Note that while an "evalRenaming()" callback is necessary to
>     support alpha-renaming, it is not really scary if this callback is
>     not instantly implemented in the checker; the checker would just
>     fail in these situations as it always did. For summary callbacks,
>     this is not the case.
>
>
> I think this might start to get very hard to track. I would prefer a 
> new kind of symbol that is going to be the subject of actualization. 
> So we could assert when a symbol that is not supposed to be actualized 
> is being renamed, or a such symbol is used after summary application 
> without actualization.

Hmm. You mean like if the checker forgets to rename his symbol, it'd be 
hard to notice and debug? Then i'm not seeing how a new symbol class 
would help: we either need to mark all symbols in the caller state as 
renamed, or all symbols in the callee state as not yet renamed, but both 
approaches require iterating through the state, including the checker 
data, and as such are not much easier than actually conducting the 
renaming properly.

But the way i see how our summary currently works, it wasn't hard to 
keep an eye on all symbols.

>     Similarly, SymbolDerived is tied to the value of the region at the
>     moment of time in which invalidation occured, and this moment also
>     needs to be shifted to the relevant moment in the invalidation
>     cascade.
>
>     We cannot yet predict how resource-heavy the cascading
>     invalidation approach would be, and it more or less works with our
>     current dummy approach, so we can implement the cascading
>     invalidation, or delay it for later patches.
>
>
> Basically the invalidation depends on ordering of events, so it 
> requires timestamps I guess. Also, when a summary is applied, 
> basically the checker get a list of exploded nodes and it can replay 
> some of the state changes to based on the context and the summary. But 
> what happens when the checker needs to do some invalidation or add 
> some assumption on aliasing (that affects invalidation)? I have the 
> feeling that each time we want to solve some of these problems we are 
> getting close and closer to a full replay.

With such "cascading invalidation", we replay only conservatively 
evaluated calls. That's extra work, but hopefully not too much. On the 
other hand, now that i think of it, this time i'd really have a look at 
our hierarchy - probably i already forgot what i was thinking, but i 
feel that by making SymbolConjured or SymbolDerived carry extra data 
with them, we could probably avoid this problem without cascading 
replays. Needs more thought.

>     Regarding interaction between inter-dependent checkers. While we
>     didn't code this yet, it might be partially naively solved by
>     introducing dependency graphs for checkers. Eg. a checker that
>     relies on taint propagation may pull the GenericTaintChecker as
>     its dependency, and then our checker gets summary-applied after
>     GenericTaintChecker, when the taint information is already
>     available. Another minor use case for this feature is that if you
>     explicitly run only one path-sensitive checker that only registers
>     itself on C++ files, then you can skip pulling core checkers on C
>     and ObjC files and run only AST checkers for them, saving analysis
>     time.
>
>
> Dependency graph for checkers would be useful regardless of summary 
> based IPA. I had to do workarounds several times because the lack of 
> this. But there are more to these dependencies than checkers depending 
> on each other. Checkers might also add constraints that the core 
> symbolic execution depends on.

Yep.

>     This also may be useful if the checker wants to refer to a value
>     in the warning message in a user-friendly manner. If anybody likes
>     something like this as a separate commit, i can open a review.
>
>
> Just an update, as far as I remember this is already committed.

Yep.

>     =2= Then we start with alpha-renaming, and we've got the following
>     problems to deal with:
>         - Decide if we want to keep renaming inside the constrant
>     manager or let it work through destroying symbols, as described in
>     this letter;
>         - Decide how to work with renaming of concrete offsets in
>     store keys into symbolic+concrete offsets, as described in the
>     previous letter
>     http://lists.llvm.org/pipermail/cfe-dev/2015-October/045690.html
>     (the last part of the miscellaneous section). Possible solutions
>     include making store keys heavier or reconstructing original
>     regions by offsets (tried this, very ugly and unreliable, imho;
>     the former solution would also solve the issue pointed out in
>     http://lists.llvm.org/pipermail/cfe-dev/2015-October/045704.html).
>     Solving this by recording-and-replaying all stores is something
>     we'd try to avoid as best as we can, i guess.
>
>
> What is the status of this one? Do you have a patch under review?

Nope, not yet, got distracted... Also, we really need the smart data map 
before this, it makes things a lot easier and without it a lot of 
progress is stalled, so i'd certainly work on that first. A few thoughts 
on the approach to alpha-renaming i'm having in mind:

1. For every renaming situation, create a specific partial renamer 
object inherited from FullSValVisitor<RetTy = Optional<SVal>> that 
contains information about what exactly is being renamed. For example, 
if we rename SymExpr 'x' to SymExpr 'y', this visitor would return 
nonloc::SymbolVal 'y' for SymExpr 'x' and a missing optional value for 
everything else. For summary, it would conduct the stack frame context 
shift on stack memory spaces and do whatever needs to be done with 
SymbolConjured, SymbolDerived, etc.

2. Create a generic renamer inherited from FullSValVisitor<RetTy = SVal> 
(not optional this time!) that consumes an instance of the partial 
renamer. The generic renamer would traverse the SVal recursively, and 
trust his partial renamer instance above anything else on every item 
that the partial renamer successfully renames. Then it tries to 
re-construct bigger SVal's after smaller SVal's were renamed by the 
partial renamer. If the partial renamer changes nothing (even if it 
always returns a valid SVal), then the original SVal should remain 
unchanged, i guess.

3. Then i'd probably go ahead and traverse the state to do replacements. 
This section depends on how the smart data map gets done.

>     =3= Decide if we want context-sensitive or context-insensitive
>     summaries, as discussed above in this letter.
>
>
> Is there a consensus on this one?

Nope, not yet. But all the earlier machinery seems to be pretty much the 
same for both approaches, as far as i can see. There's one thing i'm 
pretty sure about though (disclaimer - when i'm sure, it usually means 
i'm going to say something very stupid): if we want to eventually 
implement inter-unit with serialized summaries, then these summaries 
would better be context-insensitive, otherwise how do we know what 
contexts would we need. Well, we could make a complicated IPC system in 
which different instances of clang constantly ask each other for 
context-sensitive summaries of functions they're looking for in 
particular contexts, but, hmm, huh, dunno really...

> Lets try to summarize what are the tasks:
> - Summary apply callback for checkers
> - Solve the interdependency problem
> - Update all of the checkers?
> - Timestamps to solve some aliasing related problems
> - Filter confusing assuming.. stuff from path reconstruction
> - Figure out how to do leak related checks properly
> - Figure out how to do invalidation properly
> - Anything I missed?

There are also some FIXME tests in test/Analysis/summary-ipa-xfail.cpp , 
i don't really remember what's going on here. I guess it has something 
to do with complicated passing-by-reference, and temporary objects, and 
compound values. Otherwise, as far as i remember, that's it in case 
you'd like to take our experimental code and fix it.

For leaks, if i understand correctly, there isn't much action required. 
If a symbol leaks inside the callee, it couldn't have existed in the 
caller. If a symbol appears inside the callee but doesn't leak there, 
then it'd be collected by the caller anyway. Different statuses like 
"pointer-escaped" should be applied by the checker during application of 
the summary. So i've a feeling that no special action is required to 
handle that. At most, we'd need to handle liveness of sutff from 
different stack frame contexts, as right now we collect dead symbols 
properly on the end of function (while still inside the function), but 
with summary we cannot do that. So it's worth having a look, but i'm not 
sure it's actually broken.

Best regards,
Artem.