[cfe-dev] [analyzer] An idea regarding our data map.

Tue May 31 09:00:06 PDT 2016

Hello,

On 31.05.2016 16:53, Gábor Horváth wrote:
> Hi!
>
> On 26 May 2016 at 14:56, Artem Dergachev via cfe-dev 
> <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
>
>     I've got an idea. The idea comes as a replacement for fixing
>     current issues with the GDM (sharing traits across checkers,
>     automating the routine work on checker side). The idea is: GDM is
>     not the right place to model most of the checker-controlled
>     program state traits, instead we should encourage checkers to use
>     the Store, which already has most of the necessary features.
>
>
> Sharing traits and automating clean-up and other tedious tasks would 
> be awesome. However isn't a dependency graph of checkers (to restrict 
> the order of execution) is a prerequirement for sharing traits?

Yeah, i guess it's inevitably necessary. For this proof of concept, when 
i was playing with SimpleStreamChecker, i put checking into PreCall 
("first check our invariant then close the file") and modeling in 
post-call ("before the call it's not closed yet"). But i've little hope 
that such tricks would work in the generic case.

By the way, forgot to mention after a few rewrites: in all examples i 
provide the checker is split into the "modeling" checker (which throws 
no reports) and the "checking" checker (which doesn't change the program 
state), which interact through the smart traits. PthreadLockChecker 
example is split into three parts. So this facility seems to allow 
splitting large checkers, like MallocChecker (4000+ lines of code) into 
smaller checkers.

>     Specifically, whenever the checker tries to model a certain data
>     structure, which exists (or could have existed) somewhere in the
>     program (be it user or kernel memory, or CPU registers, or shadow
>     memory of some sanitizer-like tool which could have been run
>     alongside the program), then the checker should model this data
>     structure directly in the Store(?)
>
>
> How would you implement this? Wouldn't this kind of approach make the 
> store heavier? I think the common case should be that, no extra data 
> is associated with the live "objects". Making the common case more 
> heavy weight might imply a performance problem.

I just hope that we're logarithmic enough to avoid performance drops due 
to that issue. If we're not, then i'd suggest putting most of the 
"ghost" objects into some separate bucket, because very often we known 
where to find what we're looking for - if we're looking for a ghost 
object or for a real object. Finally, performance of 
removeDeadBindings() would drop because of the single worklist, but 
that's intentional - we need the single worklist. We shouldn't be 
suffering from significant memory overhead though, because the amount of 
stuff to store is roughly the same anyway.

> Also note that, checkers are not always modelling a data structure. It 
> might model a property of a path e.g.: was something locked on this 
> path? Of course, you could still attach this kind of information to a 
> mutex or something like that, but if you are only curious whether a 
> certain method was invoked in a critical session, it is easier to only 
> track the lock deepness for each path, this way you only have one 
> integer to deal with in the state instead of several on/or switches 
> for each mutexes along the paths.

For the lock deepness use case, i'd suggest a single ghost variable that 
gets incremented by the checker when something is locked and decremented 
when something is unlocked. This variable is essentially a simple data 
structure that does not instantly exist in the program, but can probably 
be implemented by some dynamic instrumentation technique, so i'm still 
fine with imagining such data structure for the purpose of static 
analysis. My last experiment with PthreadLockChecker provides an even 
more complicated example - it models the stack of mutexes for catching 
lock order reversal.

So far the only thing i'm expecting troubles with is constraint-like 
information (eg. range constraints on symbols) - which is truly a kind 
of meta-information that truly cannot be represented as a data 
structure. For such constraints, i've no plan so far but to leave them 
in the GDM; see below though for criteria.

>     This approach may be thought of as bringing the idea of "ghost
>     code" from model checking into our analyzer. We're supposed
>     construct ghost structures during analysis, which mimic real or
>     imaginary things that happen during program execution.
>
>     With the current region hierarchy, it's a bit hard to follow this
>     approach, because memory regions are hard to construct. So my
>     proposal is to add a few more region classes, which would allow
>     checkers to cheaply, probably even transparently, construct the
>     necessary structures in the Store.
>
>     A few reasons why i ended up liking this idea:
>
>     1. It's natural. After all, the program doesn't have anything like
>     GDM when it executes dynamically - it only has memory, which means
>     that the Store, being our memory model, should easily take care of
>     modeling anything that is going on in the program, and useful
>     features of the Store are still useful for such modeling, so why
>     don't we use them. In other words, the most natural way to explain
>     how exactly should the smart data map automate routine tasks would
>     be to say that it should behave exactly like the Store. So why
>     don't we just use the Store.
>
>
> They have a lot in common indeed, it is a good observation! However I 
> think there are some subtle details. The lifetime of the information 
> that is attached to a data structure might not be the same as the 
> lifetime of the data structure itself. E.g.: one might use the return 
> value of strlen after the argument of strlen is already garbage 
> collected. With some checker participation though this might be worked 
> around (in fact there are workarounds for this same problem with GDM 
> as fas as I remember).

Hmm. That's a problem we seem to automagically solve - behavior of GDM 
is very store-like in that case. We used to have SymbolMetadata for 
string length, for which liveness is checker-managed. Now, with modeling 
string length as ghost object, strlen() returns a SymbolRegionValue of 
the ghost object, which lives exactly as long as we need, even if the 
ghost object dies.

>     2. There's no problem with sharing the trait. Any checker can see
>     what's going on with a plain getSVal(). Of course, we can hide the
>     key for getSVal() in anonymous namespace and probably provide
>     getters/setters instead, which wouldn't stop the Store from
>     iterating over bindings.
>
>
> In order to truly be able to participate, guarantees for execution 
> order should be introduced, se one of my previous comments.

Yep.

>     3. There's no problem with routine work and code duplication in
>     checkers, which makes writing checkers a lot easier:
>       a. Store transparently takes care of cleaning up live/dead
>     symbols, there's no more problems with different worklists for
>     store and checkers.
>
>
> This would be awesome!
>
>       b. The store takes care of invalidation, eg. checkRegionChanges
>     could probably be automated with this approach, by simply choosing
>     a good parent region for our checker data structures.
>
>
> The invalidation rules might be different for the associated 
> information than for the object. E.g.: one might assume that a private 
> property is only modified by friends (even tough it is possible in C++ 
> to modify private members, it is very rare). Is it possible to also 
> have custom invalidation schemes for these traits? Or the checkers 
> would fall back to GDM in those cases? (You think of this as a GDM 
> alternative or replacement?)

I'd agree that invalidation of ghost-fields is so far the most unclear 
part of the proposal. Even for SimpleStreamChecker "version 3" it didn't 
work very well - we're losing positives on test due to lack of ability 
to manage invalidation manually. But that's about pointer-escape, not 
region-changes. I've not tested much of the region-changes stuff yet.

I'm thinking more on that, with two ideas - either add some helpers into 
the trait objects ("this trait doesn't want to be auto-invalidated by 
functions matching certain criteria"), or disable auto-invalidation 
completely for ghost-fields (they'd no longer have a symbolic offset, 
but some new kind of offset that's neither concrete nor symbolic).

Most importantly, binding to ghost fields shouldn't invalidate regular 
fields.

Maybe i'd find out that ghost fields won't work as nicely as i expected, 
so i'd have to think of them as "shadow memory" instead - pretend that 
ghost values mapped to memregions are actually "outside" memregions. But 
i still hope that it's not the case and managing superregions would be 
helpful; the 'errno' ghost variable is my primary motivation for this 
(we don't always have it in the AST, as it could be macroed into some 
`*__errno()', but we still know in which memspace it should reside).

>       c. Also, Store provides us with SymbolRegionValue-class and
>     SymbolDerived-class symbols for our structures when values in them
>     are unknown; this way we can reason if it's the same unknown or a
>     probably-different unknown.
>
>
> This is a very interesting property! Tracking this would be automatic? 
> Do you know how much extra overhead would that be?

The Store just auto-creates them for regions initialized in a certain 
manner (eg. not initialized at all -> create SymbolRegionValue), and i 
guess it just shouldn't make an exception for ghost regions.

When this feature is not used, as in my examples with stream checker, we 
simply see if the value is constant or not, and we don't care if it's 
constant.

There's slight problem with marking the stream state as unknown - right 
now i bind an UnknownVal into it. Once we do want to use such feature, 
we'd have to call a standard invalidation method on that region. With 
GDM, we could simply remove it from GDM - "stop tracking", then "start 
tracking" again. But we cannot remove bindings from the Store, because 
we want to avoid reincarnation of SymbolRegionValue. This UnknownVal 
binding would be automatically removed once the stream dies, but i agree 
that it's still some overhead to measure.

>       d. The store can be easily dump()'ed, we no longer need hooks
>     for printing the program state on checker side.
>
>
> I think still need a way to stringify the metadata we attached to the 
> "objects".

Yeah, it'd be nice to have constants mapped to some string values, eg. 
right now the dump looks like:

     (StreamState{conj_$0<void *>}, 0, Direct): 1 S32b

But it'd be better as:

     (StreamState{conj_$0<void *>}, 0, Direct): Open

We could do that by adding a helper method to the state trait, maybe 
merge the whole enum into the trait.
>
>       e. In the future, we can probably add alpha-renaming, and then
>     problems and solutions would most likely be exactly the same for
>     the Store itself and for checker data structures.
>
>
> Good point!
>
>
>     4. This approach is very easy to implement and to explain. Most of
>     the time, we just use the standard getSVal()/bindLoc() API for
>     everything, probably some overrides.
>
>
> If you think of this as an alternative to GDM, a guide explaining when 
> to choose the GDM would be awesome.
>
>
>     A few arguments against this approach:
>
>     1. Some program state traits do not try to model data structures
>     that already exist in the program. The best example is probably
>     range constraints. Constraints are immaterial for dynamic
>     execution - they're a meta-information about chosen execution
>     path. For such traits, i do not yet present a solution, though i
>     have a couple of thoughts - if anybody's curious, i can dump. In
>     particular, i feel as if constraint-like traits can be easily
>     supported with a separate generic approach, and it'd cover all of
>     our problems.
>
>
> What kind of ranges are we talking about? Aren't we only store ranges 
> for symbols using the constraint manager?

A few more info on this matter, as it seems important. Yeah, the primary 
example is ranges for symbols. One more example would be dynamic type 
map, but it's a bit more complicated.

So i propose the following reasoning as an approximation for a criteria 
of using GDM vs. using Store: it's somehow similar to difference between 
symbols and regions. I'm not possessing a perfectly clear understanding 
right now, but i'd try my best to explain my thoughts below:

Region contents are essentially mutable. Today they contain one thing, 
tomorrow they contain another thing, but it's still the same region. You 
can overwrite it, you can move it to another region, you can invalidate 
it, you can placement-new on top of it.

Symbols, on the other hand, are essentially immutable. Any information 
about the symbol which we managed to gather about the symbol was true 
since forever and shall remain true forever. A symbol of range [1, 2] 
would never take value '3', and has never taken before, during execution 
of the program over that particular branch.

These two concepts - of regions and symbols - work together to make the 
analyzer model simple operations.

GDM was designed to let the analyzer model more complex operations. 
However, my point is that all GDM traits are either "Store-like", which 
means, their keys behave like regions into which we bind values, then 
re-bind different values, or they are "Constraint-like", which means 
their keys behave like symbols, and information bound to them can only 
be expanded, but not mutated.

For instance, despite SimpleStreamChecker's key is a symbol, the trait 
is essentially "Store-like": the same file descriptor is first open, 
then closed. Which is why i propose first fixing the situation by 
providing a region tied to that symbol, and then bind values to 
represent file descriptor operations.

Now we're ready to consider the DynamicTypePropagation example. In fact, 
dynamic type information sounds like "Constraint-like" trait, because we 
can only know more about the dynamic type of every object. However, 
currently dynamic type is bound to a region. There's a problem with that 
- we can placement-new into the region, and the dynamic type shall 
mutate. Because region contents are mutable. So, in my opinion, dynamic 
type map in its current implementation is not a well-defined trait - it 
should be refactored into two traits, one of which is "Store-like" (bind 
an object ID symbol to every region; it can be copied, moved, 
invalidated, or affected by placement-new as much as we want), and 
another is "Constraint-like" (map object ID symbol to its dynamic type) 
- which is truly immutable. And, again, these two concepts - 
"Store-like" things and "Constraint-like things" - work together.

So my criterion so far is to move "Store-like" objects to the Store, 
where they seem to belong, and keep "Constraint-like" objects in the GDM.

Additionally, I believe that "Constraint-like" objects are always mapped 
to symbols (i.e. a map from symbols to something - because you can't tie 
them to regions, because region contents are mutable). So i believe that 
stuff like liveness or alpha-renaming should be very simple for 
"Constraint-like" traits - i.e. never keep alive, just clean up symbols, 
merge info on alpha-renaming - and we can still automate everything.

>     2. GDM is done in terms of sets and maps, however Store doesn't
>     have sets or maps, it only works with C-like memory with at best
>     1-dimensional arrays. Of course we can model maps of ints to ints
>     as ghost arrays with symbolic bindings to elements, and it's a
>     safe and sane implementation. For a map from int to set of ints,
>     it gets harder. But i think it's a non-issue: it's still more
>     important to figure out how do we model more complicated
>     containers (i still dream of container symbols and container
>     constraints, which would reduce this issue to issue '1.'), and
>     then we could re-visit this issue.
>
>
> We should solve this problem without too much execution overhead on 
> the common case (no information stored by a checker).

In fact, i'm thinking^W dreaming about the same thing here - container 
constraints should be "Constraint-like", tied to container content symbols.

>     I've made proof-of-concept patchs to SimpleStreamChecker (two
>     different approaches) and PthreadLockChecker. The patch introduces
>     four kinds of ghost regions - each "smart trait" owns its own
>     memory space which doesn't get invalidated (for things that don't
>     need to be invalidated, or for manual control), and allows
>     creating one variable inside that (or any other) memory space, one
>     field inside any region, or one symbol-based region for any symbol
>     (unlike SymbolicRegion, the ghost symbol-based region doesn't need
>     to be based on a pointer-type symbol, any symbol will do fine).
>
>     The patches are on my github because i'd re-arrange them if i was
>     to put them on phabricator, and not sure what to do with multiple
>     checker variants. So i'm asking to have a look at the checkers and
>     see if you want to write checkers *that* way, instead of the
>     traditional GDM, then i'd focus on fixing smaller issues, or i'd
>     have to come up with something different.
>
>     SimpleStreamCheckerV2 defines symbol-based ghost regions for file
>     descriptor symbols and stores constant values (0 or 1) inside
>     them, indicating opened and closed streams. It doesn't need to
>     store the "unknown" state - it is represented by non-constant
>     values, eg. SymbolRegionValue of the ghost region. Link:
>     https://githubnValue of the ghost
>     r.com/haoNoQ/clang/commit/578e79009bbac81c129daff19e14dbad16e2e43d
>     <https://github.com/haoNoQ/clang/commit/578e79009bbac81c129daff19e14dbad16e2e43d>
>
>
> As a patch against an existing checker this would be easier to 
> evaluate what the differences are. But at first glance I did like the 
> UX of checker development this provides except for registering the 
> trait :)

One sec. Here. Hope these links don't die too soon. Just wasn't sure how 
to organize that on github.

Original SimpleStreamChecker vs. SimpleStreamCheckerV2:
https://www.diffchecker.com/a9fqicng

Original SimpleStreamChecker vs. SimpleStreamCheckerV3:
https://www.diffchecker.com/3ketimia

SimpleStreamCheckerV2 vs. SimpleStreamCheckerV3:
https://www.diffchecker.com/ebac2icl

PthreadLockChecker vs. PthreadLockCheckerV2:
https://www.diffchecker.com/mixtsmra

>     SimpleStreamCheckerV3 defines a ghost field inside each symbolic
>     region corresponding to the each stream's pointer symbol, and
>     stores the same constants inside these fields. It has troubles
>     suppressing invalidation of these fields, which, in my opinion,
>     should ideally be solved through evalcall/bodyfarm of library
>     functions. Both versions do not suffer from the "zombie symbols"
>     issue. Link:
>     https://github.com/haoNoQ/clang/commit/bb63471fa6d70502e61728052b6b7a649c4d80d1
>
>     PthreadLockCheckerV2 defines a ghost field inside each mutex,
>     indicating one of the four states (Initialized, Locked, Unlocked,
>     Destroyed), which allows it to find double locks/unlocks etc.
>     Additionally, it models the stack of locks in order to find lock
>     order reversals. This is modeled through a single ghost variable
>     of type `void**', which represents the stack pointer - the checker
>     dereferences this pointer to access or modify the stack and
>     increments/decrements it. Link:
>     https://github.com/haoNoQ/clang/commit/ab581e4da9c3ef2ddfc0fe75a20ffa9ce789fe8d
>
>     None of these checker's code actually deals with ghost regions
>     through MemRegionManager - they always use wrappers in the
>     ProgramState class, which in my opinion are easy and intuitive,
>     though probably more such wrappers need to be introduced.
>
>     All the newly introduced terminology is worth discussing, i don't
>     really like it.
>     _______________________________________________
>     cfe-dev mailing list
>     cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>     http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20160531/8c9e6363/attachment.html>