[cfe-dev] [analyzer] An idea regarding our data map.
Artem Dergachev via cfe-dev
cfe-dev at lists.llvm.org
Tue May 31 09:00:06 PDT 2016
Hello,
On 31.05.2016 16:53, Gábor Horváth wrote:
> Hi!
>
> On 26 May 2016 at 14:56, Artem Dergachev via cfe-dev
> <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
>
> I've got an idea. The idea comes as a replacement for fixing
> current issues with the GDM (sharing traits across checkers,
> automating the routine work on checker side). The idea is: GDM is
> not the right place to model most of the checker-controlled
> program state traits, instead we should encourage checkers to use
> the Store, which already has most of the necessary features.
>
>
> Sharing traits and automating clean-up and other tedious tasks would
> be awesome. However isn't a dependency graph of checkers (to restrict
> the order of execution) is a prerequirement for sharing traits?
Yeah, i guess it's inevitably necessary. For this proof of concept, when
i was playing with SimpleStreamChecker, i put checking into PreCall
("first check our invariant then close the file") and modeling in
post-call ("before the call it's not closed yet"). But i've little hope
that such tricks would work in the generic case.
By the way, forgot to mention after a few rewrites: in all examples i
provide the checker is split into the "modeling" checker (which throws
no reports) and the "checking" checker (which doesn't change the program
state), which interact through the smart traits. PthreadLockChecker
example is split into three parts. So this facility seems to allow
splitting large checkers, like MallocChecker (4000+ lines of code) into
smaller checkers.
> Specifically, whenever the checker tries to model a certain data
> structure, which exists (or could have existed) somewhere in the
> program (be it user or kernel memory, or CPU registers, or shadow
> memory of some sanitizer-like tool which could have been run
> alongside the program), then the checker should model this data
> structure directly in the Store(?)
>
>
> How would you implement this? Wouldn't this kind of approach make the
> store heavier? I think the common case should be that, no extra data
> is associated with the live "objects". Making the common case more
> heavy weight might imply a performance problem.
I just hope that we're logarithmic enough to avoid performance drops due
to that issue. If we're not, then i'd suggest putting most of the
"ghost" objects into some separate bucket, because very often we known
where to find what we're looking for - if we're looking for a ghost
object or for a real object. Finally, performance of
removeDeadBindings() would drop because of the single worklist, but
that's intentional - we need the single worklist. We shouldn't be
suffering from significant memory overhead though, because the amount of
stuff to store is roughly the same anyway.
> Also note that, checkers are not always modelling a data structure. It
> might model a property of a path e.g.: was something locked on this
> path? Of course, you could still attach this kind of information to a
> mutex or something like that, but if you are only curious whether a
> certain method was invoked in a critical session, it is easier to only
> track the lock deepness for each path, this way you only have one
> integer to deal with in the state instead of several on/or switches
> for each mutexes along the paths.
For the lock deepness use case, i'd suggest a single ghost variable that
gets incremented by the checker when something is locked and decremented
when something is unlocked. This variable is essentially a simple data
structure that does not instantly exist in the program, but can probably
be implemented by some dynamic instrumentation technique, so i'm still
fine with imagining such data structure for the purpose of static
analysis. My last experiment with PthreadLockChecker provides an even
more complicated example - it models the stack of mutexes for catching
lock order reversal.
So far the only thing i'm expecting troubles with is constraint-like
information (eg. range constraints on symbols) - which is truly a kind
of meta-information that truly cannot be represented as a data
structure. For such constraints, i've no plan so far but to leave them
in the GDM; see below though for criteria.
> This approach may be thought of as bringing the idea of "ghost
> code" from model checking into our analyzer. We're supposed
> construct ghost structures during analysis, which mimic real or
> imaginary things that happen during program execution.
>
> With the current region hierarchy, it's a bit hard to follow this
> approach, because memory regions are hard to construct. So my
> proposal is to add a few more region classes, which would allow
> checkers to cheaply, probably even transparently, construct the
> necessary structures in the Store.
>
> A few reasons why i ended up liking this idea:
>
> 1. It's natural. After all, the program doesn't have anything like
> GDM when it executes dynamically - it only has memory, which means
> that the Store, being our memory model, should easily take care of
> modeling anything that is going on in the program, and useful
> features of the Store are still useful for such modeling, so why
> don't we use them. In other words, the most natural way to explain
> how exactly should the smart data map automate routine tasks would
> be to say that it should behave exactly like the Store. So why
> don't we just use the Store.
>
>
> They have a lot in common indeed, it is a good observation! However I
> think there are some subtle details. The lifetime of the information
> that is attached to a data structure might not be the same as the
> lifetime of the data structure itself. E.g.: one might use the return
> value of strlen after the argument of strlen is already garbage
> collected. With some checker participation though this might be worked
> around (in fact there are workarounds for this same problem with GDM
> as fas as I remember).
Hmm. That's a problem we seem to automagically solve - behavior of GDM
is very store-like in that case. We used to have SymbolMetadata for
string length, for which liveness is checker-managed. Now, with modeling
string length as ghost object, strlen() returns a SymbolRegionValue of
the ghost object, which lives exactly as long as we need, even if the
ghost object dies.
> 2. There's no problem with sharing the trait. Any checker can see
> what's going on with a plain getSVal(). Of course, we can hide the
> key for getSVal() in anonymous namespace and probably provide
> getters/setters instead, which wouldn't stop the Store from
> iterating over bindings.
>
>
> In order to truly be able to participate, guarantees for execution
> order should be introduced, se one of my previous comments.
Yep.
> 3. There's no problem with routine work and code duplication in
> checkers, which makes writing checkers a lot easier:
> a. Store transparently takes care of cleaning up live/dead
> symbols, there's no more problems with different worklists for
> store and checkers.
>
>
> This would be awesome!
>
> b. The store takes care of invalidation, eg. checkRegionChanges
> could probably be automated with this approach, by simply choosing
> a good parent region for our checker data structures.
>
>
> The invalidation rules might be different for the associated
> information than for the object. E.g.: one might assume that a private
> property is only modified by friends (even tough it is possible in C++
> to modify private members, it is very rare). Is it possible to also
> have custom invalidation schemes for these traits? Or the checkers
> would fall back to GDM in those cases? (You think of this as a GDM
> alternative or replacement?)
I'd agree that invalidation of ghost-fields is so far the most unclear
part of the proposal. Even for SimpleStreamChecker "version 3" it didn't
work very well - we're losing positives on test due to lack of ability
to manage invalidation manually. But that's about pointer-escape, not
region-changes. I've not tested much of the region-changes stuff yet.
I'm thinking more on that, with two ideas - either add some helpers into
the trait objects ("this trait doesn't want to be auto-invalidated by
functions matching certain criteria"), or disable auto-invalidation
completely for ghost-fields (they'd no longer have a symbolic offset,
but some new kind of offset that's neither concrete nor symbolic).
Most importantly, binding to ghost fields shouldn't invalidate regular
fields.
Maybe i'd find out that ghost fields won't work as nicely as i expected,
so i'd have to think of them as "shadow memory" instead - pretend that
ghost values mapped to memregions are actually "outside" memregions. But
i still hope that it's not the case and managing superregions would be
helpful; the 'errno' ghost variable is my primary motivation for this
(we don't always have it in the AST, as it could be macroed into some
`*__errno()', but we still know in which memspace it should reside).
> c. Also, Store provides us with SymbolRegionValue-class and
> SymbolDerived-class symbols for our structures when values in them
> are unknown; this way we can reason if it's the same unknown or a
> probably-different unknown.
>
>
> This is a very interesting property! Tracking this would be automatic?
> Do you know how much extra overhead would that be?
The Store just auto-creates them for regions initialized in a certain
manner (eg. not initialized at all -> create SymbolRegionValue), and i
guess it just shouldn't make an exception for ghost regions.
When this feature is not used, as in my examples with stream checker, we
simply see if the value is constant or not, and we don't care if it's
constant.
There's slight problem with marking the stream state as unknown - right
now i bind an UnknownVal into it. Once we do want to use such feature,
we'd have to call a standard invalidation method on that region. With
GDM, we could simply remove it from GDM - "stop tracking", then "start
tracking" again. But we cannot remove bindings from the Store, because
we want to avoid reincarnation of SymbolRegionValue. This UnknownVal
binding would be automatically removed once the stream dies, but i agree
that it's still some overhead to measure.
> d. The store can be easily dump()'ed, we no longer need hooks
> for printing the program state on checker side.
>
>
> I think still need a way to stringify the metadata we attached to the
> "objects".
Yeah, it'd be nice to have constants mapped to some string values, eg.
right now the dump looks like:
(StreamState{conj_$0<void *>}, 0, Direct): 1 S32b
But it'd be better as:
(StreamState{conj_$0<void *>}, 0, Direct): Open
We could do that by adding a helper method to the state trait, maybe
merge the whole enum into the trait.
>
> e. In the future, we can probably add alpha-renaming, and then
> problems and solutions would most likely be exactly the same for
> the Store itself and for checker data structures.
>
>
> Good point!
>
>
> 4. This approach is very easy to implement and to explain. Most of
> the time, we just use the standard getSVal()/bindLoc() API for
> everything, probably some overrides.
>
>
> If you think of this as an alternative to GDM, a guide explaining when
> to choose the GDM would be awesome.
>
>
> A few arguments against this approach:
>
> 1. Some program state traits do not try to model data structures
> that already exist in the program. The best example is probably
> range constraints. Constraints are immaterial for dynamic
> execution - they're a meta-information about chosen execution
> path. For such traits, i do not yet present a solution, though i
> have a couple of thoughts - if anybody's curious, i can dump. In
> particular, i feel as if constraint-like traits can be easily
> supported with a separate generic approach, and it'd cover all of
> our problems.
>
>
> What kind of ranges are we talking about? Aren't we only store ranges
> for symbols using the constraint manager?
A few more info on this matter, as it seems important. Yeah, the primary
example is ranges for symbols. One more example would be dynamic type
map, but it's a bit more complicated.
So i propose the following reasoning as an approximation for a criteria
of using GDM vs. using Store: it's somehow similar to difference between
symbols and regions. I'm not possessing a perfectly clear understanding
right now, but i'd try my best to explain my thoughts below:
Region contents are essentially mutable. Today they contain one thing,
tomorrow they contain another thing, but it's still the same region. You
can overwrite it, you can move it to another region, you can invalidate
it, you can placement-new on top of it.
Symbols, on the other hand, are essentially immutable. Any information
about the symbol which we managed to gather about the symbol was true
since forever and shall remain true forever. A symbol of range [1, 2]
would never take value '3', and has never taken before, during execution
of the program over that particular branch.
These two concepts - of regions and symbols - work together to make the
analyzer model simple operations.
GDM was designed to let the analyzer model more complex operations.
However, my point is that all GDM traits are either "Store-like", which
means, their keys behave like regions into which we bind values, then
re-bind different values, or they are "Constraint-like", which means
their keys behave like symbols, and information bound to them can only
be expanded, but not mutated.
For instance, despite SimpleStreamChecker's key is a symbol, the trait
is essentially "Store-like": the same file descriptor is first open,
then closed. Which is why i propose first fixing the situation by
providing a region tied to that symbol, and then bind values to
represent file descriptor operations.
Now we're ready to consider the DynamicTypePropagation example. In fact,
dynamic type information sounds like "Constraint-like" trait, because we
can only know more about the dynamic type of every object. However,
currently dynamic type is bound to a region. There's a problem with that
- we can placement-new into the region, and the dynamic type shall
mutate. Because region contents are mutable. So, in my opinion, dynamic
type map in its current implementation is not a well-defined trait - it
should be refactored into two traits, one of which is "Store-like" (bind
an object ID symbol to every region; it can be copied, moved,
invalidated, or affected by placement-new as much as we want), and
another is "Constraint-like" (map object ID symbol to its dynamic type)
- which is truly immutable. And, again, these two concepts -
"Store-like" things and "Constraint-like things" - work together.
So my criterion so far is to move "Store-like" objects to the Store,
where they seem to belong, and keep "Constraint-like" objects in the GDM.
Additionally, I believe that "Constraint-like" objects are always mapped
to symbols (i.e. a map from symbols to something - because you can't tie
them to regions, because region contents are mutable). So i believe that
stuff like liveness or alpha-renaming should be very simple for
"Constraint-like" traits - i.e. never keep alive, just clean up symbols,
merge info on alpha-renaming - and we can still automate everything.
> 2. GDM is done in terms of sets and maps, however Store doesn't
> have sets or maps, it only works with C-like memory with at best
> 1-dimensional arrays. Of course we can model maps of ints to ints
> as ghost arrays with symbolic bindings to elements, and it's a
> safe and sane implementation. For a map from int to set of ints,
> it gets harder. But i think it's a non-issue: it's still more
> important to figure out how do we model more complicated
> containers (i still dream of container symbols and container
> constraints, which would reduce this issue to issue '1.'), and
> then we could re-visit this issue.
>
>
> We should solve this problem without too much execution overhead on
> the common case (no information stored by a checker).
In fact, i'm thinking^W dreaming about the same thing here - container
constraints should be "Constraint-like", tied to container content symbols.
> I've made proof-of-concept patchs to SimpleStreamChecker (two
> different approaches) and PthreadLockChecker. The patch introduces
> four kinds of ghost regions - each "smart trait" owns its own
> memory space which doesn't get invalidated (for things that don't
> need to be invalidated, or for manual control), and allows
> creating one variable inside that (or any other) memory space, one
> field inside any region, or one symbol-based region for any symbol
> (unlike SymbolicRegion, the ghost symbol-based region doesn't need
> to be based on a pointer-type symbol, any symbol will do fine).
>
> The patches are on my github because i'd re-arrange them if i was
> to put them on phabricator, and not sure what to do with multiple
> checker variants. So i'm asking to have a look at the checkers and
> see if you want to write checkers *that* way, instead of the
> traditional GDM, then i'd focus on fixing smaller issues, or i'd
> have to come up with something different.
>
> SimpleStreamCheckerV2 defines symbol-based ghost regions for file
> descriptor symbols and stores constant values (0 or 1) inside
> them, indicating opened and closed streams. It doesn't need to
> store the "unknown" state - it is represented by non-constant
> values, eg. SymbolRegionValue of the ghost region. Link:
> https://githubnValue of the ghost
> r.com/haoNoQ/clang/commit/578e79009bbac81c129daff19e14dbad16e2e43d
> <https://github.com/haoNoQ/clang/commit/578e79009bbac81c129daff19e14dbad16e2e43d>
>
>
> As a patch against an existing checker this would be easier to
> evaluate what the differences are. But at first glance I did like the
> UX of checker development this provides except for registering the
> trait :)
One sec. Here. Hope these links don't die too soon. Just wasn't sure how
to organize that on github.
Original SimpleStreamChecker vs. SimpleStreamCheckerV2:
https://www.diffchecker.com/a9fqicng
Original SimpleStreamChecker vs. SimpleStreamCheckerV3:
https://www.diffchecker.com/3ketimia
SimpleStreamCheckerV2 vs. SimpleStreamCheckerV3:
https://www.diffchecker.com/ebac2icl
PthreadLockChecker vs. PthreadLockCheckerV2:
https://www.diffchecker.com/mixtsmra
> SimpleStreamCheckerV3 defines a ghost field inside each symbolic
> region corresponding to the each stream's pointer symbol, and
> stores the same constants inside these fields. It has troubles
> suppressing invalidation of these fields, which, in my opinion,
> should ideally be solved through evalcall/bodyfarm of library
> functions. Both versions do not suffer from the "zombie symbols"
> issue. Link:
> https://github.com/haoNoQ/clang/commit/bb63471fa6d70502e61728052b6b7a649c4d80d1
>
> PthreadLockCheckerV2 defines a ghost field inside each mutex,
> indicating one of the four states (Initialized, Locked, Unlocked,
> Destroyed), which allows it to find double locks/unlocks etc.
> Additionally, it models the stack of locks in order to find lock
> order reversals. This is modeled through a single ghost variable
> of type `void**', which represents the stack pointer - the checker
> dereferences this pointer to access or modify the stack and
> increments/decrements it. Link:
> https://github.com/haoNoQ/clang/commit/ab581e4da9c3ef2ddfc0fe75a20ffa9ce789fe8d
>
> None of these checker's code actually deals with ghost regions
> through MemRegionManager - they always use wrappers in the
> ProgramState class, which in my opinion are easy and intuitive,
> though probably more such wrappers need to be introduced.
>
> All the newly introduced terminology is worth discussing, i don't
> really like it.
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20160531/8c9e6363/attachment.html>
More information about the cfe-dev
mailing list