[cfe-dev] [analyzer] An idea regarding our data map.
Artem Dergachev via cfe-dev
cfe-dev at lists.llvm.org
Thu May 26 05:56:53 PDT 2016
I've got an idea. The idea comes as a replacement for fixing current
issues with the GDM (sharing traits across checkers, automating the
routine work on checker side). The idea is: GDM is not the right place
to model most of the checker-controlled program state traits, instead we
should encourage checkers to use the Store, which already has most of
the necessary features.
Specifically, whenever the checker tries to model a certain data
structure, which exists (or could have existed) somewhere in the program
(be it user or kernel memory, or CPU registers, or shadow memory of some
sanitizer-like tool which could have been run alongside the program),
then the checker should model this data structure directly in the Store(?)
This approach may be thought of as bringing the idea of "ghost code"
from model checking into our analyzer. We're supposed construct ghost
structures during analysis, which mimic real or imaginary things that
happen during program execution.
With the current region hierarchy, it's a bit hard to follow this
approach, because memory regions are hard to construct. So my proposal
is to add a few more region classes, which would allow checkers to
cheaply, probably even transparently, construct the necessary structures
in the Store.
A few reasons why i ended up liking this idea:
1. It's natural. After all, the program doesn't have anything like GDM
when it executes dynamically - it only has memory, which means that the
Store, being our memory model, should easily take care of modeling
anything that is going on in the program, and useful features of the
Store are still useful for such modeling, so why don't we use them. In
other words, the most natural way to explain how exactly should the
smart data map automate routine tasks would be to say that it should
behave exactly like the Store. So why don't we just use the Store.
2. There's no problem with sharing the trait. Any checker can see what's
going on with a plain getSVal(). Of course, we can hide the key for
getSVal() in anonymous namespace and probably provide getters/setters
instead, which wouldn't stop the Store from iterating over bindings.
3. There's no problem with routine work and code duplication in
checkers, which makes writing checkers a lot easier:
a. Store transparently takes care of cleaning up live/dead symbols,
there's no more problems with different worklists for store and checkers.
b. The store takes care of invalidation, eg. checkRegionChanges could
probably be automated with this approach, by simply choosing a good
parent region for our checker data structures.
c. Also, Store provides us with SymbolRegionValue-class and
SymbolDerived-class symbols for our structures when values in them are
unknown; this way we can reason if it's the same unknown or a
probably-different unknown.
d. The store can be easily dump()'ed, we no longer need hooks for
printing the program state on checker side.
e. In the future, we can probably add alpha-renaming, and then
problems and solutions would most likely be exactly the same for the
Store itself and for checker data structures.
4. This approach is very easy to implement and to explain. Most of the
time, we just use the standard getSVal()/bindLoc() API for everything,
probably some overrides.
A few arguments against this approach:
1. Some program state traits do not try to model data structures that
already exist in the program. The best example is probably range
constraints. Constraints are immaterial for dynamic execution - they're
a meta-information about chosen execution path. For such traits, i do
not yet present a solution, though i have a couple of thoughts - if
anybody's curious, i can dump. In particular, i feel as if
constraint-like traits can be easily supported with a separate generic
approach, and it'd cover all of our problems.
2. GDM is done in terms of sets and maps, however Store doesn't have
sets or maps, it only works with C-like memory with at best
1-dimensional arrays. Of course we can model maps of ints to ints as
ghost arrays with symbolic bindings to elements, and it's a safe and
sane implementation. For a map from int to set of ints, it gets harder.
But i think it's a non-issue: it's still more important to figure out
how do we model more complicated containers (i still dream of container
symbols and container constraints, which would reduce this issue to
issue '1.'), and then we could re-visit this issue.
I've made proof-of-concept patchs to SimpleStreamChecker (two different
approaches) and PthreadLockChecker. The patch introduces four kinds of
ghost regions - each "smart trait" owns its own memory space which
doesn't get invalidated (for things that don't need to be invalidated,
or for manual control), and allows creating one variable inside that (or
any other) memory space, one field inside any region, or one
symbol-based region for any symbol (unlike SymbolicRegion, the ghost
symbol-based region doesn't need to be based on a pointer-type symbol,
any symbol will do fine).
The patches are on my github because i'd re-arrange them if i was to put
them on phabricator, and not sure what to do with multiple checker
variants. So i'm asking to have a look at the checkers and see if you
want to write checkers *that* way, instead of the traditional GDM, then
i'd focus on fixing smaller issues, or i'd have to come up with
something different.
SimpleStreamCheckerV2 defines symbol-based ghost regions for file
descriptor symbols and stores constant values (0 or 1) inside them,
indicating opened and closed streams. It doesn't need to store the
"unknown" state - it is represented by non-constant values, eg.
SymbolRegionValue of the ghost region. Link:
https://github.com/haoNoQ/clang/commit/578e79009bbac81c129daff19e14dbad16e2e43d
SimpleStreamCheckerV3 defines a ghost field inside each symbolic region
corresponding to the each stream's pointer symbol, and stores the same
constants inside these fields. It has troubles suppressing invalidation
of these fields, which, in my opinion, should ideally be solved through
evalcall/bodyfarm of library functions. Both versions do not suffer from
the "zombie symbols" issue. Link:
https://github.com/haoNoQ/clang/commit/bb63471fa6d70502e61728052b6b7a649c4d80d1
PthreadLockCheckerV2 defines a ghost field inside each mutex, indicating
one of the four states (Initialized, Locked, Unlocked, Destroyed), which
allows it to find double locks/unlocks etc. Additionally, it models the
stack of locks in order to find lock order reversals. This is modeled
through a single ghost variable of type `void**', which represents the
stack pointer - the checker dereferences this pointer to access or
modify the stack and increments/decrements it. Link:
https://github.com/haoNoQ/clang/commit/ab581e4da9c3ef2ddfc0fe75a20ffa9ce789fe8d
None of these checker's code actually deals with ghost regions through
MemRegionManager - they always use wrappers in the ProgramState class,
which in my opinion are easy and intuitive, though probably more such
wrappers need to be introduced.
All the newly introduced terminology is worth discussing, i don't really
like it.
More information about the cfe-dev
mailing list