[cfe-dev] [analyzer] An idea regarding our data map.

Thu May 26 05:56:53 PDT 2016

I've got an idea. The idea comes as a replacement for fixing current 
issues with the GDM (sharing traits across checkers, automating the 
routine work on checker side). The idea is: GDM is not the right place 
to model most of the checker-controlled program state traits, instead we 
should encourage checkers to use the Store, which already has most of 
the necessary features.

Specifically, whenever the checker tries to model a certain data 
structure, which exists (or could have existed) somewhere in the program 
(be it user or kernel memory, or CPU registers, or shadow memory of some 
sanitizer-like tool which could have been run alongside the program), 
then the checker should model this data structure directly in the Store(?)

This approach may be thought of as bringing the idea of "ghost code" 
from model checking into our analyzer. We're supposed construct ghost 
structures during analysis, which mimic real or imaginary things that 
happen during program execution.

With the current region hierarchy, it's a bit hard to follow this 
approach, because memory regions are hard to construct. So my proposal 
is to add a few more region classes, which would allow checkers to 
cheaply, probably even transparently, construct the necessary structures 
in the Store.

A few reasons why i ended up liking this idea:

1. It's natural. After all, the program doesn't have anything like GDM 
when it executes dynamically - it only has memory, which means that the 
Store, being our memory model, should easily take care of modeling 
anything that is going on in the program, and useful features of the 
Store are still useful for such modeling, so why don't we use them. In 
other words, the most natural way to explain how exactly should the 
smart data map automate routine tasks would be to say that it should 
behave exactly like the Store. So why don't we just use the Store.

2. There's no problem with sharing the trait. Any checker can see what's 
going on with a plain getSVal(). Of course, we can hide the key for 
getSVal() in anonymous namespace and probably provide getters/setters 
instead, which wouldn't stop the Store from iterating over bindings.

3. There's no problem with routine work and code duplication in 
checkers, which makes writing checkers a lot easier:
   a. Store transparently takes care of cleaning up live/dead symbols, 
there's no more problems with different worklists for store and checkers.
   b. The store takes care of invalidation, eg. checkRegionChanges could 
probably be automated with this approach, by simply choosing a good 
parent region for our checker data structures.
   c. Also, Store provides us with SymbolRegionValue-class and 
SymbolDerived-class symbols for our structures when values in them are 
unknown; this way we can reason if it's the same unknown or a 
probably-different unknown.
   d. The store can be easily dump()'ed, we no longer need hooks for 
printing the program state on checker side.
   e. In the future, we can probably add alpha-renaming, and then 
problems and solutions would most likely be exactly the same for the 
Store itself and for checker data structures.

4. This approach is very easy to implement and to explain. Most of the 
time, we just use the standard getSVal()/bindLoc() API for everything, 
probably some overrides.

A few arguments against this approach:

1. Some program state traits do not try to model data structures that 
already exist in the program. The best example is probably range 
constraints. Constraints are immaterial for dynamic execution - they're 
a meta-information about chosen execution path. For such traits, i do 
not yet present a solution, though i have a couple of thoughts - if 
anybody's curious, i can dump. In particular, i feel as if 
constraint-like traits can be easily supported with a separate generic 
approach, and it'd cover all of our problems.

2. GDM is done in terms of sets and maps, however Store doesn't have 
sets or maps, it only works with C-like memory with at best 
1-dimensional arrays. Of course we can model maps of ints to ints as 
ghost arrays with symbolic bindings to elements, and it's a safe and 
sane implementation. For a map from int to set of ints, it gets harder. 
But i think it's a non-issue: it's still more important to figure out 
how do we model more complicated containers (i still dream of container 
symbols and container constraints, which would reduce this issue to 
issue '1.'), and then we could re-visit this issue.

I've made proof-of-concept patchs to SimpleStreamChecker (two different 
approaches) and PthreadLockChecker. The patch introduces four kinds of 
ghost regions - each "smart trait" owns its own memory space which 
doesn't get invalidated (for things that don't need to be invalidated, 
or for manual control), and allows creating one variable inside that (or 
any other) memory space, one field inside any region, or one 
symbol-based region for any symbol (unlike SymbolicRegion, the ghost 
symbol-based region doesn't need to be based on a pointer-type symbol, 
any symbol will do fine).

The patches are on my github because i'd re-arrange them if i was to put 
them on phabricator, and not sure what to do with multiple checker 
variants. So i'm asking to have a look at the checkers and see if you 
want to write checkers *that* way, instead of the traditional GDM, then 
i'd focus on fixing smaller issues, or i'd have to come up with 
something different.

SimpleStreamCheckerV2 defines symbol-based ghost regions for file 
descriptor symbols and stores constant values (0 or 1) inside them, 
indicating opened and closed streams. It doesn't need to store the 
"unknown" state - it is represented by non-constant values, eg. 
SymbolRegionValue of the ghost region. Link: 
https://github.com/haoNoQ/clang/commit/578e79009bbac81c129daff19e14dbad16e2e43d

SimpleStreamCheckerV3 defines a ghost field inside each symbolic region 
corresponding to the each stream's pointer symbol, and stores the same 
constants inside these fields. It has troubles suppressing invalidation 
of these fields, which, in my opinion, should ideally be solved through 
evalcall/bodyfarm of library functions. Both versions do not suffer from 
the "zombie symbols" issue. Link: 
https://github.com/haoNoQ/clang/commit/bb63471fa6d70502e61728052b6b7a649c4d80d1

PthreadLockCheckerV2 defines a ghost field inside each mutex, indicating 
one of the four states (Initialized, Locked, Unlocked, Destroyed), which 
allows it to find double locks/unlocks etc. Additionally, it models the 
stack of locks in order to find lock order reversals. This is modeled 
through a single ghost variable of type `void**', which represents the 
stack pointer - the checker dereferences this pointer to access or 
modify the stack and increments/decrements it. Link: 
https://github.com/haoNoQ/clang/commit/ab581e4da9c3ef2ddfc0fe75a20ffa9ce789fe8d

None of these checker's code actually deals with ghost regions through 
MemRegionManager - they always use wrappers in the ProgramState class, 
which in my opinion are easy and intuitive, though probably more such 
wrappers need to be introduced.

All the newly introduced terminology is worth discussing, i don't really 
like it.