[cfe-dev] [analyzer] Zombie symbols.

Mon Mar 28 05:00:03 PDT 2016

 > I really want to spend some time here, but that code is pretty
 > difficult to understand (as you noted)

Umm, before i forget, i guess i'd dump some quick explanation of what i 
figured out about it.

First, we have symbols-or-regions, hereinafter "values" unless it causes 
confusion. Liveness of some values may be determined regardless of any 
other values: for example, a VarRegion of a local variable can be proven 
live via live variables analysis, which is an auxiliary data flow 
analysis that doesn't depend on our symbolic execution engine. However, 
values may also be related to other values via a "keeps-alive relation". 
There are two kind of keep-alive relations:

I. "first-class" keep-alive relations arise from and rely purely on the 
SVal hierarchy (for example, SymbolRegionValue is kept alive by its 
parent region, SymbolicRegion keeps its base symbol alive).

II. "second-class" keep-alive relations arise when symbols and regions 
appear in certain parts of the program state (for example, any region 
would keep alive its contents - all symbols stored inside it).

The purpose of removeDead() is to understand which values present in the 
current program state are dead (and, as a useful by-product, which 
values are live) by exploring the keep-alive relations between values; 
second-class relations are explored with respect to the current program 
state. Whenever a certain value X is determined to be live, any values 
kept alive by X are also known to be live. The liveness information is 
collected from scratch on every pass and does not depend on previous 
removeDead() runs. Once the liveness information is collected, a special 
action is taken on all dead values, which may involve removing them from 
the state, but does not necessarily boil down to just that.

SymbolReaper is an object that contains the incomplete understanding of 
live values on the mark phase, which eventually becomes complete and as 
such can be used on the sweep phase. SymbolReaper's data consists of two 
sets of values - the "live set" and the "dead set". The live set 
contains values already known to be live. The dead set contains the 
values found in the state, liveness of which is explicitly questioned 
and was not yet determined; values may move from the dead set to the 
live set (if they turn out to be live due to a previously unconsidered 
second-class relation to a live value) but not in the opposite 
direction. First-class relations are mostly hardcoded directly into the 
SymbolReaper - see isLive(), isLiveRegion(). Originally, SymbolReaper 
was simply a pair of sets, but later it was extended to take care of 
first-class keep-alive relations.

The removeDead() procedure begins by creating a temporary SymbolReaper 
object that lives for as long as the procedure runs. Then it proceeds 
with querying different manager objects responsible for different 
sections of the program state to explore their section and fill-in the 
liveness information in the SymbolReaper. Each section needs to classify 
each value it is responsible for as live or dead.

Program state sections are queried in the following order:

(1) GDM traits of particular checkers (through CheckerManager and 
checkLiveSymbols() callback).
(2) Environment (through EnvironmentManager).
(3) Store (through StoreManager).
(4) Range constraints (through ConstraintManager).

Section (1) is implemented in each checker in a different manner. The 
checker needs to be aware of the fact that if it doesn't markLive() a 
value here, then all information related to it may be removed from other 
sections of the state (eg. range constraints for symbols or bindings for 
regions). And if the checker doesn't mark any value maybeDead(), it may 
never receive the callback when this value actually dies.

/*
[Q1]: In http://reviews.llvm.org/D12726 you mentioned that checkers mark 
liveness after the store, but now you say it's just the opposite, how so?
[A1]: I was wrong and stay corrected. It seems that checkers do not have 
access to any liveness information obtained from the rest of the state 
for during checkLiveSymbols().
*/

Sections (2), (3), (4) combine(!!) their sweep phase with the mark 
phase. However, for checker traits, the sweep phase is delayed until the 
checkDeadSymbols() checker callback, which goes last at the end of 
removeDead() after (4).

/*
[Q2]: Does taint information for dead symbols ever get removed? Cause 
it's in GDM, but it's neither checkers nor range constraints.
[A2]: Afaik, no; we don't even have State::removeTaint() yet. This is a 
bug, which may probably cause performance problems with taint analysis, 
though maybe not too much, because tainted symbols are rare.
*/

Section (2) scans active expressions, marks values of active expressions 
as live. It also marks values of inactive expressions that are still 
found in the state as dead and removes these expressions and their 
values from the environment. It does not need to rely on any liveness 
information for values to make such decision, which is why this step is 
completely independent and isolated. Note, hovewer, that first-class 
relations play an important role here - any values found inside values 
that were marked dead or live are also explicitly marked dead or live, 
which contributes to complexity of the code. The visitor used here to 
conduct marking of sub-values is not re-used.

Section (3) is the most complex here. It takes care of the following 
second-class keep-alive relation: for every binding in the store, the 
binding key keeps alive the binding value. Both binding keys and binding 
values are (at least, partially/sometimes) "values" (in the 
symbols-or-regions sense), hence it may happen that a certain binding 
value X is also a binding key for another binding value Y, and once 
liveness of X is established, liveness of Y needs to be re-calculated. 
Liveness of binding pairs cannot thus be computed in a single pass 
through the store, and therefore a worklist-like algorithm is 
implemented, that tries to reach a fixed point in which the liveness 
information no longer gets updated. Once the worklist algorithm 
finishes, the following happens: all bindings with dead keys are removed 
from the store, all sub-values of binding keys and all binding values 
and their sub-values are marked as maybe-dead. The visiting algorithm 
for sub-values is not re-used from the Environment.

The method assumes that after the end of (3) no new live marks are 
added. Which is why Store is capable of removing dead bindings without 
requiring a separate pass.

Section (4) is trivial - the constraint manager does not mark any 
symbols live. It marks all constraint keys that were not previously 
proven live as maybe-dead and removes them from the constraint map.

/*
[Q3]: So what if we try to fix the issue mentioned in [Q1] by putting 
(1) after (4)?
[A3]: It's more complicated than that. Checkers may introduce their own 
second-class keep-alive relations between values. For example, 
CStringChecker would keep string length alive for as long as the string 
region is alive. There may be more such relations introduced by other 
checkers. It may happen that a certain checker-specific second-class 
keep-alive relation would cause a store binding key to become live. 
Hence, there needs to be a single worklist algorithm run for both 
store-related second-class keep-alive relations and checker-specific 
second-class keep-alive relations, and (1) needs to be essentially 
merged with (4). And that would make things yet even more complex.
*/