[cfe-dev] A survey of dataflow analyses in Clang

Sun Oct 4 12:42:58 PDT 2020

Hi!

In an earlier email I wrote about how LivenessAnalysis works in Clang [1],
and hinted at how the static analyzer takes advantage of it. I incorrectly
advertised Liveness and UninitializedVariables to be the only two dataflow
analyses written in Clang, whereas consumed annotation checking [2] and
thread safety analysis [3] are also utiliting dataflow. Lifetime analysis
hasn't landed yet [24-28], but is also great to study.

The point of this mail is to take an objective look at these algorithms in
terms of their "dataflowness", identify their similarities and differences,
and what techniques they implement that could benefit all of them. To be
clear, dataflow describes a wide ranges of analyses -- we're looking at
flow sensitive, intraprocedural ones.

I invite you to point out any and all the incomplet and inkorrek items in
this survey.

---=== The many struggles of C++ data flow analysis in Clang ===---

Not all data flow struggles are C/C++, or Clang specific, but some are, and
universal difficulties (aliasing, for instance) might be obfuscated by them
more than usual. The following is a non-comprehensive list of such issues
all data flow algorithms in Clang face.

---  Reads/writes ---

A common theme among data flow algorithms is looking for gens (generations,
or in other words "reads" or "loads") or kills (in other words "writes" or
"stores") of either variables or values. When we say GEN or KILL set, we
mean a set of variables generated, or killed at a given statement. GEN and
KILL sets can be defined for CFGBlocks as well from the sets of the
contained statements. Many data flow analyses are defined with a fixpoint
algorithm on GEN/KILL sets. In literature, we assume that these sets are
precalculated, but that is rarely a case for Clang.

In literature, most data flow algorithms are defined and showcased on
either some simple pseudo code, or an instruction set in a three address
form:

x <- y + z

This is a very significant simplification, as the appearance of a
variable's in Clang's AST (DeclRefExprs) is not very telling of its
context. [10, 11, 13] all mention this a very significant difficulty in
Clang: only the surrounding context, and even that with great difficulties,
can tell whether a variable is written or read. LLVM IR is largely void of
this problem, but in [14] Chris Lattner provides a lengthy reasoning why
it'd be next to impossible to lower to LLVM IR, conduct analysis, but still
produce diagnostics in the original source code. LLVM IR is still tempting
enough though that we have ongoing projects to get some use out of it [15].

This poses another problem: in literature, transfer functions are
responsible for the flow of dataflow information through a statement. For
liveness, a transfer function for "x = a;" would tell that "a" should be
added to liveness set, and "x" should be removed. In clang however, where
transfer functions are implemented with statement visitors, the fact that
we need substantial contextual information for any given statement is
sometimes a problem.

There is a push for an MLIR dialect for C++. You can find a discussion in
[19], and there will also be talk on the next dev meeting, but that will
happen after this mail [17, 18].

--- Classic data flow issues: Aliasing, function calls, unfeasible
execution paths ---

Aliasing is an often discussed, unavoidable obstacle for C/C++, and
intraprocedural of most dataflow algorithms is also problematic.

int a = 5;
int *p = &a;
foo(a, p);
// foo can not know that using the second parameter affects the first.
// Also: will foo write these arguments? Will it read them?

This naturally allows a third category next to gen/kill: unknown. On such a
statement, the variable may be read or written, maybe neither, maybe both.
Literature solves this by:
* Overestimating: If we intend to emit diagnostics for values computed but
never read (dead values), we say that it is not read only if it is killed
without ever being read, or occuring is such an unknown statement (we
overestimate the liveness sets),
* Underestimating: For uninitialized variable misuse, we only look for
variables that were read, but never killed up to that point or appeared in
an unknown context (we underestimate the uninitialized variable sets).
However, in practice, its worth keeping track of such unknown events. Some
analyses go further, and assign a sort of confidence level: "we're not
absolutely sure whether this is kill, but its reasonable to believe that it
isn't".

Unfeasible execution paths are also an inherent problem for dataflow.

if (a)
  x = 0;
if (!a)
  10 / x;

Sure the flow of control will never go through both of these then branches
(in a single threaded environment), but that is an undecidable problem in
dataflow. The solution here is the same as above, under or over
approximation. The foundation of this lies in lattice theory, which is
explained in [30, 31].

--- Records, fields, methods ---

Suppose you need to conduct data flow analysis on the following code in
clang, and need to identify variables in the code:

void A::foo(int u) {
  int i;
  this->a = 5;
  S s;
  s.a->get().b = 6;
  (*s.a).c = 9;
  // ...
}

The parameter "u" and "i" are easy enough, they are both VarDecls, and so
is the implicit this parameter. "s" is also a VarDecl, but its fields
aren't, they are FieldDelcs. An option would be to keep record of a chain
of fields -- for s.a.b.x.z, a pair like this would indeed identify which
memory region we're describing: (s, [a, b, x, z]). However, there are a lot
of ways to obfuscate this specific memory region, for example, if we ever
support some aliasing the variety of ways dereferencing can be expressed
will pose a challenge. Simply put, its difficult to compare expressions
[16].

---=== 1. LivenessAnalysis ===---

Liveness answers the question whether value computed at one point will ever
be used before being overwritten:

int a = 5;
a = 6;
if (rand())
  process(a);
//...

The value 5 from 'a' will never be read, meaning that 'a' is not live from
the point of its initialization until it is assigned 6. After that point it
is live until the next time it's assigned, because there exists a path from
the assignment to a read of 'a' without any interleaving assignments.

The product of the analysis is liveness sets (sets of variables or sets of
expressions). They are calculated for each statement, and transitively,
each CFGBlock. If a variable/expression is found in this set, it means that
they are live (will be read from later) at that point. The algorithm is
described and implemented as a
* Backward analysis: whether a value will be read is a property of the
future
* May analysis: we're interested whether it *could* be read from, not
whether its guaranteed. Coupled with the fact that infeasible execution
paths are also analyzed, this will overapproxiate the actual liveness sets.

It was mentioned above that transfer functions require considerable context
to know whether a statement writes/reads a variable. Transfer functions for
DeclRefExpr that look for variables usages have a problem, namely that by
the time a statement visitor reaches a DeclRefExpr the read/write
information is already lost. So, how does Liveness solve this issue? Does
it implement a pass that categorizes DeclRefExprs before the main analysis
runs? It turns out, the fact that Liveness is a backward analysis allows it
to gather this information on the go [34]:

For the following code snippet:

void f() {
  int i;

  i = 5;
}

This is the generated CFG:

void f()
 [B2 (ENTRY)]
   Succs (1): B1

 [B1]
   1: int i;
   2: 5
   3: i
   4: [B1.3] = [B1.2]
   Preds (1): B2
   Succs (1): B0

 [B0 (EXIT)]
   Preds (1): B1

In B1, the 4th element, which is the assignment, is visited before the 3rd,
where the written variable is, meaning that by the time the transfer
function for DeclRefExpr is ran, Liveness could've already noted it as a
write.

The analysis, as described in [4], which Clang follows closely, ignores the
worst of the read/write problem. Only variable declarations and assignments
are regarded as writes. If a DeclRefExpr pops up in any other context, it
is seen as a read, even if its passed variables to function call as a
non-const reference, etc. Also, the algorithm doesn't respect aliasing, and
doesn't regard the fields of a record as a distinct, writable memory
regions. However, there is still value in the analysis interpreted this
way. The Static Analyzer conducts liveness analysis with some relaxations
[1], and its results are used, with a few other techniques, to know how
long a variable's memory region needs to be kept in the Environment. This
simplicity also allows the algorithm to have a very simple implementation.

The way the Static Analyzer utilizes this analysis requires clang to keep
the liveness sets for a while, making it the only dataflow analysis that
doesn't immediately discard its results. This might explain the choice of
data structures, which is, contrary to what many literature suggests
(bitvectors), are ImmutableSets. Since these sets are accessed far more
regularly then the result of other analyses, its a non-trivial claim that
either solution is faster than the other.

[1] is purely dedicated to Clang's Liveness, and is a great source for
further reading.

--- Strengths / Deficiencies ---

+ Straightforward implementation without many cornercase handling or
struggle with the difficult read/write problem.
+ The result of the analysis is cached through clang::ManagedAnalysis, and
is easily to query.

- Has some crude hacks, the code is a bit challenging to follow and is
rather poorly documented.
- Non-relaxed liveness analysis has tests, but isn't actually used anywhere
in the codebase.
- The choice of data structures is probably inefficient.

---=== 2. UninitializedVariables ===---

The point of this analysis is self explanatory -- catch the usage of a
variable that has been initialized. There are a variety of severities we
can define: a variable is uninitialized on all paths leading to the usage,
or only on some (which can be divided into further subcategories).

Some literature, like [4], talks about uninitialized variable analysis as a
side product of liveness analysis, while others, like [30], describe it as
a distinct analysis. [4] isn't necessarily wrong, if a variable is known to
be uninitialized but is live, then an uninitialized value is read on some
path of execution. However, liveness is an over-approximating algorithm,
meaning this path of execution might be infeasible. For diagnostics,
uninitialized variables should be under-approximating to avoid false
positives. On the implementation side, the categorization of statements as
variable reading/writing requires far more sophistication then what
liveness offers. For this reason, this is a
* Forward analysis: Initializedness is a property of the past.
* Must analysis: We want to be sure that the variable is uninitialized.
However, in practice, we're a bit more lenient on this, as we shall see.

Uninitialized variables doesn't enjoy the advantage Liveness does regarding
the categorization of statements as variable reading/writing, so it
implements multiple passes:

--- First pass: ---

Clang doesn't have any precalculated KILL/GEN sets, though
UninitializedVariables stops just short of doing that. As a distinct pass,
it scans the code associated with the CFG (for the most part, that is a
FunctionDecl), and classifies each DeclRefExpr whether its an
initialization/write, read, read as a const reference, self initialization
(int x = x;), or should be ignored by the analysis.

void use(int x); // note that this is a pass-by-value parameter

int x;
if (rand())
  x = 5; // Write of x
use(x); // Read of x

--- Second pass: ---

This is where uninitialized variable usages are actually calculated.
Indeed, this is reminiscent of liveness analysis: we assign an
uninitialized variables set to each statement, initially with the
information gathered from the first pass, then propagating it with a
fixpoint algorithm. We found an uninitialized variable use, when the
fixpoint algorithm halts, and a set associated with a statement notes a
variable as uninitialized, yet the same statement also reads this variable.

During the fixpoint calculation, there are a few extra steps to be taken
that are absent in LivenessAnalysis:

* Variable declarations (which are not DeclRefExprs, and as such won't be
classified in the first pass), and their lack of initialization are noted
here.
* Handling of special cases: inline asm, ObjectiveCMessageExpr,
OMPExecutableDirective
* If a variable read is found where the variable is sometimes uninitialized
(not on all paths leading to the read), we'll look for block edges that
unconditionally lead to an uninitialized variable. This is used to say
stronger than 'may be uninitialized', namely that 'either it's used
uninitialized or you have dead code'. There is a great block of comments
that details this in the actual code [5].

--- Third pass ---

Though technically this is an implementation detail, a third pass is used
to emit diagnostics. Block visitations can be accompanied with a callback
object, which is a caching object in the second pass, and a report emission
object in the third.

--- Strengths / Deficiencies ---

+ Has a far more sophisticated understanding of when a variable is read
from / written to.
+ Has a great categorization system. Variables that are maybe uninitialized
can be analyzed further to judge how likely the lack of initialization is.

- Lacks support for record fields.
- The third pass looks like a hack, and could probably be factored out.
- The analysis results are non-reusable. Is not a clang::ManagedAnalysis.

---=== 3. Thread Safety analysis ===---

Clang's oldest dataflow analysis, Liveness, was committed on the 6th of
September, 2007 [6]. Thread safety analysis isn't much younger though, the
earliest proposals and discussions popped up in July, 2008 on the GCC
mailing lists [7-8]. Its eventual port to clang came a few years later in
June 2011, which was proposed by Caitlin Sadowski [9], and was presented at
the 2011 LLVM Developers' Meeting [10].

The analysis is a subject of a 2014 article [11], notably after the
analysis matured some. It is an amazing piece of literature, very easily
digestible, but doesn't sacrifice anything for readability. If you want to
learn more about many of the challenges data flow presents in Clang, or
thread safety checking in general, I highly recommend you read it. You can
assume that anything I write about this analysis is cited from either this
paper or the similarly amazing dev talk.

The domain of the analysis is different this time: Liveness looks for
expressions, and both Liveness and UninitializedValue look for local
variables. The domain of Thread Safety analysis is on _annotated_ member or
non-member functions in addition to _annotated_ variables.

Annotations are a very important component of this analysis. The user
annotates which variable or function is guarded by which mutex. Functions
that lock or unlock a mutex are annotated likewise. Annotations on
functions have an interesting effect; though the analysis is still
intraprocedural, it will be able to argue across function calls and achieve
some path-sensitivity [25 at 3:27]: If a function's precondition is to have
a set of mutexes locked, that must be annotated (otherwise Clang will
assume said mutexes are unlocked), and this precondition can be checked on
the caller side. If the precondition and the postcondition have different
set of lucked mutexes, that must be annotated as well (otherwise Clang will
emit a warning), so the callee can check whether the post condition holds.

A set of locked mutexes are computed for each statement in a block. The set
for the last statement in the block becomes the set associated with the
block.

// Some CFGBlock, suppose that no mutexes are locked before reaching it
{
  /*{}      */ std::mutex mu, um;
  /*{}      */ int a GUARDED_BY(mu);
  /*{mu}    */ mu.lock();
  /*{mu, um}*/ um.lock();
  /*{mu, um}*/ foo();
  /*{um}    */ mu.unlock();
}
// {um} is the lock set associated with this block

Suppose you have the following CFG:

             -> [B3] ->
           /            \
[B1] -> [B2] -> [B4] -> [B5] -> [B6]

Suppose the blocks lock the following mutexes before propagating this
information with the use of dataflow:

             -> {mu} ->
           /            \
{  } -> {  } -> {mu} -> {  } -> {(unlocks mu)}

How do we propagate this information throughout the graph? Following the
classic fixpoint algorithms most dataflow analyses implement, this is
simple enough; since all ascendant blocks to B5 locks "mu", it must be
locked in B5 as well (there is no need to think about under/over
approximation). Since B6 unlocks mu, it won't be present in its lock set.

             -> {mu} ->
           /            \
{  } -> {  } -> {mu} -> {mu} -> {  }

Suppose B4 doesn't lock "mu" (for instance, B2 branches on mu.try_lock()).
In that case, the joint point at B5 will have differing incoming lock sets:

             -> {mu} ->
           /            \
{  } -> {  } -> {  } -> {  } -> {(unlocks mu)}

What should the lock set of B5 be after the analysis? What if we introduce
a loop:

             -> {mu} ->
           /            \
{  } -> {  } -> {  } -> {  } -> {(unlocks mu)}
          \              /
           <-------------

We could decide that for each cases where we need to merge differing lock
sets:

* We merge the sets (a may analysis, the lock sets will be overestimated)
* We intersect them (a must analysis, the lock sets will be underestimated)
* Note that the lock may be either locked/unlocked, similarly to
uninitialized variables analysis.

Thread safety analysis cuts the Gordian knot [12] by asserting that code
that allows differing lock sets on merge points is error prone, and emits a
warning. This is a very significant simplification, since there is no
longer a need for a fixpoint algorithm, the sets can be calculated and
propagated in O(N) time for a CFG with N blocks. It could be argued that
this is too restrictive, but [11 p5.] is confident that that allowing such
constructs doesn't have a real practical benefit, and is a sign of code
smell.

--- Thread Safety's own intermediate language ---

Thread Safety analysis allows the annotation of methods and pointees. This
poses a very big problem: the comparison of expressions [16].

 class A { Mutex mu; int dat GUARDED_BY(this->mu); }
 class B { A a; }

 void foo(B* b) {
   (*b).a.mu.lock();     // locks (*b).a.mu
   b->a.dat = 0;         // substitute &b->a for 'this';
                         // requires lock on (&b->a)->mu
   (b->a.mu).unlock();   // unlocks (b->a.mu)
 }

There is only a single mutex that is locked/unlocked, but the AST makes
this difficult to understand. Just as the code states, knowing that
b->a.dat is guarded by b->a.mu is a non-trivial step as well. For this
reason, stunningly, Thread Safety Analysis lowers Clang expressions,
implementing its very own IR, a Typed Intermediate Language, or TIL.

Quoting from [20] (this has changed a lot since then, but captures the
essence of it):

"Unlike a clang Expr, a SExpr does not capture surface syntax, and it does
not distinguish between C++ concepts, like pointers and references, that
have no real semantic differences. This simplicity allows SExprs to be
meaningfully compared, e.g.
        (x)          =  x
        (*this).foo  =  this->foo
        *&a          =  a"

The implementing code is well documented, but I failed to find discussion
previous to the first commit, or some high level description of it. [40]
offers some, but is very in medias res. I'm honestly not confident enough
in my knowledge of IRs to talk about this much. If you have any pointers,
any and all would be appreciated!

--- Some dataflow aspects ---

Thread safety is a forward analysis. To detect an improper write on a
variable that should be guarded by a mutex implies that the read/write
problem needs to be solved. However, unlike UninitializedVariables analysis
which is also a forward analysis, this isn't done in a distinct pass. The
reason behind this is that thread safety doesn't have a transfer function
for DeclRefExpr. Instead, it has a transfer function for each context that
might access a variable: unary/binary operators, lvalue-to-rvalue casts,
call expressions, etc, translates its argument to TIL, and checks whether
it was a variable of interest.

Whenever a function that locks/unlocks a mutex is found, the corresponding
mutex expression (for "a.b.Lock()", this is "a.b", for a function annotated
with "ACQUIRE(mu)", this is "mu") is translated to TIL. These TIL
expressions form the actual lock sets. When a variable is found to have
been read/written, the analyzer checks whether the guarding mutex is in the
lock set.

This analysis doesn't care about the aliasing issue. This can cause false
positives, but the rationale is that these kinds of reports are an
acceptable sacrifice for greater coverage.

That is all I was interested -- I can't recommend [11] and [10] enough if
you wanna dive deeper, or peek at the code, it practically reads itself
aloud.

--- Strengths / Deficiencies ---

+ Exceptional documentation. Articles, design discussions are readily (and
without charge!) available on  the web.
+ Has strong C++ support, it is able to translate Exprs into its own
intermediate language. Seriously, that is nuts.
+ Actively maintained.
+ O(n) analysis time for a CFG with n blocks.
+ Solves the read/write issue without a distinct pass.

- The analysis results are non-reusable. Is not a clang::ManagedAnalysis.
- TIL seems to be severely undertested. You can litter a lot of
llvm_unreachables in the code, and no tests will fail.
- It seems like you can't annotate subfields, but that might as well be a
feature:

struct A { int i; };
struct B {
  A a; // how to annotate B::a.i?
};

---=== 4. Consumed analysis ===---

Before its implementation, consumed analysis was called uniqueness analysis
[21]. In his proposal, Christian Wailes explains the motivations in great
depths in [22]. In terms of development, this analysis laid dormant for
about 5 years, but has received a few improvements last year.

Consumed analysis is very similar, but not as high-tech as Thread Safety
is; it is also a single-pass, annotation based algorithm, that tracks a set
of consumed/unconsumed variables, instead of a set of locked mutexes. While
I didn't devote too much time to this analysis (party because it doesn't
have as much available discussion as Thread Safety, and party because it
seemed really similar to it), I personally think this is the least mature
of the bunch.

---=== 5. Lifetime analysis ===---

There is a fifth major data flow analysis in clang I'm aware of, though it
isn't a part of upstream clang: Lifetime analysis. This was based largely
on Herb Sutter's paper [29], and has been the subject of numerous talks
[24-28].

[29] is an amazing, step-by-step description on the lifetime algorithm.
[25] is a great talk about how this was implemented in Clang specifically,
which was preceded by design discussion on the mailing list [32, 33].
Herb's paper and the dev talk alone are so thorough, I can't really do it
any justice in this survey. As such, this mail will focus on the
"dataflowness" of the algorithm, largely ignoring non-flow sensitive
lifetime analyses and the kinds of ownership categories that were proposed
with annotations, etc. This simplification quickly boils the discussion
down to what is essentially a points-to analysis.

Points-to analysis collects a set of memory regions for each pointer it
might point to at different points of the program. If this set contains
null or invalid, that is a sign of a programmer error. This implies a

* Forward analysis: what a pointer will point to is a property of the future
* May analysis: we are interested in the set of memory regions the pointer
might point to across all paths in the CFG, meaning that we're
overapproximating the real points-to set.

The analysis thoroughly defines the semantics of the points-to sets, or
psets, as literature calls it. Similarly to thread safety analysis with
try_lock(), these sets may change at a condition point as well, should the
a pointer in the pmap (which maps pointers to their psets) participate in
them:

if (p == q) {
  // pset(p) == pset(q)
}
if (p) {
  // pset(p) doesn't contain null
}

They are affected by, well, lifetime:

int *p; // pset(p) == {invalid}
{
  int x;
  p = &x // pset(p) == {&x}
} // pset(p) == {invalid}
*p = 5; // warning

Asserts pose a serious problem for lifetime analysis in particular. Suppose
you have the following assert:

assert(p && q);
...

(bool)(p && q) ? 0 : abort(); // assert macro expanded
...

And the CFG would look like this:

    ------------>           ----> [abort()]
  /              \        /
[p] -> [q] -> [(bool)(p && q)] -> [...]

The path on which pset(p) == {null} merges with the path where pset(p)
doesn't contain null, before the nonreturn abort() block is reached.
Lifetime analysis solves this by an amazing technique; when it encounters
such a pattern, both psets for the false and true branches are kept, doing
a sort of local pathsensivity. Later, on the next branch, the the
appropriate set is propagated to the true/false paths.

Thread safety neatly avoids with a distinct annotation
(ASSERT_EXCLUSIVE_LOCK) that can be used instead of the standard assert.
Liveness, UninitializedVariables sets don't change on conditions, so they
aren't affected.

A may analysis in this context is risky. The set may have included null or
invalid from an infeasible path of execution. To help filter out the
reports, lifetime analysis employs an array of great tricks to find reports
that are far more likely to be true positives. Such is the case when an
assignment of null to a pointer and a dereference of that pointer is in a
dominance or post-dominance relationship, similar to what was written about
UninitializedValues. Its hard to tell whether a statement that adds
null/invalid to a pset is a reaching definition to the dereference point,
but there is no reaching definitions in clang yet, so dominance sets must
suffice for the time being.

Lifetime analysis was conceptualized with strong C++ support from the
get-go. Records object that have an Owner data member are Owners
themselves, non-Owner record objects that have a Pointer data member are
Pointers, and non-Owner, non-Pointer record objects are Aggregates (I left
a lot of rules out, see [29 p8-10] for details). Owners and Pointers are
modeled as a single variable like any other non-record object, and
aggregates are exploded, each field is regarded as a variable:

struct A { int x, y; };
struct B { A a; int z; };

B b; // variables: b.a.x, b.a.y, b.a, b.z, b

The Variable [25] class accurately captures the domain of the analysis. It
gracefully handles CXXThis, temporaries and local variables.

As the psets are calculated, the read/write problem pops up naturally. [29
p12-24] thoroughly explains what the solution is; lets see an example
before the explanation:

a = &b.a;

Relevant part of the CFGBlock:

4: b
5: [B1.4].a
6: &[B1.5]
7: aptr
8: [B1.7] = [B1.6]

The created psets:

Set RefersTo[DeclRefExpr] = (b)
Set RefersTo[MemberExpr] = (b.a)
Set PSetsOfExpr[UnaryOperator] = (b.a)
Set RefersTo[DeclRefExpr] = (aptr)
PMap[(aptr)] = (b.a)
Set RefersTo[BinaryOperator] = (aptr)

At CFGElement 4, "b" is assigned the pset {b}, at E5 "b.a" is assigned
{b.a}, at E6 "&b.a" is assigned {b.a}, and at last, at E7 "aptr" is
assigned {aptr}. The transfer function for the assignment sees that the
pset on the LHS, {aptr}, resolves to the Variable "aptr", and assigns it
the pset on the RHS, {b.a}.

In essence, the transfer function for each expression calculates their
respective psets, while the transfer function for some expressions
propagate these sets. This implies that at the point of a DeclRefExpr, the
analysis doesn't need to know whether the pset is in the context of an read
or write.

The domain of this analysis is entirely different than liveness or
uninitialized variables, so what was previously mentioned as GEN/KILL sets
don't make sense here. Also, [29] calls a variable killed when it goes out
of scope, not when it is written.

--- Strengths / Deficiencies ---

+ Unlike other analyses, the relevant literature is centered around C++, so
the challenges that the language poses doesn't present an immediate
engineering problem.
+ Exceptionally well documented in literature/talks.
+ The same algorithm is already implemented in MSVC, so it is battle-tested.
+ Excellent support for C++ via the Variable class.
+ Solves the read/write problem in a single pass as a forward analysis.

- The analysis results are non-reusable. Is not a clang::ManagedAnalysis.
- Still WIP, it uses inefficient data structures, and has a lot of
TODOs/FIXMEs.
- Has to compensate for the fact that psets are overestimated, but lacks
the toolset to do it particularly well.

---=== 6. Reaching definitions? ===---

If you wanna see how would somebody implement a dataflow algorithm without
knowing what on earth dataflow even is, or have done any meaningful
research, I recommend looking through my earlier works. The idea to
implement this was first fetched in a GSoC meeting in June 11th, 2019, and
the first mail on the list was [36]. Conversations about using it as a part
of my GSoC to help on the static analyzer's report generation followed in
[37, 41]. Since then, it popped up as an idea here and there, the above
mentioned links will guide you to them, if you care.

At this point, I was confident in my ability to implement this algorithm,
because I read the relevant wikipedia article, and [4]'s relevant chapter.
I never understood what transfer functions are, let alone what lattice
theory is, and never bothered to look at other dataflow analyses. [38] is
the latest version I uploaded, and the questions raised made me realize how
far behind I was truly lagging. Hence this survey!

---=== Conclusion ===---

[31 chapter 9.3] describes a dataflow analysis framework as follows: it is
a (D, V, ^, F) quadruple, where
* D is the direction of the dataflow,
* V is the lattice, which includes the domain of the analysis,
* ^ is the meet operator,
* F: V -> V, a family of transfer functions.

Out of these 4, we have D factored out [39], and F, if we regard statement
visitors as such. As for the other two, an agreement on the most efficient
data structure with a merge/intersect operation would tie things together.
All dataflow algorithms above need to manage sets for each statement in the
CFG, and these sets are usually created from one another. Uninitialized
variables and thread safety, and literature in general agree that
bitvectors are a great choice, yet both of these analyses implement their
own solution.

No analyses but liveness is reusable.

Support for the fields of a record are only supported by thread safety and
lifetime. Thread safety's TIL is amazing, but it is also TIL's only user,
and it only uses a tiny fraction of it. Also, as a compiler novice myself,
I found that it drastically increased the barrier of entry to its codebase,
though I'm yet to study IRs in depth. Lifetime's approach is much simpler,
and I admit to have not understood what really justifies TIL.

For GEN/KILL analyses, such as liveness, uninitialized variables and
reaching definitions, it should be possible to factor a lot of knowledge
out about what statements read/write a variable. However, these analyses
don't always agree on what is a GEN or KILL, so any attempt at refactoring
must be configurable to some extent.

---=== Further reading ===---

I'm hardly a scholar on dataflow analyses, and I haven't taken all of the
available compiler studies related courses in my uni. Though I'm trying to
make a decent dent in all the great books and scholarly articles I have
access to, I've got more to go; here are some sources I used and can
recommend if you are looking for more dataflow knowledge:

* Engineering a Compiler [4] is an excellent book on compiler design. It
starts scanners, parsers, talks about intermediate representations, various
forms of dataflow analyses, optimization all the way to code generation. As
one can tell, this isn't laser focused on static analysis for diagnostic
purposes, and definitely aims to use the fruits of static analysis mainly
for optimization. Nevertheless, static analyses engineered for optimization
are still an invaluable tool for spotting programming errors. Chapters on
intermediate representations and dataflow analyses are a great read, though
in my view they don't trivially translate to how one can implement them in
Clang.

* Static Program Analysis [30] is an amazing and completely free book on
static analysis. It takes a large step towards formality compared to [4],
and offers a numerous exercises for the reader, though isn't quite as
thorough as [31]. It presents lattice theory, a foundation of the presented
analyses, and shows that building on this theory gracefully handles the
problem of over/under approximation. This book gave me a far better
understanding of dataflow, and stands a lot closer to what I've observed in
Clang. Up to this point, this has definitely been my favourite book to
learn from regarding this subject.

* Compilers: Principles, Techniques and Tools [31], AKA the second edition
of the infamous dragon book. As I understand it, its a bigger and better
version of [4], but doesn't have same target audience. [4] is a lot
friendlier to beginners, and doesn't delve into the theoretical background
of a topic unless its necessary to understand the chapter. It is more
formal, laying out the mathematical foundations for most items. Chapter 9.
"Machine independent optimizations" is a dataflow treasure chest,
practically a must read, but as a novice, I definitely appreciated that I
already read up on dataflow before.

* For practical applications of dataflow analyses, all the references below
are great; they are mostly a sample of design discussions/patches of the
past 12 years in Clang.

---=== Closing words ===---

Special thanks to Gábor Horváth and Gábor Márton for helping me collect
some relevant literature and discussions! I definitely plan to spin parts
of this mail into a paper of some sort in the future, please don't beat me
to it :)

Cheers!
Kristóf Umann

---=== References ===---

[1] [cfe-dev] [analyzer] An in-depth look at Liveness Analysis in the
Static Analyzer,
http://lists.llvm.org/pipermail/cfe-dev/2020-July/066330.html
[2]
https://clang.llvm.org/docs/AttributeReference.html#consumed-annotation-checking,
clang/test/SemaCXX/warn-consumed-analysis.cpp
[3] https://clang.llvm.org/docs/ThreadSafetyAnalysis.html
[4] Cooper, Keith, and Linda Torczon. Engineering a compiler, second
edition. Elsevier, 2011.
[5]
https://github.com/llvm/llvm-project/blob/054704082b461418d3dac3a379792cdaf52d40b3/clang/lib/Analysis/UninitializedValues.cpp#L511
[6] Added an early implementation of Live-Variables analysis built on
source-level CFGs. This code may change significantly in the near future as
we explore different means to implement dataflow analyses.
https://reviews.llvm.org/rGb56a9909556aee1670e88adcf8b0a56dc25c5c9e
[7] Thread safety annotations and analysis in GCC,
https://gcc.gnu.org/pipermail/gcc/2008-June/177461.html, continuing in
https://gcc.gnu.org/pipermail/gcc/2008-July/178359.html
[8] C/C++ Thread Safety Annotations proposal,
https://docs.google.com/document/d/1_d9MvYX3LpjTk3nlubM5LE4dFmU91SDabVdWp9-VDxc/edit
[9] [cfe-dev] Proposal for thread safety attributes for Clang,
http://lists.llvm.org/pipermail/cfe-dev/2011-June/015899.html, continuing
in http://lists.llvm.org/pipermail/cfe-dev/2011-July/015906.html
[10] 2011 LLVM Developers’ Meeting: D. Hutchins “Thread Safety Annotations
in Clang”, https://www.youtube.com/watch?v=1em66mRozm0&ab_channel=LLVM
[11] Hutchins, DeLesley, Aaron Ballman, and Dean Sutherland. "C/c++ thread
safety analysis." 2014 IEEE 14th International Working Conference on Source
Code Analysis and Manipulation. IEEE, 2014.,
[12] https://en.wikipedia.org/wiki/Gordian_Knot
[13] 2019 EuroLLVM Developers’ Meeting: T. Shpeisman & C. Lattner “MLIR:
Multi-Level Intermediate Representation for Compiler Infrastructure,
https://www.youtube.com/watch?v=qzljG6DKgic
[14] [cfe-dev] LLVM Dev meeting: Slides & Minutes from the Static Analyzer
BoF (2015. November)
http://lists.llvm.org/pipermail/cfe-dev/2015-November/045872.html
[15] [cfe-dev] [analyzer][RFC] Get info from the LLVM IR for precision,
http://lists.llvm.org/pipermail/cfe-dev/2020-August/066537.html
[16]
https://github.com/llvm/llvm-project/blob/master/clang/include/clang/Analysis/Analyses/ThreadSafetyTIL.h#L27
[17] [cfe-dev] [analyzer][RFC] Get info from the LLVM IR for precision,
mentioning the talk,
http://lists.llvm.org/pipermail/cfe-dev/2020-August/066611.html
[18] CIL : Common MLIR Dialect for C/C++ and Fortran, 2020 Virtual LLVM
Developers' Meeting.
[19] [llvm-dev] MLIR for clang
http://lists.llvm.org/pipermail/llvm-dev/2020-February/139192.html
[20] Thread safety analysis: refactor to support more sophisticated
handling. https://reviews.llvm.org/rC161690
[21] [patch] Adding Consumed Analysis to Clang,
https://reviews.llvm.org/D1233
[22] [cfe-dev] Proposal for Uniqueness Analysis,
http://lists.llvm.org/pipermail/cfe-dev/2013-July/030635.html, continuing
in http://lists.llvm.org/pipermail/cfe-dev/2013-August/031331.html
[23] https://github.com/mgehre/llvm-project/tree/lifetime
[24] CppCon 2018: Herb Sutter “Thoughts on a more powerful and simpler C++
(5 of N)” https://youtu.be/80BZxujhY38?t=1095
[25] 2019 EuroLLVM Developers’ Meeting: G. Horvath & M. Gehre: Implementing
the C++ Core Guidelines' Lifetime Safety Profile in Clang
https://www.youtube.com/watch?v=VynWyOIb6Bk&ab_channel=LLVM
[26] CppCon 2018: “Implementing the C++ Core Guidelines’ Lifetime Safety
Profile in Clang”
https://www.youtube.com/watch?v=sjnp3P9x5jA&ab_channel=CppCon
[27] CppCon 2019: Gábor Horváth, Matthias Gehre “Lifetime analysis for
everyone”, https://www.youtube.com/watch?v=d67kfSnhbpA&ab_channel=CppCon
[28] Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth.
CoreHard Spring 2019,
https://www.youtube.com/watch?v=EeEjgT4OJ3E&ab_channel=corehard
[29] Herb Sutter: Lifetime profile v1.0 posted.
https://herbsutter.com/2018/09/20/lifetime-profile-v1-0-posted/
[30] Møller, Anders, and Michael I. Schwartzbach. "Static program
analysis." Notes. Feb (2012). https://users-cs.au.dk/amoeller/spa/spa.pdf
[31] A. V. Aho, Monica S. Lam, R. Sethi, Jeffrey D. Ullman. "Compilers:
Principles, Techniques and Tools", Pearson New International Edition, 2nd
Edition
[32] [cfe-dev] [RFC] Adding lifetime analysis to clang,
http://clang-developers.42468.n3.nabble.com/RFC-Adding-lifetime-analysis-to-clang-tt4063068.html
(the
thread is very hard to follow on cfe-dev, so this site is a better bet)
[33] [cfe-dev] [RFC] Upstreaming Lifetime Function Annotations
http://lists.llvm.org/pipermail/cfe-dev/2019-December/064067.html
[34][analyzer][Liveness][NFC] Remove an unneeded pass to collect variables
that appear in an assignment, https://reviews.llvm.org/D87518
[35]
https://github.com/mgehre/llvm-project/blob/5fe46e6702499ec95280a46b3a56c7092ab44660/clang/include/clang/Analysis/Analyses/LifetimePset.h#L23
[36] [cfe-dev] [analyzer] Speaking about reaching definitions...
http://lists.llvm.org/pipermail/cfe-dev/2019-June/062733.html
[37] [cfe-dev] [analyzer] Using reaching definitions analysis for bug
report construction,
http://lists.llvm.org/pipermail/cfe-dev/2019-July/062885.html
[38] [analysis][analyzer] Introduce the skeleton of a reaching definitions
calculator https://reviews.llvm.org/D76287
[39] [DataFlow] Factor two worklist implementations out,
https://reviews.llvm.org/D72380
[40] [cfe-dev] Next step (discussion on TIL)
http://lists.llvm.org/pipermail/cfe-dev/2014-May/036983.html
[41] GSoC 2019: Enhancing bug reports in the Clang Static Analyzer,
https://szelethus.github.io/gsoc2019/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20201004/f8632bec/attachment-0001.html>