[cfe-dev] [RFC] A dataflow analysis framework for Clang AST

Tue Oct 12 08:32:35 PDT 2021

Summary

We propose a new static analysis framework for implementing Clang bug-finding
and refactoring tools (including ClangTidy checks) based on a dataflow
algorithm.

Some ClangTidy checks and core compiler warnings are already dataflow based.
However, due to a lack of a reusable framework, all reimplement common
dataflow-analysis and CFG-traversal algorithms. In addition to supporting
new checks, the framework should allow us to clean up those parts of Clang
and ClangTidy.

We intend that the new framework should have a learning curve similar to
AST matchers and not require prior expertise in static analysis. Our goal
is a simple API that isolates the user from the dataflow algorithm itself.
The initial version of the framework combines support for classic dataflow
analyses over the Clang CFG with symbolic evaluation of the code and a
simple model of local variables and the heap. A number of useful checks can
be expressed solely in terms of this model, saving their implementers from
having to reimplement modeling of the semantics of C++ AST nodes.
Ultimately, we want to extend this convenience to a wide variety of checks,
so that they can be written in terms of the abstract semantics of the
program (as modeled by the framework), while only matching and modeling AST
nodes for key parts of their domain (e.g., custom data types or functions).
Motivation

Clang has a few subsystems that greatly help with implementing bug-finding
and refactoring tools:

   -

   AST matchers allow one to easily find program fragments whose AST
   matches a given pattern. This works well for finding program fragments
   whose desired semantics translates to a small number of possible syntactic
   expressions (e.g., "find a call to strlen() where the argument is a null
   pointer constant"). Note that patterns expressed by AST matchers often
   check program properties that hold for all execution paths.
   -

   Clang Static Analyzer can find program paths that end in a state that
   satisfies a specific user-defined property, regardless of how the code was
   expressed syntactically (e.g., "find a call to strlen() where the
   argument is dynamically equal to null"). CSA abstracts from the syntax of
   the code, letting analyses focus on its semantics. While not completely
   eliding syntax, symbolic evaluation abstracts over many syntactic details.
   Semantically equivalent program fragments result in equivalent program
   states computed by the CSA.

When implementing certain bug-finding and refactoring tools, it is often
necessary to find program fragments where a given property holds on all
execution paths, while placing as few demands as possible on how the user
writes their code. The proposed dataflow analysis framework aims to satisfy
both of these needs simultaneously.
Example: Ensuring that std::optional is used safely

Consider implementing a static analysis tool that proves that every time a
std::optional is unwrapped, the program has ensured that the precondition
of the unwrap operation (the optional has a value) is satisfied.

void TestOK(std::optional<int> x) {

  if (x.has_value()) {

    use(x.value());

  }

}

void TestWithWarning(std::optional<int> x) {

  use(x.value()); // warning: unchecked access to optional value

}

If we implemented this check in CSA, it would likely find the bug in
TestWithWarning, and would not report anything in TestOK. However, CSA did
not prove the desired property (all optional unwraps are checked on all
paths) for TestOK; CSA just could not find a path through TestOK where this
property is violated.

A simple function like TestOK only has two paths and would be completely
explored by the CSA. However, when loops are involved, CSA uses heuristics
to decide when to stop exploring paths in a function. As a result, CSA
would miss this bug:

void MissedBugByCSA(std::optional<int> x) {

  for (int i = 0; i < 10; i++) {

    if (i < 5) {

      use(i);

    } else {

      use(x.value()); // CSA does not report a warning

    }

  }

}

If we implemented this check with AST matchers (as typically done in
ClangTidy), we would have to enumerate way too many syntactic patterns.
Consider just a couple coding variations that users expect this analysis to
understand to be safe:

void TestOK(std::optional<int> x) {

  if (x) {

    use(x.value());

  }

}

void TestOK(std::optional<int> x) {

  if (x.has_value() == true) {

    use(x.value());

  }

}

void TestOK(std::optional<int> x) {

  if (!x.has_value()) {

    use(0);

  } else {

    use(x.value());

  }

}

void TestOK(std::optional<int> x) {

  if (!x.has_value())

    return;

  use(x.value());

}

To summarize, CSA can find some optional-unwrapping bugs, but it does not
prove that the code performs a check on all paths. AST matchers can prove
it for a finite set of hard-coded syntactic cases. But, when the set of
safe patterns is infinite, as is often the case, AST matchers can only
capture a subset. Capturing the full set requires building a framework like
the one presented in this document.
Comparison

AST Matchers Library

Clang Static Analyzer

Dataflow Analysis Framework

Proves that a property holds on all execution paths

Possible, depends on the matcher

No

Yes

Checks can be written in terms of program semantics

No

Yes

Yes
Further Examples

While we have started with a std::optional checker, we have a number of
further use cases where this framework can help. Note that some of these
are already implemented in complex, specialized checks, either directly in
the compiler or as clang-tidy checks. A key goal of our framework is to
simplify the implementation of such checks in the future.

   -

   “the pointer may be null when dereferenced”, “the pointer is always null
   when dereferenced”,
   -

   "unnecessary null check on a pointer passed to delete/free",
   -

   “this raw pointer variable could be refactored into a std::unique_ptr”,
   -

   “expression always evaluates to a constant”, “if condition always
   evaluates to true/false”,
   -

   “basic block is never executed”, “loop is executed at most once”,
   -

   “assigned value is always overwritten”,
   -

   “value was used after having been moved”,
   -

   “value guarded by a lock was used without taking the lock”,
   -

   “unnecessary existence check before inserting a value into a std::map”.

Some of these use cases are covered by existing ClangTidy checks that are
implemented with pattern matching. We believe there is an opportunity to
simplify their implementation by refactoring them to use the dataflow
framework, and simultaneously improve the coverage and precision of how
they model C++ semantics.
Overview

The initial version of our framework provides the following key features:

   -

   a forward dataflow algorithm, parameterized by a user-defined value
   domain and a transfer function, which describes how program statements
   modify domain values.
   -

   a built-in model of local variables and the heap. This model provides
   two benefits to users: it tracks the flow of values, accounting for a wide
   variety of C++ language constructs, in a way that is reusable across many
   different analyses; it discriminates value approximations based on path
   conditions, allowing for higher precision than dataflow alone.
   -

   an interface to an SMT solver to verify safety of operations like
   dereferencing a pointer,
   -

   test infrastructure for validating results of the dataflow algorithm.

Dataflow Algorithm

Our framework solves forward dataflow equations over a product of a
user-defined value domain and a built-in model of local variables and the
heap, called the environment. The user implements the transfer function for
their value domain, while the framework implements the transfer function
that operates on the environment, only requiring a user-defined widening
operator to avoid infinite growth of the environment model. We intend to
expand the framework to also support backwards dataflow equations at some
point in the future. The limitation to the forward direction is not
essential.

In more detail, the algorithm is parameterized by:

   -

   a user-defined, finite-height partial order, with a join function and
   maximal element,
   -

   an initial element from the user-defined lattice,
   -

   a (monotonic) transfer function for statements, which maps between
   lattice elements,
   -

   comparison and merge functions for environment values.

We expand on the first three elements here and return to the comparison and
merge functions in the next section.
The Lattice and Analysis Abstractions

We start with a representation of a lattice:

class BoundedJoinSemiLattice {

 public:

  friend bool operator==(const BoundedJoinSemiLattice& lhs,

                         const BoundedJoinSemiLattice& rhs);

  friend bool operator<=(const BoundedJoinSemiLattice& lhs,

                         const BoundedJoinSemiLattice& rhs);

  static BoundedJoinSemiLattice Top();

  enum class JoinEffect {

    Changed,

    Unchanged,

  };

  JoinEffect Join(const BoundedJoinSemiLattice& element);

};

Not all of these declarations are necessary. Specifically,  operator== and
operator<= can be implemented in terms of Join and Top isn’t used by the
dataflow algorithm. We include the operators because direct implementations
are likely more efficient, and Top because it provides evidence of a
well-founded bounded join semi-lattice.

The Join operation is unusual in its choice of mutating the object instance
and returning an indication of whether the join resulted in any changes. A
key step of the fixpoint algorithm is code like j = Join(x, y); if (j == x)
…. This formulation of Join allows us to express this code instead as if
(x.Join(y) == Unchanged).... For lattices whose instances are expensive to
create or compare, we expect this formulation to improve the performance of
the dataflow analysis.

Based on the lattice abstraction, our analysis is similarly traditional:

class DataflowAnalysis {

 public:

  using BoundedJoinSemiLattice = …;

  BoundedJoinSemiLattice InitialElement();

  BoundedJoinSemiLattice TransferStmt(

      const clang::Stmt* stmt, const BoundedJoinSemiLattice& element,

      clang::dataflow::Environment& environment);

};

Environment: A Built-in Lattice

All user-defined operations have access to an environment, which
encapsulates the program context of whatever program element is being
considered. It contains:

   -

   a path condition,
   -

   a storage model (which models both local variables and the heap).

The built-in transfer functions model the effects of C++ language
constructs (for example, variable declarations, assignments, pointer
dereferences, arithmetic etc.) by manipulating the environment, saving the
user from doing so in custom transfer functions.

The path condition accumulates the set of boolean conditions that are known
to be true on every path from function entry to the current program point.

The environment maintains the maps from program declarations and pointer
values to storage locations. It also maintains the map of storage locations
to abstract values. Storage locations can be atomic (“scalar”) or aggregate
multiple sublocations:

   -

   ScalarStorageLocation is a storage location that is not subdivided
   further for the purposes of abstract interpretation, for example bool,
   int*.
   -

   AggregateStorageLocation is a storage location which can be subdivided
   into smaller storage locations that can be independently tracked by
   abstract interpretation. For example, a struct with public members.

In addition to this storage structure, our model tracks three kinds of
values: basic values (like integers and booleans), pointers (which reify
storage location into the value space) and records with named fields, where
each field is itself a value. For aggregate values, we ensure coherence
between the structure represented in storage locations and that represented
in the values: for a given aggregate storage location Lagg with child f:

   -

   valueAt(Lagg) is a record, and
   -

   valueAt(child(Lagg, f)) = child(valueAt(Lagg), f),

where valueAt maps storage locations to values.

In our first iteration, the only basic values that we will track are
boolean values. Additional values and variables may be relevant, but only
in so far as they constitute part of a boolean expression that we care
about. For example, if (x > 5) { … } will result in our tracking x > 5, but
not any particular knowledge about the integer-valued variable x.

The path conditions and the model refer to the same sets of values and
locations.
Accounting for infinite height

Despite the simplicity of our storage model, its elements can grow
infinitely large: for example, path conditions on loop back edges, records
when modeling linked lists, and even just the number of locations for a
simple for-loop.  Therefore, we provide the user with two opportunities to
interpret the abstract values so as to bound it: comparison of abstract
values and a merge operation at control-flow joins.

  // Compares values from two different environments for semantic
equivalence.

  bool Compare(

      clang::dataflow::Value* value1, clang::dataflow::Environment& env1,

      clang::dataflow::Value* value2, clang::dataflow::Environment& env2);

  // Given `value1` (w.r.t. `env1`) and `value2` (w.r.t. `env2`), returns a

  // `Value` (w.r.t. `merged_env`) that approximates `value1` and `value2`.
This

  // could be a strict lattice join, or a more general widening operation.

  clang::dataflow::Value* Merge(

      clang::dataflow::Value* value1, clang::dataflow::Environment& env1,

      clang::dataflow::Value* value2, clang::dataflow::Environment& env2,

      clang::dataflow::Environment& merged_env);
Verification by SAT Solver

The path conditions are critical in refining our approximation of the local
variables and heap. At an abstract level, they don’t need to directly
influence the execution of the dataflow analysis. Instead, we can imagine
that, once concluded, our results are annotated with/incorporate path
conditions, which in turn allows us to make more precise conclusions about
our program from the analysis results.

For example, if the analysis concludes that a variable conditionally holds
a non-null pointer, and, for a given program point, the path condition
implies said condition, we can conclude that a dereference of the pointer
at that point is safe. In this sense, the verification of the safety of a
deference is separate from the analysis, which models memory states with
formulae. That said, invocation of the solver need not wait until after the
analysis has concluded. It can also be used during the analysis, for
example, in defining the widening operation in Merge.

With that, we can see the role of an SAT solver: to solve questions of the
form “Does the path condition satisfy the pre-condition computed for this
pointer variable at this program point”? To that end, we include an API for
asking such questions of a SAT solver. The API is compatible with Z3 and we
also include our own (simple) SAT solver, for users who may not want to
integrate with Z3.
Testing dataflow analysis

We provide test infrastructure that allows users to easily write tests that
make assertions about a program state computed by dataflow analysis at any
code point. Code is annotated with labels on points of interest and then
test expectations are written with respect to the labeled points.

For example, consider an analysis that computes the state of optional
values.

ExpectDataflowAnalysis<OptionalChecker>(R"(

void f(std::optional<int> opt) {

  if (opt.has_value()) {

    // [[true-branch]]

  } else {

    // [[false-branch]]

  }

})", [](const auto& results) {

  EXPECT_LE(results.ValueAtCodePoint("true-branch"), OptionalLattice::
Engaged());

  EXPECT_LE(results.ValueAtCodePoint("false-branch"), OptionalLattice::
NullOpt());

});

Timeline

We have a fully working implementation of the framework as proposed here.
We also are interested in contributing a clang-tidy checker for
std::optional and related types as soon as possible, given its demonstrated
efficacy and interest from third parties.  To that end, as soon as we have
consensus for starting this work in the clang repository, we plan to start
sending patches for review. Given that the framework is relatively small,
we expect it will fit within 5 (largish) patches, meaning we can hope to
have it available by December 2021.
Future Work

We consider our first version of the framework as experimental. It suits
the needs of a number of particular checks that we’ve been writing, but it
has some obvious limitations with respect to general use.  After we commit
our first version, here are some planned areas of improvement and
investigation.

Backwards analyses. The framework only solves forward dataflow equations.
This clearly prevents implementing some very familiar and useful analyses,
like variable liveness. We intend to expand the framework to also support
backwards dataflow equations. There’s nothing inherent to our framework
that limits it to forwards.

Reasoning about non-boolean values. Currently, we can symbolically model
equality of booleans, but nothing else. Yet, for locations on the store
that haven’t changed, we can understand which values are identical (because
the input value object is passed by the transfer function to the output).
For example,

void f(bool *a, bool *b) {

    std::optional<int> opt;

    if (a == b) { opt = 42; }

    if (a == b) { opt.value(); }

}

Since neither a nor b have changed, we can reason that the two occurrences
of the expression a == b evaluate to the same value (even though we don't
know the exact value). We intend to improve the framework to support this
increased precision, given that it will not cost us any additional
complexity.

Improved model of program state. A major obstacle to writing high-quality
checks for C++ code is the complexity of C++’s semantics.  In writing our
framework, we’ve come to see the value of shielding the user from the need
to handle these directly by modeling program state and letting the user
express their check in terms of that model, after it has been solved for
each program point. As is, our model currently supports a number of checks,
including the std::optional check and another that infers which pointer
parameters are out parameters. We intend to expand the set of such checks,
with improvements to the model and the API for expressing domain-specific
semantics relevant to a check. Additionally, we are considering explicitly
tracking other abstract program properties, like reads and writes to local
variables and memory, so that users need not account for all the syntactic
variants of these operations.

Refactoring existing checks. Once the framework is more mature and stable,
we would like to refactor existing checks that use custom, ad-hoc
implementations of dataflow to use this framework.
Conclusion

We have proposed an addition to the Clang repository of a new static
analysis framework targeting all-paths analyses. The intended clients of
such analyses are code-checking and transformation tools like clang-tidy,
where sound analyses can support sound transformations of code. We
discussed the key features and APIs of our framework and how they fit
together into a new analysis framework. Moreover, our proposal is based on
experience with a working implementation that supports a number of
practical, high-precision analyses, including one for safe use of types
like std::optional and another that finds missed opportunities for use of
std::move.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20211012/8a8627ed/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4000 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20211012/8a8627ed/attachment-0001.bin>