[cfe-dev] [RFC] A dataflow analysis framework for Clang AST

Gábor Horváth via cfe-dev <cfe-dev at lists.llvm.org>
Tue Oct 12 12:01:54 PDT 2021


Hi Yitzhak,

Yay, finally happening! I 100% support these efforts; feel free to add
me as a reviewer to the patches, design documents, etc.
I have some questions/comments inline.

On Tue, 12 Oct 2021 at 08:32, Yitzhak Mandelbaum <yitzhakm at google.com>
wrote:

> Summary
>
> We propose a new static analysis framework for implementing Clang bug-finding
> and refactoring tools (including ClangTidy checks) based on a dataflow
> algorithm.
>
> Some ClangTidy checks and core compiler warnings are already dataflow-based.
> However, for lack of a reusable framework, each of them reimplements common
> dataflow-analysis and CFG-traversal algorithms. In addition to supporting
> new checks, the framework should allow us to clean up those parts of
> Clang and ClangTidy.
>

We do have some helpers to make writing ad-hoc dataflow analyses somewhat
easier, e.g., see DataflowWorklist.h:
<https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Analysis/FlowSensitive/DataflowWorklist.h>
But I do agree there is a lot of room for improvement, and having a
full-fledged framework could be a great way forward.

There is a nice summary of some of the ad-hoc solutions we have in Clang in
this survey:
<https://lists.llvm.org/pipermail/cfe-dev/2020-October/066937.html>
Do you plan to replace the ad-hoc solutions in the long run?


>
> We intend that the new framework should have a learning curve similar to
> AST matchers and not require prior expertise in static analysis. Our goal
> is a simple API that isolates the user from the dataflow algorithm itself.
> The initial version of the framework combines support for classic dataflow
> analyses over the Clang CFG with symbolic evaluation of the code and a
> simple model of local variables and the heap. A number of useful checks can
> be expressed solely in terms of this model, saving their implementers from
> having to reimplement modeling of the semantics of C++ AST nodes.
> Ultimately, we want to extend this convenience to a wide variety of checks,
> so that they can be written in terms of the abstract semantics of the
> program (as modeled by the framework), while only matching and modeling AST
> nodes for key parts of their domain (e.g., custom data types or functions).
>

Just to reiterate, you want to abstract away both the dataflow algorithm
and the semantics of C++. E.g., the user might not need to handle the
myriad ways someone can assign a value to a variable (overloaded operator=,
output argument, primitive assignment, etc.), only a single abstract
concept of assignment.
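For instance (an editorial illustration, not from the proposal), all of the
following end up storing 42 into an int, yet each is a different AST shape
that a purely syntactic check would have to enumerate separately:

struct Wrapper {
  int v = 0;
  Wrapper &operator=(int x) {  // user-defined operator=
    v = x;
    return *this;
  }
};

void writeTo(int &out) { out = 42; }  // assignment through an output argument

void examples() {
  int a;
  a = 42;        // primitive assignment
  int b(42);     // direct initialization
  int *p = &a;
  *p = 42;       // assignment through a pointer
  writeTo(a);    // out-parameter
  Wrapper w;
  w = 42;        // overloaded operator=
}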

> Motivation
>
> Clang has a few subsystems that greatly help with implementing bug-finding
> and refactoring tools:
>
>    - AST matchers allow one to easily find program fragments whose AST
>      matches a given pattern. This works well for finding program fragments
>      whose desired semantics translates to a small number of possible syntactic
>      expressions (e.g., "find a call to strlen() where the argument is a
>      null pointer constant"). Note that patterns expressed by AST matchers often
>      check program properties that hold for all execution paths.
>
>    - Clang Static Analyzer can find program paths that end in a state that
>      satisfies a specific user-defined property, regardless of how the code was
>      expressed syntactically (e.g., "find a call to strlen() where the
>      argument is dynamically equal to null"). CSA abstracts from the syntax of
>      the code, letting analyses focus on its semantics. While not completely
>      eliding syntax, symbolic evaluation abstracts over many syntactic details.
>      Semantically equivalent program fragments result in equivalent program
>      states computed by the CSA.
>
>
> When implementing certain bug-finding and refactoring tools, it is often
> necessary to find program fragments where a given property holds on all
> execution paths, while placing as few demands as possible on how the user
> writes their code. The proposed dataflow analysis framework aims to satisfy
> both of these needs simultaneously.
>
> Example: Ensuring that std::optional is used safely
>
> Consider implementing a static analysis tool that proves that every time a
> std::optional is unwrapped, the program has ensured that the precondition
> of the unwrap operation (the optional has a value) is satisfied.
>
> void TestOK(std::optional<int> x) {
>   if (x.has_value()) {
>     use(x.value());
>   }
> }
>
> void TestWithWarning(std::optional<int> x) {
>   use(x.value()); // warning: unchecked access to optional value
> }
>
> If we implemented this check in CSA, it would likely find the bug in
> TestWithWarning, and would not report anything in TestOK. However, CSA
> did not prove the desired property (all optional unwraps are checked on all
> paths) for TestOK; CSA just could not find a path through TestOK where
> this property is violated.
>
> A simple function like TestOK only has two paths and would be completely
> explored by the CSA. However, when loops are involved, CSA uses heuristics
> to decide when to stop exploring paths in a function. As a result, CSA
> would miss this bug:
>
> void MissedBugByCSA(std::optional<int> x) {
>   for (int i = 0; i < 10; i++) {
>     if (i < 5) {
>       use(i);
>     } else {
>       use(x.value()); // CSA does not report a warning
>     }
>   }
> }
>
> If we implemented this check with AST matchers (as typically done in
> ClangTidy), we would have to enumerate way too many syntactic patterns.
> Consider just a few of the coding variations that users would expect this
> analysis to recognize as safe:
>
> void TestOK(std::optional<int> x) {
>   if (x) {
>     use(x.value());
>   }
> }
>
> void TestOK(std::optional<int> x) {
>   if (x.has_value() == true) {
>     use(x.value());
>   }
> }
>
> void TestOK(std::optional<int> x) {
>   if (!x.has_value()) {
>     use(0);
>   } else {
>     use(x.value());
>   }
> }
>
> void TestOK(std::optional<int> x) {
>   if (!x.has_value())
>     return;
>   use(x.value());
> }
>
> To summarize, CSA can find some optional-unwrapping bugs, but it does not
> prove that the code performs a check on all paths. AST matchers can prove
> it for a finite set of hard-coded syntactic cases. But, when the set of
> safe patterns is infinite, as is often the case, AST matchers can only
> capture a subset. Capturing the full set requires building a framework like
> the one presented in this document.
>
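> As an illustration of the enumeration burden (an editorial sketch, not part
> of the framework), an AST matcher that recognizes only the original
> if (x.has_value()) form of TestOK might look like the following; each of the
> other variants would need its own, increasingly elaborate pattern, and even
> this one does not check that both calls are made on the same optional object:
>
> #include "clang/ASTMatchers/ASTMatchers.h"
> using namespace clang::ast_matchers;
>
> auto SafeUseMatcher =
>     ifStmt(hasCondition(cxxMemberCallExpr(
>                callee(cxxMethodDecl(hasName("has_value"))))),
>            hasThen(hasDescendant(cxxMemberCallExpr(
>                callee(cxxMethodDecl(hasName("value")))))));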

While most of the checks in CSA use symbolic execution, there are also some
purely syntactic ones. One dilemma when implementing a new check is where
to put it (CSA? Tidy? Warnings?). While it would be great to get rid of
this dilemma, I think we want to keep all of the options open. I just
wanted to make explicit what kinds of layering constraints we have: it
should be possible to add CSA checks/features that use the new dataflow
framework.

> Comparison
>
>   Proves that a property holds on all execution paths:
>     AST Matchers Library:        possible, depends on the matcher
>     Clang Static Analyzer:       no
>     Dataflow Analysis Framework: yes
>
>   Checks can be written in terms of program semantics:
>     AST Matchers Library:        no
>     Clang Static Analyzer:       yes
>     Dataflow Analysis Framework: yes
>

Another example where the forward symbolic execution of CSA falls short is
backwards reasoning, e.g.:

void f(int *p)
{
  *p = 5;
  if (p) // Should warn: redundant null check or null dereference above.
  {
    ...
  }
}

CSA will not warn on the snippet above because it assumes p is non-null at
the first dereference (to avoid noise), and it is really hard to revise old
assumptions in light of new evidence. Such problems can be found using
dataflow analysis.

> Further Examples
>
> While we have started with a std::optional checker, we have a number of
> further use cases where this framework can help. Note that some of these
> are already implemented in complex, specialized checks, either directly in
> the compiler or as clang-tidy checks. A key goal of our framework is to
> simplify the implementation of such checks in the future.
>
>    - “the pointer may be null when dereferenced”, “the pointer is always
>      null when dereferenced”,
>    - "unnecessary null check on a pointer passed to delete/free",
>    - “this raw pointer variable could be refactored into a std::unique_ptr”,
>    - “expression always evaluates to a constant”, “if condition always
>      evaluates to true/false”,
>    - “basic block is never executed”, “loop is executed at most once”,
>    - “assigned value is always overwritten”,
>    - “value was used after having been moved”,
>    - “value guarded by a lock was used without taking the lock”,
>    - “unnecessary existence check before inserting a value into a std::map”.
>
>
> Some of these use cases are covered by existing ClangTidy checks that are
> implemented with pattern matching. We believe there is an opportunity to
> simplify their implementation by refactoring them to use the dataflow
> framework, and simultaneously improve the coverage and precision of how
> they model C++ semantics.
>
> Overview
>
> The initial version of our framework provides the following key features:
>
>    - a forward dataflow algorithm, parameterized by a user-defined value
>      domain and a transfer function, which describes how program statements
>      modify domain values.
>    - a built-in model of local variables and the heap. This model provides
>      two benefits to users: it tracks the flow of values, accounting for a wide
>      variety of C++ language constructs, in a way that is reusable across many
>      different analyses; it discriminates value approximations based on path
>      conditions, allowing for higher precision than dataflow alone.
>    - an interface to an SMT solver to verify safety of operations like
>      dereferencing a pointer,
>    - test infrastructure for validating results of the dataflow algorithm.
>
> Dataflow Algorithm
>
> Our framework solves forward dataflow equations over a product of a
> user-defined value domain and a built-in model of local variables and the
> heap, called the environment. The user implements the transfer function
> for their value domain, while the framework implements the transfer
> function that operates on the environment, only requiring a user-defined
> widening operator to avoid infinite growth of the environment model. We
> intend to expand the framework to also support backwards dataflow equations
> at some point in the future. The limitation to the forward direction is not
> essential.
>
> In more detail, the algorithm is parameterized by:
>
>
>    - a user-defined, finite-height partial order, with a join function and
>      maximal element,
>    - an initial element from the user-defined lattice,
>    - a (monotonic) transfer function for statements, which maps between
>      lattice elements,
>    - comparison and merge functions for environment values.
>
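> For exposition, here is a rough, editorial sketch of the classic forward
> worklist iteration that these parameters plug into. The CFG type and all
> names are hypothetical simplifications, and the environment is omitted; this
> is not the framework's actual implementation:
>
> #include <deque>
> #include <vector>
>
> namespace clang { class Stmt; }
>
> struct SimpleCfg {
>   // Hypothetical, simplified CFG: blocks are indices 0..N-1, block 0 is entry.
>   std::vector<std::vector<const clang::Stmt *>> BlockStmts;
>   std::vector<std::vector<unsigned>> Succs;
> };
>
> template <typename Analysis>
> std::vector<typename Analysis::BoundedJoinSemiLattice>
> RunForwardAnalysis(const SimpleCfg &G, Analysis &A) {
>   using Lattice = typename Analysis::BoundedJoinSemiLattice;
>
>   // State at the entry of each block; a default-constructed lattice is
>   // assumed to be the least element ("bottom").
>   std::vector<Lattice> EntryStates(G.BlockStmts.size());
>   EntryStates[0] = A.InitialElement();
>
>   // Start with every block on the worklist (classic Kildall-style iteration).
>   std::deque<unsigned> Worklist;
>   for (unsigned B = 0; B < G.BlockStmts.size(); ++B)
>     Worklist.push_back(B);
>
>   while (!Worklist.empty()) {
>     unsigned B = Worklist.front();
>     Worklist.pop_front();
>
>     // Run the user's transfer function over the statements of the block.
>     Lattice State = EntryStates[B];
>     for (const clang::Stmt *S : G.BlockStmts[B])
>       State = A.TransferStmt(S, State /*, environment omitted here */);
>
>     // Join into each successor; re-enqueue only those whose entry state
>     // changed, so the loop terminates once a fixpoint is reached (finite
>     // lattice height plus monotonic transfer functions guarantee this).
>     for (unsigned Succ : G.Succs[B])
>       if (EntryStates[Succ].Join(State) == Lattice::JoinEffect::Changed)
>         Worklist.push_back(Succ);
>   }
>   return EntryStates;
> }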
>
What kind of control will the user have over the environment? E.g. could we
add checks/extensions that model the effects of certain functions (e.g.
std::swap) in the environment?


>
> We expand on the first three elements here and return to the comparison
> and merge functions in the next section.
>
> The Lattice and Analysis Abstractions
>
> We start with a representation of a lattice:
>
> class BoundedJoinSemiLattice {
>  public:
>   friend bool operator==(const BoundedJoinSemiLattice& lhs,
>                          const BoundedJoinSemiLattice& rhs);
>   friend bool operator<=(const BoundedJoinSemiLattice& lhs,
>                          const BoundedJoinSemiLattice& rhs);
>
>   static BoundedJoinSemiLattice Top();
>
>   enum class JoinEffect {
>     Changed,
>     Unchanged,
>   };
>
>   JoinEffect Join(const BoundedJoinSemiLattice& element);
> };
>
> Not all of these declarations are necessary. Specifically, operator== and
> operator<= can be implemented in terms of Join, and Top isn’t used by the
> dataflow algorithm. We include the operators because direct implementations
> are likely more efficient, and Top because it provides evidence of a
> well-founded bounded join semi-lattice.
>
> The Join operation is unusual in its choice of mutating the object
> instance and returning an indication of whether the join resulted in any
> changes. A key step of the fixpoint algorithm is code like j = Join(x,
> y); if (j == x) …. This formulation of Join allows us to express this
> code instead as if (x.Join(y) == Unchanged).... For lattices whose
> instances are expensive to create or compare, we expect this formulation to
> improve the performance of the dataflow analysis.
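>
> To make the interface concrete, here is a purely illustrative, hypothetical
> lattice for the std::optional example (an editorial sketch, not code from
> the framework; the names mirror the OptionalLattice used in the testing
> example later in this proposal). It is a flat lattice recording whether a
> particular optional is known to be empty, known to hold a value, or unknown:
>
> class OptionalLattice {
>  public:
>   enum class JoinEffect { Changed, Unchanged };
>
>   // Bottom = no information yet; Unknown = could be either (the Top element).
>   enum class State { Bottom, NullOpt, Engaged, Unknown };
>
>   OptionalLattice() = default;  // Bottom.
>
>   static OptionalLattice Top() { return OptionalLattice(State::Unknown); }
>   static OptionalLattice NullOpt() { return OptionalLattice(State::NullOpt); }
>   static OptionalLattice Engaged() { return OptionalLattice(State::Engaged); }
>
>   friend bool operator==(OptionalLattice L, OptionalLattice R) {
>     return L.S == R.S;
>   }
>   friend bool operator<=(OptionalLattice L, OptionalLattice R) {
>     return L.S == R.S || L.S == State::Bottom || R.S == State::Unknown;
>   }
>
>   // Mutating join: widen toward Unknown only when the two sides disagree,
>   // and report whether anything changed so the fixpoint loop can stop early.
>   JoinEffect Join(const OptionalLattice &Other) {
>     if (Other.S == State::Bottom || S == Other.S || S == State::Unknown)
>       return JoinEffect::Unchanged;
>     S = (S == State::Bottom) ? Other.S : State::Unknown;
>     return JoinEffect::Changed;
>   }
>
>  private:
>   explicit OptionalLattice(State St) : S(St) {}
>   State S = State::Bottom;
> };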
>
> Based on the lattice abstraction, our analysis is similarly traditional:
>
> class DataflowAnalysis {
>  public:
>   using BoundedJoinSemiLattice = …;
>
>   BoundedJoinSemiLattice InitialElement();
>
>   BoundedJoinSemiLattice TransferStmt(
>       const clang::Stmt* stmt, const BoundedJoinSemiLattice& element,
>       clang::dataflow::Environment& environment);
> };
>
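> Continuing the editorial sketch from above, a hypothetical user analysis
> could instantiate this interface for a single std::optional variable. Only
> emplace() and reset() are modeled, purely to show the shape of a
> user-supplied transfer function; the real check would lean on the
> framework's built-in environment rather than hand-written AST inspection:
>
> #include "clang/AST/Decl.h"
> #include "clang/AST/DeclCXX.h"
> #include "clang/AST/ExprCXX.h"
> #include "clang/Basic/IdentifierTable.h"
> #include "llvm/Support/Casting.h"
>
> namespace clang { namespace dataflow { class Environment; } }
>
> class OptionalAnalysis {
>  public:
>   using BoundedJoinSemiLattice = OptionalLattice;
>
>   explicit OptionalAnalysis(const clang::VarDecl *Opt) : Opt(Opt) {}
>
>   // The tracked optional is a parameter, so nothing is known on entry.
>   OptionalLattice InitialElement() { return OptionalLattice::Top(); }
>
>   OptionalLattice TransferStmt(const clang::Stmt *S, const OptionalLattice &In,
>                                clang::dataflow::Environment &Env) {
>     OptionalLattice Out = In;
>     const auto *Call = llvm::dyn_cast<clang::CXXMemberCallExpr>(S);
>     if (!Call)
>       return Out;
>     // Is this a member call on the tracked variable?
>     const auto *Obj = llvm::dyn_cast<clang::DeclRefExpr>(
>         Call->getImplicitObjectArgument()->IgnoreParenImpCasts());
>     if (!Obj || Obj->getDecl() != Opt)
>       return Out;
>     if (const clang::CXXMethodDecl *MD = Call->getMethodDecl())
>       if (const clang::IdentifierInfo *II = MD->getIdentifier()) {
>         if (II->isStr("emplace"))
>           Out = OptionalLattice::Engaged();  // opt.emplace(...) engages it.
>         else if (II->isStr("reset"))
>           Out = OptionalLattice::NullOpt();  // opt.reset() empties it.
>       }
>     return Out;
>   }
>
>  private:
>   const clang::VarDecl *Opt;
> };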
>
Do you plan to provide building blocks to help users build lattices? E.g.,
I really like SPARTA <https://github.com/facebook/SPARTA>. They added some
helpers to encode finite abstract domains, see FiniteAbstractDomain.h
<https://github.com/facebook/SPARTA/blob/main/include/FiniteAbstractDomain.h>
for details.


> Environment: A Built-in Lattice
>
> All user-defined operations have access to an environment, which
> encapsulates the program context of whatever program element is being
> considered. It contains:
>
>    - a path condition,
>    - a storage model (which models both local variables and the heap).
>
>
> The built-in transfer functions model the effects of C++ language
> constructs (for example, variable declarations, assignments, pointer
> dereferences, arithmetic etc.) by manipulating the environment, saving the
> user from doing so in custom transfer functions.
>
> The path condition accumulates the set of boolean conditions that are
> known to be true on every path from function entry to the current program
> point.
>
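> As a small editorial illustration of what gets accumulated:
>
> void g(bool a, bool b) {
>   // entry: the path condition is simply "true"
>   if (a) {
>     // path condition: a
>     if (!b)
>       return;
>     // path condition: a && b
>   }
>   // after the if, the branches have rejoined: neither a nor b is known
>   // to be true on every path reaching this point
> }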

I am a bit opposed to using the term path-condition, as users might confuse
this with similar terms used in path-sensitive analyses. What about
flow-condition?


>
> The environment maintains the maps from program declarations and pointer
> values to storage locations. It also maintains the map of storage locations
> to abstract values. Storage locations can be atomic (“scalar”) or can
> aggregate multiple sublocations:
>
>    - ScalarStorageLocation is a storage location that is not subdivided
>      further for the purposes of abstract interpretation, for example bool,
>      int*.
>    - AggregateStorageLocation is a storage location which can be subdivided
>      into smaller storage locations that can be independently tracked by
>      abstract interpretation. For example, a struct with public members.
>
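> For example (editorial illustration), the declarations below would map onto
> these two kinds of locations roughly as the comments indicate:
>
> struct Point {
>   int x;
>   int y;
> };
>
> void h() {
>   bool flag = true;    // ScalarStorageLocation (not subdivided further)
>   int *ptr = nullptr;  // ScalarStorageLocation holding a pointer value
>   Point p = {1, 2};    // AggregateStorageLocation with two scalar
>                        // sublocations, one for p.x and one for p.y
> }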
>
Could you elaborate on the storage model? How do you plan to deal with
aliasing? How do you represent arrays (with potentially unknown lengths)?


>
> In addition to this storage structure, our model tracks three kinds of
> values: basic values (like integers and booleans), pointers (which reify
> storage location into the value space) and records with named fields, where
> each field is itself a value. For aggregate values, we ensure coherence
> between the structure represented in storage locations and that represented
> in the values: for a given aggregate storage location Lagg with child f:
>
>    - valueAt(Lagg) is a record, and
>    - valueAt(child(Lagg, f)) = child(valueAt(Lagg), f),
>
> where valueAt maps storage locations to values.
>
> In our first iteration, the only basic values that we will track are
> boolean values. Additional values and variables may be relevant, but only
> in so far as they constitute part of a boolean expression that we care
> about. For example, if (x > 5) { … } will result in our tracking x > 5,
> but not any particular knowledge about the integer-valued variable x.
>
> The path conditions and the model refer to the same sets of values and
> locations.
>
> Accounting for infinite height
>
> Despite the simplicity of our storage model, its elements can grow
> infinitely large: for example, path conditions on loop back edges, records
> when modeling linked lists, and even just the number of locations for a
> simple for-loop. Therefore, we provide the user with two opportunities to
> interpret the abstract values so as to bound them: comparison of abstract
> values and a merge operation at control-flow joins.
>
>   // Compares values from two different environments for semantic equivalence.
>   bool Compare(
>       clang::dataflow::Value* value1, clang::dataflow::Environment& env1,
>       clang::dataflow::Value* value2, clang::dataflow::Environment& env2);
>
>   // Given `value1` (w.r.t. `env1`) and `value2` (w.r.t. `env2`), returns a
>   // `Value` (w.r.t. `merged_env`) that approximates `value1` and `value2`.
>   // This could be a strict lattice join, or a more general widening operation.
>   clang::dataflow::Value* Merge(
>       clang::dataflow::Value* value1, clang::dataflow::Environment& env1,
>       clang::dataflow::Value* value2, clang::dataflow::Environment& env2,
>       clang::dataflow::Environment& merged_env);
>
> Verification by SAT Solver
>
> The path conditions are critical in refining our approximation of the
> local variables and heap. At an abstract level, they don’t need to directly
> influence the execution of the dataflow analysis. Instead, we can imagine
> that, once concluded, our results are annotated with/incorporate path
> conditions, which in turn allows us to make more precise conclusions about
> our program from the analysis results.
>
> For example, if the analysis concludes that a variable conditionally holds
> a non-null pointer, and, for a given program point, the path condition
> implies said condition, we can conclude that a dereference of the pointer
> at that point is safe. In this sense, the verification of the safety of a
> dereference is separate from the analysis, which models memory states with
> formulae. That said, invocation of the solver need not wait until after the
> analysis has concluded. It can also be used during the analysis, for
> example, in defining the widening operation in Merge.
>
> With that, we can see the role of a SAT solver: to answer questions of the
> form “Does the path condition imply the precondition computed for this
> pointer variable at this program point?” To that end, we include an API for
> asking such questions of a SAT solver. The API is compatible with Z3 and we
> also include our own (simple) SAT solver, for users who may not want to
> integrate with Z3.
>
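> As a worked example (editorial, reusing the TestOK function from the
> beginning of the proposal), suppose the result of the has_value() call is
> modeled as a symbolic boolean B:
>
> void TestOK(std::optional<int> x) {
>   if (x.has_value()) {  // the call's value is the symbolic boolean B
>     // path condition here: B
>     use(x.value());     // precondition of value(): B
>                         // solver query: is (B && !B) satisfiable?
>                         // it is not, so the access is reported as safe
>   }
> }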

What representation do you use for the path condition? Using syntactic
forms like `x > 5` might not be sufficient, as `x` might have different
values at different program points. E.g.,

void f(int x)
{
  assert(x > 0);
  ...
  x = -1;
  funcWithPreconditionXIsPositive(x); // Wants to execute the SAT query here.
}

> Testing dataflow analysis
>
> We provide test infrastructure that allows users to easily write tests
> that make assertions about a program state computed by dataflow analysis at
> any code point. Code is annotated with labels on points of interest and
> then test expectations are written with respect to the labeled points.
>
> For example, consider an analysis that computes the state of optional
> values.
>
> ExpectDataflowAnalysis<OptionalChecker>(R"(
> void f(std::optional<int> opt) {
>   if (opt.has_value()) {
>     // [[true-branch]]
>   } else {
>     // [[false-branch]]
>   }
> })", [](const auto& results) {
>   EXPECT_LE(results.ValueAtCodePoint("true-branch"), OptionalLattice::Engaged());
>   EXPECT_LE(results.ValueAtCodePoint("false-branch"), OptionalLattice::NullOpt());
> });
>
> Timeline
>
> We have a fully working implementation of the framework as proposed here.
> We also are interested in contributing a clang-tidy checker for
> std::optional and related types as soon as possible, given its
> demonstrated efficacy and interest from third parties.  To that end, as
> soon as we have consensus for starting this work in the clang repository,
> we plan to start sending patches for review. Given that the framework is
> relatively small, we expect it will fit within 5 (largish) patches, meaning
> we can hope to have it available by December 2021.
>
> Future Work
>
> We consider our first version of the framework as experimental. It suits
> the needs of a number of particular checks that we’ve been writing, but it
> has some obvious limitations with respect to general use.  After we commit
> our first version, here are some planned areas of improvement and
> investigation.
>
> Backwards analyses. The framework only solves forward dataflow equations.
> This clearly prevents implementing some very familiar and useful analyses,
> like variable liveness. We intend to expand the framework to also support
> backwards dataflow equations. There’s nothing inherent to our framework
> that limits it to forwards.
>
> Reasoning about non-boolean values. Currently, we can symbolically model
> equality of booleans, but nothing else. Yet, for locations in the store
> that haven’t changed, we can understand which values are identical (because
> the input value object is passed through by the transfer function to the
> output). For example,
>
> void f(bool *a, bool *b) {
>   std::optional<int> opt;
>   if (a == b) { opt = 42; }
>   if (a == b) { opt.value(); }
> }
>
> Since neither a nor b has changed, we can reason that the two
> occurrences of the expression a == b evaluate to the same value (even
> though we don't know the exact value). We intend to improve the framework
> to support this increased precision, given that it will not cost us any
> additional complexity.
>
> Improved model of program state. A major obstacle to writing high-quality
> checks for C++ code is the complexity of C++’s semantics.  In writing our
> framework, we’ve come to see the value of shielding the user from the need
> to handle these directly by modeling program state and letting the user
> express their check in terms of that model, after it has been solved for
> each program point. As is, our model currently supports a number of checks,
> including the std::optional check and another that infers which pointer
> parameters are out parameters. We intend to expand the set of such checks,
> with improvements to the model and the API for expressing domain-specific
> semantics relevant to a check. Additionally, we are considering explicitly
> tracking other abstract program properties, like reads and writes to local
> variables and memory, so that users need not account for all the syntactic
> variants of these operations.
>
> Refactoring existing checks. Once the framework is more mature and
> stable, we would like to refactor existing checks that use custom, ad-hoc
> implementations of dataflow to use this framework.
>

Is inter-procedural analysis out of scope (e.g., using summaries)?

> Conclusion
>
> We have proposed an addition to the Clang repository of a new static
> analysis framework targeting all-paths analyses. The intended clients of
> such analyses are code-checking and transformation tools like clang-tidy,
> where sound analyses can support sound transformations of code. We
> discussed the key features and APIs of our framework and how they fit
> together into a new analysis framework. Moreover, our proposal is based on
> experience with a working implementation that supports a number of
> practical, high-precision analyses, including one for safe use of types
> like std::optional and another that finds missed opportunities for use of
> std::move.
>
>
>

