[cfe-dev] [RFC] A dataflow analysis framework for Clang AST

Tue Oct 12 14:34:13 PDT 2021

Sounds like the mail didn't hit the mailing list due to size 
restriction. I'm quoting in plain text now to make sure it definitely 
hits the list in some form.

Like Gabor already said, It's amazing that this is happening. I 
completely agree with your goals and I think we've discussed a lot of 
this offline earlier. The amount of existing flow-sensitive warnings may 
be low but it's steadily growing, Valeriy's -Wcompletion-handler being 
one of the notable latest additions - a backwards analysis over finite 
state lattice with a sophisticated system of notes attached to the 
warning to explain what's happening (generalizing over the latter may 
possibly be a goal for your framework as well).

Aside from the ability to cover *the other 50% of warnings* compared to 
what the static analyzer already covers (may-problems vs. must-problems 
aka all-paths problems), another strong benefit of flow sensitive 
warnings is their performance which allows us to use them as compiler 
warnings (as opposed to separate tools like clang-tidy or static 
analyzer). This in turn delivers the warning to every user of the 
compiler which increases the user coverage dramatically. This 
consideration may even apply to may-problems in which the static 
analyzer already excels technically.

I'm also very curious about your SAT solver implementation; I have a 
recent hobby of plugging various solvers into the static analyzer (cf. 
https://reviews.llvm.org/D110125) because it obviously needs some - 
arguably as much as your endeavor.

On 10/12/21 8:32 AM, Yitzhak Mandelbaum wrote:
> Summary
>
> We propose a new static analysis framework for implementing Clang 
> bug-finding and refactoring tools (including ClangTidy checks) based 
> on a dataflow algorithm.
>
>
> Some ClangTidy checks and core compiler warnings are already 
> dataflowbased. However, due to a lack of a reusable framework, all 
> reimplement common dataflow-analysis and CFG-traversal algorithms. In 
> addition to supporting new checks, the framework should allow us to 
> clean up those parts of Clang and ClangTidy.
>
>
> We intend that the new framework should have a learning curve similar 
> to AST matchers and not require prior expertise in static analysis. 
> Our goal is a simple API that isolates the user from the dataflow 
> algorithm itself. The initial version of the framework combines 
> support for classic dataflow analysesover the Clang CFG with symbolic 
> evaluation of the code and a simple model of local variables and the 
> heap. A number of useful checks can be expressed solely in terms of 
> this model, saving their implementers from having to reimplement 
> modeling of the semantics of C++ AST nodes. Ultimately, we want to 
> extend this convenience to a wide variety of checks, so that they can 
> be written in terms of the abstract semantics of the program (as 
> modeled by the framework), while only matching and modeling AST nodes 
> for key parts of their domain (e.g., custom data types or functions).
>
>
>   Motivation
>
> Clang has a few subsystems that greatly help with implementing 
> bug-finding and refactoring tools:
>
>  *
>
>     AST matchersallow one to easily find program fragments whose AST
>     matches a given pattern. This works well for finding program
>     fragments whose desired semantics translates to a small number of
>     possible syntactic expressions (e.g., "find a call to
>     strlen()where the argument is a null pointer constant"). Note that
>     patterns expressed by AST matchers often check program properties
>     that hold for allexecution paths.
>
>  *
>
>     Clang Static Analyzercan find program paths that end in a state
>     that satisfies a specific user-defined property, regardless of how
>     the code was expressed syntactically (e.g., "find a call to
>     strlen()where the argument is dynamically equal to null"). CSA
>     abstracts from the syntax of the code, letting analyses focus on
>     its semantics. While not completely eliding syntax, symbolic
>     evaluation abstracts over many syntactic details. Semantically
>     equivalent program fragments result in equivalent program states
>     computed by the CSA.
>
>
> When implementing certain bug-finding and refactoring tools, it is 
> often necessary to find program fragments where a given property holds 
> on all execution paths, while placing as few demands as possible on 
> how the user writes their code. The proposed dataflow analysis 
> framework aims to satisfy both of these needs simultaneously.
>
>
>     Example: Ensuring that std::optionalis used safely
>
> Consider implementing a static analysis tool that proves that every 
> time a std::optionalis unwrapped, the program has ensured that the 
> precondition of the unwrap operation (the optional has a value) is 
> satisfied.
>
>
> voidTestOK(std::optional<int>x){
>
> if(x.has_value()){
>
> use(x.value());
>
> }
>
> }
>
>
> voidTestWithWarning(std::optional<int>x){
>
> use(x.value());// warning: unchecked access to optional value
>
> }
>
>
> If we implemented this check in CSA, it would likely find the bug in 
> TestWithWarning, and would not report anything in TestOK. However, CSA 
> did not prove the desired property (all optional unwraps are checked 
> on all paths) for TestOK; CSA just could not find a path through 
> TestOKwhere this property is violated.
>
>
> A simple function like TestOKonly has two paths and would be 
> completely explored by the CSA. However, when loops are involved, CSA 
> uses heuristics to decide when to stop exploring paths in a function. 
> As a result, CSA would miss this bug:
>
>
> voidMissedBugByCSA(std::optional<int>x){
>
> for(inti =0;i <10;i++){
>
> if(i <5){
>
> use(i);
>
> }else{
>
> use(x.value());// CSA does not report a warning
>
> }
>
> }
>
> }
>
>
> If we implemented this check with AST matchers (as typically done in 
> ClangTidy), we would have to enumerate way too many syntactic 
> patterns. Consider just a couple coding variations that users expect 
> this analysis to understand to be safe:
>
>
> voidTestOK(std::optional<int>x){
>
> if(x){
>
> use(x.value());
>
> }
>
> }
>
>
> voidTestOK(std::optional<int>x){
>
> if(x.has_value()==true){
>
> use(x.value());
>
> }
>
> }
>
>
> voidTestOK(std::optional<int>x){
>
> if(!x.has_value()){
>
> use(0);
>
> }else{
>
> use(x.value());
>
> }
>
> }
>
>
> voidTestOK(std::optional<int>x){
>
> if(!x.has_value())
>
> return;
>
> use(x.value());
>
> }
>
>
> To summarize, CSA can find someoptional-unwrapping bugs, but it does 
> not prove that the code performs a check on allpaths. AST matchers can 
> prove it for a finite set of hard-coded syntactic cases. But, when the 
> set of safe patterns is infinite, as is often the case, AST matchers 
> can only capture a subset. Capturing the full set requires building a 
> framework like the one presented in this document.
>
>
>     Comparison
>
>
> 	
>
> AST Matchers Library
>
> 	
>
> Clang Static Analyzer
>
> 	
>
> Dataflow Analysis Framework
>
> Proves that a property holds on all execution paths
>
> 	
>
> Possible, depends on the matcher
>
> 	
>
> No
>
> 	
>
> Yes
>
> Checks can be written in terms of program semantics
>
> 	
>
> No
>
> 	
>
> Yes
>
> 	
>
> Yes
>
>
>     Further Examples
>
> While we have started with a std::optionalchecker, we have a number of 
> further use cases where this framework can help. Note that some of 
> these are already implemented in complex, specialized checks, either 
> directly in the compiler or as clang-tidy checks. A key goal of our 
> framework is to simplify the implementation of such checks in the future.
>
>  *
>
>     “the pointer may be null when dereferenced”, “the pointer is
>     always null when dereferenced”,
>
>  *
>
>     "unnecessary null check on a pointer passed to delete/free",
>
>  *
>
>     “this raw pointer variable could be refactored into a
>     std::unique_ptr”,
>
>  *
>
>     “expression always evaluates to a constant”, “if condition always
>     evaluates to true/false”,
>
>  *
>
>     “basic block is never executed”, “loop is executed at most once”,
>
>  *
>
>     “assigned value is always overwritten”,
>
>  *
>
>     “value was used after having been moved”,
>
>  *
>
>     “value guarded by a lock was used without taking the lock”,
>
>  *
>
>     “unnecessary existence check before inserting a value into a
>     std::map”.
>
>
> Some of these use cases are covered by existing ClangTidy checks that 
> are implemented with pattern matching. We believe there is an 
> opportunity to simplify their implementation by refactoring them to 
> use the dataflow framework, and simultaneously improve the coverage 
> and precision of how they model C++ semantics.
>
>
>   Overview
>
> The initial version of our framework provides the following key features:
>
>  *
>
>     a forward dataflow algorithm, parameterized by a user-defined
>     value domain and a transfer function, which describes how program
>     statements modify domain values.
>
>  *
>
>     a built-in model of local variables and the heap. This model
>     provides two benefits to users: it tracks the flow of values,
>     accounting for a wide variety of C++ language constructs, in a way
>     that is reusable across many different analyses; it discriminates
>     value approximations based on path conditions, allowing for higher
>     precision than dataflow alone.
>
>  *
>
>     an interface to an SMT solver to verify safety of operations like
>     dereferencing a pointer,
>
>  *
>
>     test infrastructure for validating results of the dataflow algorithm.
>
>
>     Dataflow Algorithm
>
> Our framework solves forward dataflow equations over a product of a 
> user-defined value domain and a built-in model of local variables and 
> the heap, called the environment. The user implements the transfer 
> function for their value domain, while the framework implements the 
> transfer function that operates on the environment, only requiring a 
> user-defined widening operator to avoid infinite growth of the 
> environment model. We intend to expand the framework to also support 
> backwards dataflow equations at some point in the future. The 
> limitation to the forward direction is not essential.
>
>
> In more detail, the algorithm is parameterized by:
>
>
>  *
>
>     a user-defined, finite-height partial order, with a join function
>     and maximal element,
>
>  *
>
>     an initial element from the user-defined lattice,
>
>  *
>
>     a (monotonic) transfer function for statements, which maps between
>     lattice elements,
>
>  *
>
>     comparison and merge functions for environment values.
>
>
> We expand on the first three elements here and return to the 
> comparison and merge functions in the next section.
>
>
>       The Lattice and Analysis Abstractions
>
> We start with a representation of a lattice:
>
> classBoundedJoinSemiLattice{
>
> public:
>
> friendbooloperator==(constBoundedJoinSemiLattice&lhs,
>
> constBoundedJoinSemiLattice&rhs);
>
> friendbooloperator<=(constBoundedJoinSemiLattice&lhs,
>
> constBoundedJoinSemiLattice&rhs);
>
> staticBoundedJoinSemiLatticeTop();
>
> enumclassJoinEffect{
>
> Changed,
>
> Unchanged,
>
> };
>
> JoinEffectJoin(constBoundedJoinSemiLattice&element);
>
> };
>
> Not all of these declarations are necessary. Specifically, 
> operator==and operator<=can be implemented in terms of Joinand 
> Topisn’t used by the dataflow algorithm. We include the operators 
> because direct implementations are likely more efficient, and 
> Topbecause it provides evidence of a well-founded bounded join 
> semi-lattice.
>
>
> The Joinoperation is unusual in its choice of mutating the object 
> instance and returning an indication of whether the join resulted in 
> any changes. A key step of the fixpoint algorithm is code like j = 
> Join(x, y); if (j == x) …. This formulation of Joinallows us to 
> express this code instead as if (x.Join(y) == Unchanged).... For 
> lattices whose instances are expensive to create or compare, we expect 
> this formulation to improve the performance of the dataflow analysis.
>
>
> Based on the lattice abstraction, our analysis is similarly traditional:
>
> classDataflowAnalysis{
>
> public:
>
> usingBoundedJoinSemiLattice=…;
>
> BoundedJoinSemiLatticeInitialElement();
>
> BoundedJoinSemiLatticeTransferStmt(
>
> constclang::Stmt*stmt,constBoundedJoinSemiLattice&element,
>
>       clang::dataflow::Environment&environment);
>
> };
>
>
>     Environment: A Built-in Lattice
>
> All user-defined operations have access to an environment, which 
> encapsulates the program context of whatever program element is being 
> considered. It contains:
>
>  *
>
>     a path condition,
>
>  *
>
>     a storage model (which models both local variables and the heap).
>
>
> The built-in transfer functions model the effects of C++ language 
> constructs (for example, variable declarations, assignments, pointer 
> dereferences, arithmetic etc.) by manipulating the environment, saving 
> the user from doing so in custom transfer functions.
>
>
> The path condition accumulates the set of boolean conditions that are 
> known to be true on every path from function entry to the current 
> program point.
>
>
> The environment maintains the maps from program declarations and 
> pointer values to storage locations. It also maintains the map of 
> storage locations to abstract values. Storage locations can be atomic 
> (“scalar”) or aggregate multiple sublocations:
>
>  *
>
>     ScalarStorageLocationis a storage location that is not subdivided
>     further for the purposes of abstract interpretation, for example
>     bool, int*.
>
>  *
>
>     AggregateStorageLocationis a storage location which can be
>     subdivided into smaller storage locations that can be
>     independently tracked by abstract interpretation. For example, a
>     struct with public members.
>
>
> In addition to this storage structure, our model tracks three kinds of 
> values: basic values (like integers and booleans), pointers (which 
> reify storage location into the value space) and records with named 
> fields, where each field is itself a value. For aggregate values, we 
> ensure coherence between the structure represented in storage 
> locations and that represented in the values: for a given aggregate 
> storage location Laggwith child f:
>
>  *
>
>     valueAt(Lagg) is a record, and
>
>  *
>
>     valueAt(child(Lagg, f)) = child(valueAt(Lagg), f),
>
> where valueAtmaps storage locations to values.
>
>
> In our first iteration, the only basic values that we will track are 
> boolean values. Additional values and variables may be relevant, but 
> only in so far as they constitute part of a boolean expression that we 
> care about. For example, if (x > 5) { … }will result in our tracking x 
> > 5, but not any particular knowledge about the integer-valued variable x.
>
>
> The path conditions and the model refer to the same sets of values and 
> locations.
>
>
>       Accounting for infinite height
>
> Despite the simplicity of our storage model, its elements can grow 
> infinitely large: for example, path conditions on loop back edges, 
> records when modeling linked lists, and even just the number of 
> locations for a simple for-loop.  Therefore, we provide the user with 
> two opportunities to interpret the abstract values so as to bound it: 
> comparison of abstract values and a merge operation at control-flow joins.
>
>
> // Compares values from two different environments for semantic 
> equivalence.
>
> boolCompare(
>
>       clang::dataflow::Value*value1,clang::dataflow::Environment&env1,
>
>       clang::dataflow::Value*value2,clang::dataflow::Environment&env2);
>
>
> // Given `value1` (w.r.t. `env1`) and `value2` (w.r.t. `env2`), returns a
>
> // `Value` (w.r.t. `merged_env`) that approximates `value1` and 
> `value2`. This
>
> // could be a strict lattice join, or a more general widening operation.
>
>   clang::dataflow::Value*Merge(
>
>       clang::dataflow::Value*value1,clang::dataflow::Environment&env1,
>
>       clang::dataflow::Value*value2,clang::dataflow::Environment&env2,
>
>       clang::dataflow::Environment&merged_env);
>
>
>     Verification by SAT Solver
>
> The path conditions are critical in refining our approximation of the 
> local variables and heap. At an abstract level, they don’t need to 
> directly influence the execution of the dataflow analysis. Instead, we 
> can imagine that, once concluded, our results are annotated 
> with/incorporate path conditions, which in turn allows us to make more 
> precise conclusions about our program from the analysis results.
>
>
> For example, if the analysis concludes that a variable conditionally 
> holds a non-null pointer, and, for a given program point, the path 
> condition implies said condition, we can conclude that a dereference 
> of the pointer at that point is safe. In this sense, the verification 
> of the safety of a deference is separate from the analysis, which 
> models memory states with formulae. That said, invocation of the 
> solver need not wait until after the analysis has concluded. It can 
> also be used during the analysis, for example, in defining the 
> widening operation in Merge.
>
>
> With that, we can see the role of an SAT solver: to solve questions of 
> the form “Does the path condition satisfy the pre-condition computed 
> for this pointer variable at this program point”? To that end, we 
> include an API for asking such questions of a SAT solver. The API is 
> compatible with Z3 and we also include our own (simple) SAT solver, 
> for users who may not want to integrate with Z3.
>
>
>     Testing dataflow analysis
>
> We provide test infrastructure that allows users to easily write tests 
> that make assertions about a program state computed by dataflow 
> analysis at any code point. Code is annotated with labels on points of 
> interest and then test expectations are written with respect to the 
> labeled points.
>
>
> For example, consider an analysis that computes the state of 
> optionalvalues.
>
> ExpectDataflowAnalysis<OptionalChecker>(R"(
>
> void f(std::optional<int> opt) {
>
>   if (opt.has_value()) {
>
>     // [[true-branch]]
>
>   } else {
>
>     // [[false-branch]]
>
>   }
>
> })",[](constauto&results){
>
>   EXPECT_LE(results.ValueAtCodePoint("true-branch"),OptionalLattice::Engaged());
>
>   EXPECT_LE(results.ValueAtCodePoint("false-branch"),OptionalLattice::NullOpt());
>
> });
>
>
>   Timeline
>
> We have a fully working implementation of the framework as proposed 
> here. We also are interested in contributing a clang-tidy checker for 
> std::optionaland related types as soon as possible, given its 
> demonstrated efficacy and interest from third parties.  To that end, 
> as soon as we have consensus for starting this work in the clang 
> repository, we plan to start sending patches for review. Given that 
> the framework is relatively small, we expect it will fit within 5 
> (largish) patches, meaning we can hope to have it available by 
> December 2021.
>
>
>   Future Work
>
> We consider our first version of the framework as experimental. It 
> suits the needs of a number of particular checks that we’ve been 
> writing, but it has some obvious limitations with respect to general 
> use.  After we commit our first version, here are some planned areas 
> of improvement and investigation.
>
>
> Backwards analyses.The framework only solves forward dataflow 
> equations. This clearly prevents implementing some very familiar and 
> useful analyses, like variable liveness. We intend to expand the 
> framework to also support backwards dataflow equations. There’s 
> nothing inherent to our framework that limits it to forwards.
>
>
> Reasoning about non-boolean values.Currently, we can symbolically 
> model equality of booleans, but nothing else. Yet, for locations on 
> the store that haven’t changed, we can understand which values are 
> identical (because the input value object is passed by the transfer 
> function to the output). For example,
>
> voidf(bool*a,bool*b){
>
>     std::optional<int>opt;
>
> if(a ==b){opt =42;}
>
> if(a ==b){opt.value();}
>
> }
>
> Since neither anor bhave changed, we can reason that the two 
> occurrences of the expression a == bevaluate to the same value (even 
> though we don't know the exact value). We intend to improve the 
> framework to support this increased precision, given that it will not 
> cost us any additional complexity.
>
>
> Improved model of program state.A major obstacle to writing 
> high-quality checks for C++ code is the complexity of C++’s 
> semantics.  In writing our framework, we’ve come to see the value of 
> shielding the user from the need to handle these directly by modeling 
> program state and letting the user express their check in terms of 
> that model, after it has been solved for each program point. As is, 
> our model currently supports a number of checks, including the 
> std::optionalcheck and another that infers which pointer parameters 
> are out parameters. We intend to expand the set of such checks, with 
> improvements to the model and the API for expressing domain-specific 
> semantics relevant to a check. Additionally, we are considering 
> explicitly tracking other abstract program properties, like reads and 
> writes to local variables and memory, so that users need not account 
> for all the syntactic variants of these operations.
>
>
> Refactoring existing checks.Once the framework is more mature and 
> stable, we would like to refactor existing checks that use custom, 
> ad-hoc implementations of dataflow to use this framework.
>
>
>   Conclusion
>
> We have proposed an addition to the Clang repository of a new static 
> analysis framework targeting all-pathsanalyses. The intended clients 
> of such analyses are code-checking and transformation tools like 
> clang-tidy, where sound analyses can support sound transformations of 
> code. We discussed the key features and APIs of our framework and how 
> they fit together into a new analysis framework. Moreover, our 
> proposal is based on experience with a working implementation that 
> supports a number of practical, high-precision analyses, including one 
> for safe use of types like std::optionaland another that finds missed 
> opportunities for use of std::move.
>
>
>