[cfe-dev] [RFC] A dataflow analysis framework for Clang AST

Wed Oct 13 05:09:16 PDT 2021

>
> I'm also very curious about your SAT solver implementation; I have a
>
> recent hobby of plugging various solvers into the static analyzer (cf.
>
> https://reviews.llvm.org/D110125) because it obviously needs some -
>
> arguably as much as your endeavor.
>
>

Yes, we'd love to tell you more, but I'll have to defer to sgatev@, since
> he's the author and most familiar with the details.

Currently, we convert a path condition boolean formula to CNF and use a
basic DPLL <https://en.wikipedia.org/wiki/DPLL_algorithm> solver to check
validity. The solver is not really optimized to handle large formulae as it
performs some unnecessary copies in the recursive step, but this hasn't
been a problem so far as the analysis for safe use of std::optional doesn't
produce large formulae. Still, I believe we can improve the performance of
the solver while keeping the implementation lightweight. We're also working
on a more general interface that would allow plugging different solver
implementations in the framework.

Cheers,
Stanislav

On Wed, Oct 13, 2021 at 3:40 AM Yitzhak Mandelbaum <yitzhakm at google.com>
wrote:

> (for real this time)
>
> On Tue, Oct 12, 2021 at 9:39 PM Yitzhak Mandelbaum <yitzhakm at google.com>
> wrote:
>
>> -vsavchenko at apple.com (which keeps bouncing)
>>
>> On Tue, Oct 12, 2021 at 8:52 PM Yitzhak Mandelbaum <yitzhakm at google.com>
>> wrote:
>>
>>> Gabor, Artem,
>>>
>>> Thanks for your support, and, Artem, thank you for quoting as plain text
>>> -- you're right, the system shunted it to the moderator because of its size.
>>>
>>> To answer some of the questions/comments raised:
>>> > There is a nice summary of some of the ad-hoc solutions we have in
>>> Clang in this survey: [cfe-dev] A survey of dataflow analyses in Clang
>>> (llvm.org)
>>> <https://lists.llvm.org/pipermail/cfe-dev/2020-October/066937.html>
>>> > Do you plan to replace the ad-hoc solutions in the long run?
>>>
>>> We mention this briefly in the RFC as motivation, but to reiterate here:
>>> we'd like the framework to at least be *suitable* to replace these ad-hoc
>>> solutions. Whether we would get to this ourselves or encourage others,
>>> review patches, etc. I can't say at this point.  I suspect we'll refactor
>>> at least a few ourselves just to test out the robustness of the framework.
>>>
>>> Just to reiterate, you want to abstract away both from the dataflow
>>>> algorithm and the semantics of C++. E.g., the user might not need to handle
>>>> the myriad of ways someone can assign a value to a variable (overloaded
>>>> operator=, output argument, primitive assignment, etc), only a single
>>>> abstract concept of assignment.
>>>
>>>
>>> Exactly!  Our first version already does this to some extent, we just
>>> feel it's somewhat tailored to our current needs and could use some
>>> generalization.  For example, it would be great to know when an lvalue is
>>> read for the last time and written to another lvalue, because that would
>>> suggest a potential `std::move`, among other things. Yet, that balance of
>>> detail and abstraction is not yet available. We only model the *effects* of
>>> reads and writes without explicitly abstracting them (say, tagging each CFG
>>> node w/ the set of reads and writes).  So, we think there's room to explore
>>> the design space here, but also think what we have will be valuable to a
>>> number of analyses right away.
>>>
>>> While most of the checks in CSA are using symbolic execution, there are
>>>> also some syntactic rules. One dilemma of implementing a new check is where
>>>> to put it (CSA? Tidy? Warnings?). While it would be great to get rid of
>>>> this dilemma, I think we want to keep all of the options open. I just
>>>> wanted to make it explicit what kinds of layering constraints we have. It
>>>> should be possible to add CSA checks/features using the new dataflow
>>>> framework.
>>>
>>>
>>> Great point! We hadn't thought about this possibility, but there's no
>>> good reason to rule out use in CSA.  Our key layering constraint is that it
>>> be available in clang-tidy. So, if we put it somewhere central, like
>>> clang/include/clang/Analysis, would that be suitable for use in CSA? We're
>>> certainly open to other suggestions as well.
>>>
>>> Valeriy's -Wcompletion-handler being
>>>> one of the notable latest additions - a backwards analysis over finite
>>>> state lattice with a sophisticated system of notes attached to the
>>>> warning to explain what's happening (generalizing over the latter may
>>>> possibly be a goal for your framework as well).
>>>
>>>
>>> Yes, we're definitely interested in being able to produce warnings that
>>> give the user a clear idea of what is wrong and (if possible) how to fix
>>> it.  This is complicated once we involve the SAT solver, since it tends to
>>> give us simple yes/no answers. But, actionable warnings seem crucial for
>>> successful user adoption.
>>>
>>> I'm also very curious about your SAT solver implementation; I have a
>>>> recent hobby of plugging various solvers into the static analyzer (cf.
>>>> https://reviews.llvm.org/D110125) because it obviously needs some -
>>>> arguably as much as your endeavor.
>>>
>>>
>>> Yes, we'd love to tell you more, but I'll have to defer to sgatev@,
>>> since he's the author and most familiar with the details.
>>>
>>> Cheers,
>>> Yitzhak
>>>
>>>
>>> On Tue, Oct 12, 2021 at 5:34 PM Artem Dergachev <noqnoqneo at gmail.com>
>>> wrote:
>>>
>>>> Sounds like the mail didn't hit the mailing list due to size
>>>> restriction. I'm quoting in plain text now to make sure it definitely
>>>> hits the list in some form.
>>>>
>>>> Like Gabor already said, It's amazing that this is happening. I
>>>> completely agree with your goals and I think we've discussed a lot of
>>>> this offline earlier. The amount of existing flow-sensitive warnings
>>>> may
>>>> be low but it's steadily growing, Valeriy's -Wcompletion-handler being
>>>> one of the notable latest additions - a backwards analysis over finite
>>>> state lattice with a sophisticated system of notes attached to the
>>>> warning to explain what's happening (generalizing over the latter may
>>>> possibly be a goal for your framework as well).
>>>>
>>>> Aside from the ability to cover *the other 50% of warnings* compared to
>>>> what the static analyzer already covers (may-problems vs. must-problems
>>>> aka all-paths problems), another strong benefit of flow sensitive
>>>> warnings is their performance which allows us to use them as compiler
>>>> warnings (as opposed to separate tools like clang-tidy or static
>>>> analyzer). This in turn delivers the warning to every user of the
>>>> compiler which increases the user coverage dramatically. This
>>>> consideration may even apply to may-problems in which the static
>>>> analyzer already excels technically.
>>>>
>>>> I'm also very curious about your SAT solver implementation; I have a
>>>> recent hobby of plugging various solvers into the static analyzer (cf.
>>>> https://reviews.llvm.org/D110125) because it obviously needs some -
>>>> arguably as much as your endeavor.
>>>>
>>>>
>>>> On 10/12/21 8:32 AM, Yitzhak Mandelbaum wrote:
>>>> > Summary
>>>> >
>>>> > We propose a new static analysis framework for implementing Clang
>>>> > bug-finding and refactoring tools (including ClangTidy checks) based
>>>> > on a dataflow algorithm.
>>>> >
>>>> >
>>>> > Some ClangTidy checks and core compiler warnings are already
>>>> > dataflowbased. However, due to a lack of a reusable framework, all
>>>> > reimplement common dataflow-analysis and CFG-traversal algorithms. In
>>>> > addition to supporting new checks, the framework should allow us to
>>>> > clean up those parts of Clang and ClangTidy.
>>>> >
>>>> >
>>>> > We intend that the new framework should have a learning curve similar
>>>> > to AST matchers and not require prior expertise in static analysis.
>>>> > Our goal is a simple API that isolates the user from the dataflow
>>>> > algorithm itself. The initial version of the framework combines
>>>> > support for classic dataflow analysesover the Clang CFG with symbolic
>>>> > evaluation of the code and a simple model of local variables and the
>>>> > heap. A number of useful checks can be expressed solely in terms of
>>>> > this model, saving their implementers from having to reimplement
>>>> > modeling of the semantics of C++ AST nodes. Ultimately, we want to
>>>> > extend this convenience to a wide variety of checks, so that they can
>>>> > be written in terms of the abstract semantics of the program (as
>>>> > modeled by the framework), while only matching and modeling AST nodes
>>>> > for key parts of their domain (e.g., custom data types or functions).
>>>> >
>>>> >
>>>> >   Motivation
>>>> >
>>>> > Clang has a few subsystems that greatly help with implementing
>>>> > bug-finding and refactoring tools:
>>>> >
>>>> >  *
>>>> >
>>>> >     AST matchersallow one to easily find program fragments whose AST
>>>> >     matches a given pattern. This works well for finding program
>>>> >     fragments whose desired semantics translates to a small number of
>>>> >     possible syntactic expressions (e.g., "find a call to
>>>> >     strlen()where the argument is a null pointer constant"). Note that
>>>> >     patterns expressed by AST matchers often check program properties
>>>> >     that hold for allexecution paths.
>>>> >
>>>> >  *
>>>> >
>>>> >     Clang Static Analyzercan find program paths that end in a state
>>>> >     that satisfies a specific user-defined property, regardless of how
>>>> >     the code was expressed syntactically (e.g., "find a call to
>>>> >     strlen()where the argument is dynamically equal to null"). CSA
>>>> >     abstracts from the syntax of the code, letting analyses focus on
>>>> >     its semantics. While not completely eliding syntax, symbolic
>>>> >     evaluation abstracts over many syntactic details. Semantically
>>>> >     equivalent program fragments result in equivalent program states
>>>> >     computed by the CSA.
>>>> >
>>>> >
>>>> > When implementing certain bug-finding and refactoring tools, it is
>>>> > often necessary to find program fragments where a given property
>>>> holds
>>>> > on all execution paths, while placing as few demands as possible on
>>>> > how the user writes their code. The proposed dataflow analysis
>>>> > framework aims to satisfy both of these needs simultaneously.
>>>> >
>>>> >
>>>> >     Example: Ensuring that std::optionalis used safely
>>>> >
>>>> > Consider implementing a static analysis tool that proves that every
>>>> > time a std::optionalis unwrapped, the program has ensured that the
>>>> > precondition of the unwrap operation (the optional has a value) is
>>>> > satisfied.
>>>> >
>>>> >
>>>> > voidTestOK(std::optional<int>x){
>>>> >
>>>> > if(x.has_value()){
>>>> >
>>>> > use(x.value());
>>>> >
>>>> > }
>>>> >
>>>> > }
>>>> >
>>>> >
>>>> > voidTestWithWarning(std::optional<int>x){
>>>> >
>>>> > use(x.value());// warning: unchecked access to optional value
>>>> >
>>>> > }
>>>> >
>>>> >
>>>> > If we implemented this check in CSA, it would likely find the bug in
>>>> > TestWithWarning, and would not report anything in TestOK. However,
>>>> CSA
>>>> > did not prove the desired property (all optional unwraps are checked
>>>> > on all paths) for TestOK; CSA just could not find a path through
>>>> > TestOKwhere this property is violated.
>>>> >
>>>> >
>>>> > A simple function like TestOKonly has two paths and would be
>>>> > completely explored by the CSA. However, when loops are involved, CSA
>>>> > uses heuristics to decide when to stop exploring paths in a function.
>>>> > As a result, CSA would miss this bug:
>>>> >
>>>> >
>>>> > voidMissedBugByCSA(std::optional<int>x){
>>>> >
>>>> > for(inti =0;i <10;i++){
>>>> >
>>>> > if(i <5){
>>>> >
>>>> > use(i);
>>>> >
>>>> > }else{
>>>> >
>>>> > use(x.value());// CSA does not report a warning
>>>> >
>>>> > }
>>>> >
>>>> > }
>>>> >
>>>> > }
>>>> >
>>>> >
>>>> > If we implemented this check with AST matchers (as typically done in
>>>> > ClangTidy), we would have to enumerate way too many syntactic
>>>> > patterns. Consider just a couple coding variations that users expect
>>>> > this analysis to understand to be safe:
>>>> >
>>>> >
>>>> > voidTestOK(std::optional<int>x){
>>>> >
>>>> > if(x){
>>>> >
>>>> > use(x.value());
>>>> >
>>>> > }
>>>> >
>>>> > }
>>>> >
>>>> >
>>>> > voidTestOK(std::optional<int>x){
>>>> >
>>>> > if(x.has_value()==true){
>>>> >
>>>> > use(x.value());
>>>> >
>>>> > }
>>>> >
>>>> > }
>>>> >
>>>> >
>>>> > voidTestOK(std::optional<int>x){
>>>> >
>>>> > if(!x.has_value()){
>>>> >
>>>> > use(0);
>>>> >
>>>> > }else{
>>>> >
>>>> > use(x.value());
>>>> >
>>>> > }
>>>> >
>>>> > }
>>>> >
>>>> >
>>>> > voidTestOK(std::optional<int>x){
>>>> >
>>>> > if(!x.has_value())
>>>> >
>>>> > return;
>>>> >
>>>> > use(x.value());
>>>> >
>>>> > }
>>>> >
>>>> >
>>>> > To summarize, CSA can find someoptional-unwrapping bugs, but it does
>>>> > not prove that the code performs a check on allpaths. AST matchers
>>>> can
>>>> > prove it for a finite set of hard-coded syntactic cases. But, when
>>>> the
>>>> > set of safe patterns is infinite, as is often the case, AST matchers
>>>> > can only capture a subset. Capturing the full set requires building a
>>>> > framework like the one presented in this document.
>>>> >
>>>> >
>>>> >     Comparison
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > AST Matchers Library
>>>> >
>>>> >
>>>> >
>>>> > Clang Static Analyzer
>>>> >
>>>> >
>>>> >
>>>> > Dataflow Analysis Framework
>>>> >
>>>> > Proves that a property holds on all execution paths
>>>> >
>>>> >
>>>> >
>>>> > Possible, depends on the matcher
>>>> >
>>>> >
>>>> >
>>>> > No
>>>> >
>>>> >
>>>> >
>>>> > Yes
>>>> >
>>>> > Checks can be written in terms of program semantics
>>>> >
>>>> >
>>>> >
>>>> > No
>>>> >
>>>> >
>>>> >
>>>> > Yes
>>>> >
>>>> >
>>>> >
>>>> > Yes
>>>> >
>>>> >
>>>> >     Further Examples
>>>> >
>>>> > While we have started with a std::optionalchecker, we have a number
>>>> of
>>>> > further use cases where this framework can help. Note that some of
>>>> > these are already implemented in complex, specialized checks, either
>>>> > directly in the compiler or as clang-tidy checks. A key goal of our
>>>> > framework is to simplify the implementation of such checks in the
>>>> future.
>>>> >
>>>> >  *
>>>> >
>>>> >     “the pointer may be null when dereferenced”, “the pointer is
>>>> >     always null when dereferenced”,
>>>> >
>>>> >  *
>>>> >
>>>> >     "unnecessary null check on a pointer passed to delete/free",
>>>> >
>>>> >  *
>>>> >
>>>> >     “this raw pointer variable could be refactored into a
>>>> >     std::unique_ptr”,
>>>> >
>>>> >  *
>>>> >
>>>> >     “expression always evaluates to a constant”, “if condition always
>>>> >     evaluates to true/false”,
>>>> >
>>>> >  *
>>>> >
>>>> >     “basic block is never executed”, “loop is executed at most once”,
>>>> >
>>>> >  *
>>>> >
>>>> >     “assigned value is always overwritten”,
>>>> >
>>>> >  *
>>>> >
>>>> >     “value was used after having been moved”,
>>>> >
>>>> >  *
>>>> >
>>>> >     “value guarded by a lock was used without taking the lock”,
>>>> >
>>>> >  *
>>>> >
>>>> >     “unnecessary existence check before inserting a value into a
>>>> >     std::map”.
>>>> >
>>>> >
>>>> > Some of these use cases are covered by existing ClangTidy checks that
>>>> > are implemented with pattern matching. We believe there is an
>>>> > opportunity to simplify their implementation by refactoring them to
>>>> > use the dataflow framework, and simultaneously improve the coverage
>>>> > and precision of how they model C++ semantics.
>>>> >
>>>> >
>>>> >   Overview
>>>> >
>>>> > The initial version of our framework provides the following key
>>>> features:
>>>> >
>>>> >  *
>>>> >
>>>> >     a forward dataflow algorithm, parameterized by a user-defined
>>>> >     value domain and a transfer function, which describes how program
>>>> >     statements modify domain values.
>>>> >
>>>> >  *
>>>> >
>>>> >     a built-in model of local variables and the heap. This model
>>>> >     provides two benefits to users: it tracks the flow of values,
>>>> >     accounting for a wide variety of C++ language constructs, in a way
>>>> >     that is reusable across many different analyses; it discriminates
>>>> >     value approximations based on path conditions, allowing for higher
>>>> >     precision than dataflow alone.
>>>> >
>>>> >  *
>>>> >
>>>> >     an interface to an SMT solver to verify safety of operations like
>>>> >     dereferencing a pointer,
>>>> >
>>>> >  *
>>>> >
>>>> >     test infrastructure for validating results of the dataflow
>>>> algorithm.
>>>> >
>>>> >
>>>> >     Dataflow Algorithm
>>>> >
>>>> > Our framework solves forward dataflow equations over a product of a
>>>> > user-defined value domain and a built-in model of local variables and
>>>> > the heap, called the environment. The user implements the transfer
>>>> > function for their value domain, while the framework implements the
>>>> > transfer function that operates on the environment, only requiring a
>>>> > user-defined widening operator to avoid infinite growth of the
>>>> > environment model. We intend to expand the framework to also support
>>>> > backwards dataflow equations at some point in the future. The
>>>> > limitation to the forward direction is not essential.
>>>> >
>>>> >
>>>> > In more detail, the algorithm is parameterized by:
>>>> >
>>>> >
>>>> >  *
>>>> >
>>>> >     a user-defined, finite-height partial order, with a join function
>>>> >     and maximal element,
>>>> >
>>>> >  *
>>>> >
>>>> >     an initial element from the user-defined lattice,
>>>> >
>>>> >  *
>>>> >
>>>> >     a (monotonic) transfer function for statements, which maps between
>>>> >     lattice elements,
>>>> >
>>>> >  *
>>>> >
>>>> >     comparison and merge functions for environment values.
>>>> >
>>>> >
>>>> > We expand on the first three elements here and return to the
>>>> > comparison and merge functions in the next section.
>>>> >
>>>> >
>>>> >       The Lattice and Analysis Abstractions
>>>> >
>>>> > We start with a representation of a lattice:
>>>> >
>>>> > classBoundedJoinSemiLattice{
>>>> >
>>>> > public:
>>>> >
>>>> > friendbooloperator==(constBoundedJoinSemiLattice&lhs,
>>>> >
>>>> > constBoundedJoinSemiLattice&rhs);
>>>> >
>>>> > friendbooloperator<=(constBoundedJoinSemiLattice&lhs,
>>>> >
>>>> > constBoundedJoinSemiLattice&rhs);
>>>> >
>>>> > staticBoundedJoinSemiLatticeTop();
>>>> >
>>>> > enumclassJoinEffect{
>>>> >
>>>> > Changed,
>>>> >
>>>> > Unchanged,
>>>> >
>>>> > };
>>>> >
>>>> > JoinEffectJoin(constBoundedJoinSemiLattice&element);
>>>> >
>>>> > };
>>>> >
>>>> > Not all of these declarations are necessary. Specifically,
>>>> > operator==and operator<=can be implemented in terms of Joinand
>>>> > Topisn’t used by the dataflow algorithm. We include the operators
>>>> > because direct implementations are likely more efficient, and
>>>> > Topbecause it provides evidence of a well-founded bounded join
>>>> > semi-lattice.
>>>> >
>>>> >
>>>> > The Joinoperation is unusual in its choice of mutating the object
>>>> > instance and returning an indication of whether the join resulted in
>>>> > any changes. A key step of the fixpoint algorithm is code like j =
>>>> > Join(x, y); if (j == x) …. This formulation of Joinallows us to
>>>> > express this code instead as if (x.Join(y) == Unchanged).... For
>>>> > lattices whose instances are expensive to create or compare, we
>>>> expect
>>>> > this formulation to improve the performance of the dataflow analysis.
>>>> >
>>>> >
>>>> > Based on the lattice abstraction, our analysis is similarly
>>>> traditional:
>>>> >
>>>> > classDataflowAnalysis{
>>>> >
>>>> > public:
>>>> >
>>>> > usingBoundedJoinSemiLattice=…;
>>>> >
>>>> > BoundedJoinSemiLatticeInitialElement();
>>>> >
>>>> > BoundedJoinSemiLatticeTransferStmt(
>>>> >
>>>> > constclang::Stmt*stmt,constBoundedJoinSemiLattice&element,
>>>> >
>>>> >       clang::dataflow::Environment&environment);
>>>> >
>>>> > };
>>>> >
>>>> >
>>>> >     Environment: A Built-in Lattice
>>>> >
>>>> > All user-defined operations have access to an environment, which
>>>> > encapsulates the program context of whatever program element is being
>>>> > considered. It contains:
>>>> >
>>>> >  *
>>>> >
>>>> >     a path condition,
>>>> >
>>>> >  *
>>>> >
>>>> >     a storage model (which models both local variables and the heap).
>>>> >
>>>> >
>>>> > The built-in transfer functions model the effects of C++ language
>>>> > constructs (for example, variable declarations, assignments, pointer
>>>> > dereferences, arithmetic etc.) by manipulating the environment,
>>>> saving
>>>> > the user from doing so in custom transfer functions.
>>>> >
>>>> >
>>>> > The path condition accumulates the set of boolean conditions that are
>>>> > known to be true on every path from function entry to the current
>>>> > program point.
>>>> >
>>>> >
>>>> > The environment maintains the maps from program declarations and
>>>> > pointer values to storage locations. It also maintains the map of
>>>> > storage locations to abstract values. Storage locations can be atomic
>>>> > (“scalar”) or aggregate multiple sublocations:
>>>> >
>>>> >  *
>>>> >
>>>> >     ScalarStorageLocationis a storage location that is not subdivided
>>>> >     further for the purposes of abstract interpretation, for example
>>>> >     bool, int*.
>>>> >
>>>> >  *
>>>> >
>>>> >     AggregateStorageLocationis a storage location which can be
>>>> >     subdivided into smaller storage locations that can be
>>>> >     independently tracked by abstract interpretation. For example, a
>>>> >     struct with public members.
>>>> >
>>>> >
>>>> > In addition to this storage structure, our model tracks three kinds
>>>> of
>>>> > values: basic values (like integers and booleans), pointers (which
>>>> > reify storage location into the value space) and records with named
>>>> > fields, where each field is itself a value. For aggregate values, we
>>>> > ensure coherence between the structure represented in storage
>>>> > locations and that represented in the values: for a given aggregate
>>>> > storage location Laggwith child f:
>>>> >
>>>> >  *
>>>> >
>>>> >     valueAt(Lagg) is a record, and
>>>> >
>>>> >  *
>>>> >
>>>> >     valueAt(child(Lagg, f)) = child(valueAt(Lagg), f),
>>>> >
>>>> > where valueAtmaps storage locations to values.
>>>> >
>>>> >
>>>> > In our first iteration, the only basic values that we will track are
>>>> > boolean values. Additional values and variables may be relevant, but
>>>> > only in so far as they constitute part of a boolean expression that
>>>> we
>>>> > care about. For example, if (x > 5) { … }will result in our tracking
>>>> x
>>>> > > 5, but not any particular knowledge about the integer-valued
>>>> variable x.
>>>> >
>>>> >
>>>> > The path conditions and the model refer to the same sets of values
>>>> and
>>>> > locations.
>>>> >
>>>> >
>>>> >       Accounting for infinite height
>>>> >
>>>> > Despite the simplicity of our storage model, its elements can grow
>>>> > infinitely large: for example, path conditions on loop back edges,
>>>> > records when modeling linked lists, and even just the number of
>>>> > locations for a simple for-loop.  Therefore, we provide the user with
>>>> > two opportunities to interpret the abstract values so as to bound it:
>>>> > comparison of abstract values and a merge operation at control-flow
>>>> joins.
>>>> >
>>>> >
>>>> > // Compares values from two different environments for semantic
>>>> > equivalence.
>>>> >
>>>> > boolCompare(
>>>> >
>>>> >       clang::dataflow::Value*value1,clang::dataflow::Environment&env1,
>>>> >
>>>> >
>>>>       clang::dataflow::Value*value2,clang::dataflow::Environment&env2);
>>>> >
>>>> >
>>>> > // Given `value1` (w.r.t. `env1`) and `value2` (w.r.t. `env2`),
>>>> returns a
>>>> >
>>>> > // `Value` (w.r.t. `merged_env`) that approximates `value1` and
>>>> > `value2`. This
>>>> >
>>>> > // could be a strict lattice join, or a more general widening
>>>> operation.
>>>> >
>>>> >   clang::dataflow::Value*Merge(
>>>> >
>>>> >       clang::dataflow::Value*value1,clang::dataflow::Environment&env1,
>>>> >
>>>> >       clang::dataflow::Value*value2,clang::dataflow::Environment&env2,
>>>> >
>>>> >       clang::dataflow::Environment&merged_env);
>>>> >
>>>> >
>>>> >     Verification by SAT Solver
>>>> >
>>>> > The path conditions are critical in refining our approximation of the
>>>> > local variables and heap. At an abstract level, they don’t need to
>>>> > directly influence the execution of the dataflow analysis. Instead,
>>>> we
>>>> > can imagine that, once concluded, our results are annotated
>>>> > with/incorporate path conditions, which in turn allows us to make
>>>> more
>>>> > precise conclusions about our program from the analysis results.
>>>> >
>>>> >
>>>> > For example, if the analysis concludes that a variable conditionally
>>>> > holds a non-null pointer, and, for a given program point, the path
>>>> > condition implies said condition, we can conclude that a dereference
>>>> > of the pointer at that point is safe. In this sense, the verification
>>>> > of the safety of a deference is separate from the analysis, which
>>>> > models memory states with formulae. That said, invocation of the
>>>> > solver need not wait until after the analysis has concluded. It can
>>>> > also be used during the analysis, for example, in defining the
>>>> > widening operation in Merge.
>>>> >
>>>> >
>>>> > With that, we can see the role of an SAT solver: to solve questions
>>>> of
>>>> > the form “Does the path condition satisfy the pre-condition computed
>>>> > for this pointer variable at this program point”? To that end, we
>>>> > include an API for asking such questions of a SAT solver. The API is
>>>> > compatible with Z3 and we also include our own (simple) SAT solver,
>>>> > for users who may not want to integrate with Z3.
>>>> >
>>>> >
>>>> >     Testing dataflow analysis
>>>> >
>>>> > We provide test infrastructure that allows users to easily write
>>>> tests
>>>> > that make assertions about a program state computed by dataflow
>>>> > analysis at any code point. Code is annotated with labels on points
>>>> of
>>>> > interest and then test expectations are written with respect to the
>>>> > labeled points.
>>>> >
>>>> >
>>>> > For example, consider an analysis that computes the state of
>>>> > optionalvalues.
>>>> >
>>>> > ExpectDataflowAnalysis<OptionalChecker>(R"(
>>>> >
>>>> > void f(std::optional<int> opt) {
>>>> >
>>>> >   if (opt.has_value()) {
>>>> >
>>>> >     // [[true-branch]]
>>>> >
>>>> >   } else {
>>>> >
>>>> >     // [[false-branch]]
>>>> >
>>>> >   }
>>>> >
>>>> > })",[](constauto&results){
>>>> >
>>>> >
>>>>   EXPECT_LE(results.ValueAtCodePoint("true-branch"),OptionalLattice::Engaged());
>>>> >
>>>> >
>>>>   EXPECT_LE(results.ValueAtCodePoint("false-branch"),OptionalLattice::NullOpt());
>>>> >
>>>> > });
>>>> >
>>>> >
>>>> >   Timeline
>>>> >
>>>> > We have a fully working implementation of the framework as proposed
>>>> > here. We also are interested in contributing a clang-tidy checker for
>>>> > std::optionaland related types as soon as possible, given its
>>>> > demonstrated efficacy and interest from third parties.  To that end,
>>>> > as soon as we have consensus for starting this work in the clang
>>>> > repository, we plan to start sending patches for review. Given that
>>>> > the framework is relatively small, we expect it will fit within 5
>>>> > (largish) patches, meaning we can hope to have it available by
>>>> > December 2021.
>>>> >
>>>> >
>>>> >   Future Work
>>>> >
>>>> > We consider our first version of the framework as experimental. It
>>>> > suits the needs of a number of particular checks that we’ve been
>>>> > writing, but it has some obvious limitations with respect to general
>>>> > use.  After we commit our first version, here are some planned areas
>>>> > of improvement and investigation.
>>>> >
>>>> >
>>>> > Backwards analyses.The framework only solves forward dataflow
>>>> > equations. This clearly prevents implementing some very familiar and
>>>> > useful analyses, like variable liveness. We intend to expand the
>>>> > framework to also support backwards dataflow equations. There’s
>>>> > nothing inherent to our framework that limits it to forwards.
>>>> >
>>>> >
>>>> > Reasoning about non-boolean values.Currently, we can symbolically
>>>> > model equality of booleans, but nothing else. Yet, for locations on
>>>> > the store that haven’t changed, we can understand which values are
>>>> > identical (because the input value object is passed by the transfer
>>>> > function to the output). For example,
>>>> >
>>>> > voidf(bool*a,bool*b){
>>>> >
>>>> >     std::optional<int>opt;
>>>> >
>>>> > if(a ==b){opt =42;}
>>>> >
>>>> > if(a ==b){opt.value();}
>>>> >
>>>> > }
>>>> >
>>>> > Since neither anor bhave changed, we can reason that the two
>>>> > occurrences of the expression a == bevaluate to the same value (even
>>>> > though we don't know the exact value). We intend to improve the
>>>> > framework to support this increased precision, given that it will not
>>>> > cost us any additional complexity.
>>>> >
>>>> >
>>>> > Improved model of program state.A major obstacle to writing
>>>> > high-quality checks for C++ code is the complexity of C++’s
>>>> > semantics.  In writing our framework, we’ve come to see the value of
>>>> > shielding the user from the need to handle these directly by modeling
>>>> > program state and letting the user express their check in terms of
>>>> > that model, after it has been solved for each program point. As is,
>>>> > our model currently supports a number of checks, including the
>>>> > std::optionalcheck and another that infers which pointer parameters
>>>> > are out parameters. We intend to expand the set of such checks, with
>>>> > improvements to the model and the API for expressing domain-specific
>>>> > semantics relevant to a check. Additionally, we are considering
>>>> > explicitly tracking other abstract program properties, like reads and
>>>> > writes to local variables and memory, so that users need not account
>>>> > for all the syntactic variants of these operations.
>>>> >
>>>> >
>>>> > Refactoring existing checks.Once the framework is more mature and
>>>> > stable, we would like to refactor existing checks that use custom,
>>>> > ad-hoc implementations of dataflow to use this framework.
>>>> >
>>>> >
>>>> >   Conclusion
>>>> >
>>>> > We have proposed an addition to the Clang repository of a new static
>>>> > analysis framework targeting all-pathsanalyses. The intended clients
>>>> > of such analyses are code-checking and transformation tools like
>>>> > clang-tidy, where sound analyses can support sound transformations of
>>>> > code. We discussed the key features and APIs of our framework and how
>>>> > they fit together into a new analysis framework. Moreover, our
>>>> > proposal is based on experience with a working implementation that
>>>> > supports a number of practical, high-precision analyses, including
>>>> one
>>>> > for safe use of types like std::optionaland another that finds missed
>>>> > opportunities for use of std::move.
>>>> >
>>>> >
>>>> >
>>>>
>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20211013/eaa74a36/attachment-0001.html>