[cfe-dev] [RFC] A dataflow analysis framework for Clang AST

Thu Oct 14 00:44:43 PDT 2021

(replying again because my first reply didn't end up in the mailing list)

> I'm also very curious about your SAT solver implementation; I have a
>
> recent hobby of plugging various solvers into the static analyzer (cf.
>
> https://reviews.llvm.org/D110125) because it obviously needs some -
>
> arguably as much as your endeavor.
>
>

Yes, we'd love to tell you more, but I'll have to defer to sgatev@, since
> he's the author and most familiar with the details.

Currently, we convert a path condition boolean formula to CNF and use a
basic DPLL <https://en.wikipedia.org/wiki/DPLL_algorithm> solver to check
validity. The solver is not really optimized to handle large formulae as it
performs some unnecessary copies in the recursive step, but this hasn't
been a problem so far as the analysis for safe use of std::optional doesn't
produce large formulae. Still, I believe we can improve the performance of
the solver while keeping the implementation lightweight. We're also working
on a more general interface that would allow plugging different solver
implementations in the framework.

Cheers,
Stanislav

On Wed, Oct 13, 2021 at 2:09 PM Stanislav Gatev <sgatev at google.com> wrote:

> I'm also very curious about your SAT solver implementation; I have a
>>
>> recent hobby of plugging various solvers into the static analyzer (cf.
>>
>> https://reviews.llvm.org/D110125) because it obviously needs some -
>>
>> arguably as much as your endeavor.
>>
>>
>
> Yes, we'd love to tell you more, but I'll have to defer to sgatev@, since
>> he's the author and most familiar with the details.
>
>
> Currently, we convert a path condition boolean formula to CNF and use a
> basic DPLL <https://en.wikipedia.org/wiki/DPLL_algorithm> solver to check
> validity. The solver is not really optimized to handle large formulae as it
> performs some unnecessary copies in the recursive step, but this hasn't
> been a problem so far as the analysis for safe use of std::optional doesn't
> produce large formulae. Still, I believe we can improve the performance of
> the solver while keeping the implementation lightweight. We're also working
> on a more general interface that would allow plugging different solver
> implementations in the framework.
>
> Cheers,
> Stanislav
>
> On Wed, Oct 13, 2021 at 3:40 AM Yitzhak Mandelbaum <yitzhakm at google.com>
> wrote:
>
>> (for real this time)
>>
>> On Tue, Oct 12, 2021 at 9:39 PM Yitzhak Mandelbaum <yitzhakm at google.com>
>> wrote:
>>
>>> -vsavchenko at apple.com (which keeps bouncing)
>>>
>>> On Tue, Oct 12, 2021 at 8:52 PM Yitzhak Mandelbaum <yitzhakm at google.com>
>>> wrote:
>>>
>>>> Gabor, Artem,
>>>>
>>>> Thanks for your support, and, Artem, thank you for quoting as plain
>>>> text -- you're right, the system shunted it to the moderator because of its
>>>> size.
>>>>
>>>> To answer some of the questions/comments raised:
>>>> > There is a nice summary of some of the ad-hoc solutions we have in
>>>> Clang in this survey: [cfe-dev] A survey of dataflow analyses in Clang
>>>> (llvm.org)
>>>> <https://lists.llvm.org/pipermail/cfe-dev/2020-October/066937.html>
>>>> > Do you plan to replace the ad-hoc solutions in the long run?
>>>>
>>>> We mention this briefly in the RFC as motivation, but to reiterate
>>>> here: we'd like the framework to at least be *suitable* to replace these
>>>> ad-hoc solutions. Whether we would get to this ourselves or encourage
>>>> others, review patches, etc. I can't say at this point.  I suspect we'll
>>>> refactor at least a few ourselves just to test out the robustness of the
>>>> framework.
>>>>
>>>> Just to reiterate, you want to abstract away both from the dataflow
>>>>> algorithm and the semantics of C++. E.g., the user might not need to handle
>>>>> the myriad of ways someone can assign a value to a variable (overloaded
>>>>> operator=, output argument, primitive assignment, etc), only a single
>>>>> abstract concept of assignment.
>>>>
>>>>
>>>> Exactly!  Our first version already does this to some extent, we just
>>>> feel it's somewhat tailored to our current needs and could use some
>>>> generalization.  For example, it would be great to know when an lvalue is
>>>> read for the last time and written to another lvalue, because that would
>>>> suggest a potential `std::move`, among other things. Yet, that balance of
>>>> detail and abstraction is not yet available. We only model the *effects* of
>>>> reads and writes without explicitly abstracting them (say, tagging each CFG
>>>> node w/ the set of reads and writes).  So, we think there's room to explore
>>>> the design space here, but also think what we have will be valuable to a
>>>> number of analyses right away.
>>>>
>>>> While most of the checks in CSA are using symbolic execution, there are
>>>>> also some syntactic rules. One dilemma of implementing a new check is where
>>>>> to put it (CSA? Tidy? Warnings?). While it would be great to get rid of
>>>>> this dilemma, I think we want to keep all of the options open. I just
>>>>> wanted to make it explicit what kinds of layering constraints we have. It
>>>>> should be possible to add CSA checks/features using the new dataflow
>>>>> framework.
>>>>
>>>>
>>>> Great point! We hadn't thought about this possibility, but there's no
>>>> good reason to rule out use in CSA.  Our key layering constraint is that it
>>>> be available in clang-tidy. So, if we put it somewhere central, like
>>>> clang/include/clang/Analysis, would that be suitable for use in CSA? We're
>>>> certainly open to other suggestions as well.
>>>>
>>>> Valeriy's -Wcompletion-handler being
>>>>> one of the notable latest additions - a backwards analysis over finite
>>>>> state lattice with a sophisticated system of notes attached to the
>>>>> warning to explain what's happening (generalizing over the latter may
>>>>> possibly be a goal for your framework as well).
>>>>
>>>>
>>>> Yes, we're definitely interested in being able to produce warnings that
>>>> give the user a clear idea of what is wrong and (if possible) how to fix
>>>> it.  This is complicated once we involve the SAT solver, since it tends to
>>>> give us simple yes/no answers. But, actionable warnings seem crucial for
>>>> successful user adoption.
>>>>
>>>> I'm also very curious about your SAT solver implementation; I have a
>>>>> recent hobby of plugging various solvers into the static analyzer (cf.
>>>>> https://reviews.llvm.org/D110125) because it obviously needs some -
>>>>> arguably as much as your endeavor.
>>>>
>>>>
>>>> Yes, we'd love to tell you more, but I'll have to defer to sgatev@,
>>>> since he's the author and most familiar with the details.
>>>>
>>>> Cheers,
>>>> Yitzhak
>>>>
>>>>
>>>> On Tue, Oct 12, 2021 at 5:34 PM Artem Dergachev <noqnoqneo at gmail.com>
>>>> wrote:
>>>>
>>>>> Sounds like the mail didn't hit the mailing list due to size
>>>>> restriction. I'm quoting in plain text now to make sure it definitely
>>>>> hits the list in some form.
>>>>>
>>>>> Like Gabor already said, It's amazing that this is happening. I
>>>>> completely agree with your goals and I think we've discussed a lot of
>>>>> this offline earlier. The amount of existing flow-sensitive warnings
>>>>> may
>>>>> be low but it's steadily growing, Valeriy's -Wcompletion-handler being
>>>>> one of the notable latest additions - a backwards analysis over finite
>>>>> state lattice with a sophisticated system of notes attached to the
>>>>> warning to explain what's happening (generalizing over the latter may
>>>>> possibly be a goal for your framework as well).
>>>>>
>>>>> Aside from the ability to cover *the other 50% of warnings* compared
>>>>> to
>>>>> what the static analyzer already covers (may-problems vs.
>>>>> must-problems
>>>>> aka all-paths problems), another strong benefit of flow sensitive
>>>>> warnings is their performance which allows us to use them as compiler
>>>>> warnings (as opposed to separate tools like clang-tidy or static
>>>>> analyzer). This in turn delivers the warning to every user of the
>>>>> compiler which increases the user coverage dramatically. This
>>>>> consideration may even apply to may-problems in which the static
>>>>> analyzer already excels technically.
>>>>>
>>>>> I'm also very curious about your SAT solver implementation; I have a
>>>>> recent hobby of plugging various solvers into the static analyzer (cf.
>>>>> https://reviews.llvm.org/D110125) because it obviously needs some -
>>>>> arguably as much as your endeavor.
>>>>>
>>>>>
>>>>> On 10/12/21 8:32 AM, Yitzhak Mandelbaum wrote:
>>>>> > Summary
>>>>> >
>>>>> > We propose a new static analysis framework for implementing Clang
>>>>> > bug-finding and refactoring tools (including ClangTidy checks) based
>>>>> > on a dataflow algorithm.
>>>>> >
>>>>> >
>>>>> > Some ClangTidy checks and core compiler warnings are already
>>>>> > dataflowbased. However, due to a lack of a reusable framework, all
>>>>> > reimplement common dataflow-analysis and CFG-traversal algorithms.
>>>>> In
>>>>> > addition to supporting new checks, the framework should allow us to
>>>>> > clean up those parts of Clang and ClangTidy.
>>>>> >
>>>>> >
>>>>> > We intend that the new framework should have a learning curve
>>>>> similar
>>>>> > to AST matchers and not require prior expertise in static analysis.
>>>>> > Our goal is a simple API that isolates the user from the dataflow
>>>>> > algorithm itself. The initial version of the framework combines
>>>>> > support for classic dataflow analysesover the Clang CFG with
>>>>> symbolic
>>>>> > evaluation of the code and a simple model of local variables and the
>>>>> > heap. A number of useful checks can be expressed solely in terms of
>>>>> > this model, saving their implementers from having to reimplement
>>>>> > modeling of the semantics of C++ AST nodes. Ultimately, we want to
>>>>> > extend this convenience to a wide variety of checks, so that they
>>>>> can
>>>>> > be written in terms of the abstract semantics of the program (as
>>>>> > modeled by the framework), while only matching and modeling AST
>>>>> nodes
>>>>> > for key parts of their domain (e.g., custom data types or functions).
>>>>> >
>>>>> >
>>>>> >   Motivation
>>>>> >
>>>>> > Clang has a few subsystems that greatly help with implementing
>>>>> > bug-finding and refactoring tools:
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     AST matchersallow one to easily find program fragments whose AST
>>>>> >     matches a given pattern. This works well for finding program
>>>>> >     fragments whose desired semantics translates to a small number of
>>>>> >     possible syntactic expressions (e.g., "find a call to
>>>>> >     strlen()where the argument is a null pointer constant"). Note
>>>>> that
>>>>> >     patterns expressed by AST matchers often check program properties
>>>>> >     that hold for allexecution paths.
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     Clang Static Analyzercan find program paths that end in a state
>>>>> >     that satisfies a specific user-defined property, regardless of
>>>>> how
>>>>> >     the code was expressed syntactically (e.g., "find a call to
>>>>> >     strlen()where the argument is dynamically equal to null"). CSA
>>>>> >     abstracts from the syntax of the code, letting analyses focus on
>>>>> >     its semantics. While not completely eliding syntax, symbolic
>>>>> >     evaluation abstracts over many syntactic details. Semantically
>>>>> >     equivalent program fragments result in equivalent program states
>>>>> >     computed by the CSA.
>>>>> >
>>>>> >
>>>>> > When implementing certain bug-finding and refactoring tools, it is
>>>>> > often necessary to find program fragments where a given property
>>>>> holds
>>>>> > on all execution paths, while placing as few demands as possible on
>>>>> > how the user writes their code. The proposed dataflow analysis
>>>>> > framework aims to satisfy both of these needs simultaneously.
>>>>> >
>>>>> >
>>>>> >     Example: Ensuring that std::optionalis used safely
>>>>> >
>>>>> > Consider implementing a static analysis tool that proves that every
>>>>> > time a std::optionalis unwrapped, the program has ensured that the
>>>>> > precondition of the unwrap operation (the optional has a value) is
>>>>> > satisfied.
>>>>> >
>>>>> >
>>>>> > voidTestOK(std::optional<int>x){
>>>>> >
>>>>> > if(x.has_value()){
>>>>> >
>>>>> > use(x.value());
>>>>> >
>>>>> > }
>>>>> >
>>>>> > }
>>>>> >
>>>>> >
>>>>> > voidTestWithWarning(std::optional<int>x){
>>>>> >
>>>>> > use(x.value());// warning: unchecked access to optional value
>>>>> >
>>>>> > }
>>>>> >
>>>>> >
>>>>> > If we implemented this check in CSA, it would likely find the bug in
>>>>> > TestWithWarning, and would not report anything in TestOK. However,
>>>>> CSA
>>>>> > did not prove the desired property (all optional unwraps are checked
>>>>> > on all paths) for TestOK; CSA just could not find a path through
>>>>> > TestOKwhere this property is violated.
>>>>> >
>>>>> >
>>>>> > A simple function like TestOKonly has two paths and would be
>>>>> > completely explored by the CSA. However, when loops are involved,
>>>>> CSA
>>>>> > uses heuristics to decide when to stop exploring paths in a
>>>>> function.
>>>>> > As a result, CSA would miss this bug:
>>>>> >
>>>>> >
>>>>> > voidMissedBugByCSA(std::optional<int>x){
>>>>> >
>>>>> > for(inti =0;i <10;i++){
>>>>> >
>>>>> > if(i <5){
>>>>> >
>>>>> > use(i);
>>>>> >
>>>>> > }else{
>>>>> >
>>>>> > use(x.value());// CSA does not report a warning
>>>>> >
>>>>> > }
>>>>> >
>>>>> > }
>>>>> >
>>>>> > }
>>>>> >
>>>>> >
>>>>> > If we implemented this check with AST matchers (as typically done in
>>>>> > ClangTidy), we would have to enumerate way too many syntactic
>>>>> > patterns. Consider just a couple coding variations that users expect
>>>>> > this analysis to understand to be safe:
>>>>> >
>>>>> >
>>>>> > voidTestOK(std::optional<int>x){
>>>>> >
>>>>> > if(x){
>>>>> >
>>>>> > use(x.value());
>>>>> >
>>>>> > }
>>>>> >
>>>>> > }
>>>>> >
>>>>> >
>>>>> > voidTestOK(std::optional<int>x){
>>>>> >
>>>>> > if(x.has_value()==true){
>>>>> >
>>>>> > use(x.value());
>>>>> >
>>>>> > }
>>>>> >
>>>>> > }
>>>>> >
>>>>> >
>>>>> > voidTestOK(std::optional<int>x){
>>>>> >
>>>>> > if(!x.has_value()){
>>>>> >
>>>>> > use(0);
>>>>> >
>>>>> > }else{
>>>>> >
>>>>> > use(x.value());
>>>>> >
>>>>> > }
>>>>> >
>>>>> > }
>>>>> >
>>>>> >
>>>>> > voidTestOK(std::optional<int>x){
>>>>> >
>>>>> > if(!x.has_value())
>>>>> >
>>>>> > return;
>>>>> >
>>>>> > use(x.value());
>>>>> >
>>>>> > }
>>>>> >
>>>>> >
>>>>> > To summarize, CSA can find someoptional-unwrapping bugs, but it does
>>>>> > not prove that the code performs a check on allpaths. AST matchers
>>>>> can
>>>>> > prove it for a finite set of hard-coded syntactic cases. But, when
>>>>> the
>>>>> > set of safe patterns is infinite, as is often the case, AST matchers
>>>>> > can only capture a subset. Capturing the full set requires building
>>>>> a
>>>>> > framework like the one presented in this document.
>>>>> >
>>>>> >
>>>>> >     Comparison
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > AST Matchers Library
>>>>> >
>>>>> >
>>>>> >
>>>>> > Clang Static Analyzer
>>>>> >
>>>>> >
>>>>> >
>>>>> > Dataflow Analysis Framework
>>>>> >
>>>>> > Proves that a property holds on all execution paths
>>>>> >
>>>>> >
>>>>> >
>>>>> > Possible, depends on the matcher
>>>>> >
>>>>> >
>>>>> >
>>>>> > No
>>>>> >
>>>>> >
>>>>> >
>>>>> > Yes
>>>>> >
>>>>> > Checks can be written in terms of program semantics
>>>>> >
>>>>> >
>>>>> >
>>>>> > No
>>>>> >
>>>>> >
>>>>> >
>>>>> > Yes
>>>>> >
>>>>> >
>>>>> >
>>>>> > Yes
>>>>> >
>>>>> >
>>>>> >     Further Examples
>>>>> >
>>>>> > While we have started with a std::optionalchecker, we have a number
>>>>> of
>>>>> > further use cases where this framework can help. Note that some of
>>>>> > these are already implemented in complex, specialized checks, either
>>>>> > directly in the compiler or as clang-tidy checks. A key goal of our
>>>>> > framework is to simplify the implementation of such checks in the
>>>>> future.
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     “the pointer may be null when dereferenced”, “the pointer is
>>>>> >     always null when dereferenced”,
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     "unnecessary null check on a pointer passed to delete/free",
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     “this raw pointer variable could be refactored into a
>>>>> >     std::unique_ptr”,
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     “expression always evaluates to a constant”, “if condition always
>>>>> >     evaluates to true/false”,
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     “basic block is never executed”, “loop is executed at most once”,
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     “assigned value is always overwritten”,
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     “value was used after having been moved”,
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     “value guarded by a lock was used without taking the lock”,
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     “unnecessary existence check before inserting a value into a
>>>>> >     std::map”.
>>>>> >
>>>>> >
>>>>> > Some of these use cases are covered by existing ClangTidy checks
>>>>> that
>>>>> > are implemented with pattern matching. We believe there is an
>>>>> > opportunity to simplify their implementation by refactoring them to
>>>>> > use the dataflow framework, and simultaneously improve the coverage
>>>>> > and precision of how they model C++ semantics.
>>>>> >
>>>>> >
>>>>> >   Overview
>>>>> >
>>>>> > The initial version of our framework provides the following key
>>>>> features:
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     a forward dataflow algorithm, parameterized by a user-defined
>>>>> >     value domain and a transfer function, which describes how program
>>>>> >     statements modify domain values.
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     a built-in model of local variables and the heap. This model
>>>>> >     provides two benefits to users: it tracks the flow of values,
>>>>> >     accounting for a wide variety of C++ language constructs, in a
>>>>> way
>>>>> >     that is reusable across many different analyses; it discriminates
>>>>> >     value approximations based on path conditions, allowing for
>>>>> higher
>>>>> >     precision than dataflow alone.
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     an interface to an SMT solver to verify safety of operations like
>>>>> >     dereferencing a pointer,
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     test infrastructure for validating results of the dataflow
>>>>> algorithm.
>>>>> >
>>>>> >
>>>>> >     Dataflow Algorithm
>>>>> >
>>>>> > Our framework solves forward dataflow equations over a product of a
>>>>> > user-defined value domain and a built-in model of local variables
>>>>> and
>>>>> > the heap, called the environment. The user implements the transfer
>>>>> > function for their value domain, while the framework implements the
>>>>> > transfer function that operates on the environment, only requiring a
>>>>> > user-defined widening operator to avoid infinite growth of the
>>>>> > environment model. We intend to expand the framework to also support
>>>>> > backwards dataflow equations at some point in the future. The
>>>>> > limitation to the forward direction is not essential.
>>>>> >
>>>>> >
>>>>> > In more detail, the algorithm is parameterized by:
>>>>> >
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     a user-defined, finite-height partial order, with a join function
>>>>> >     and maximal element,
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     an initial element from the user-defined lattice,
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     a (monotonic) transfer function for statements, which maps
>>>>> between
>>>>> >     lattice elements,
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     comparison and merge functions for environment values.
>>>>> >
>>>>> >
>>>>> > We expand on the first three elements here and return to the
>>>>> > comparison and merge functions in the next section.
>>>>> >
>>>>> >
>>>>> >       The Lattice and Analysis Abstractions
>>>>> >
>>>>> > We start with a representation of a lattice:
>>>>> >
>>>>> > classBoundedJoinSemiLattice{
>>>>> >
>>>>> > public:
>>>>> >
>>>>> > friendbooloperator==(constBoundedJoinSemiLattice&lhs,
>>>>> >
>>>>> > constBoundedJoinSemiLattice&rhs);
>>>>> >
>>>>> > friendbooloperator<=(constBoundedJoinSemiLattice&lhs,
>>>>> >
>>>>> > constBoundedJoinSemiLattice&rhs);
>>>>> >
>>>>> > staticBoundedJoinSemiLatticeTop();
>>>>> >
>>>>> > enumclassJoinEffect{
>>>>> >
>>>>> > Changed,
>>>>> >
>>>>> > Unchanged,
>>>>> >
>>>>> > };
>>>>> >
>>>>> > JoinEffectJoin(constBoundedJoinSemiLattice&element);
>>>>> >
>>>>> > };
>>>>> >
>>>>> > Not all of these declarations are necessary. Specifically,
>>>>> > operator==and operator<=can be implemented in terms of Joinand
>>>>> > Topisn’t used by the dataflow algorithm. We include the operators
>>>>> > because direct implementations are likely more efficient, and
>>>>> > Topbecause it provides evidence of a well-founded bounded join
>>>>> > semi-lattice.
>>>>> >
>>>>> >
>>>>> > The Joinoperation is unusual in its choice of mutating the object
>>>>> > instance and returning an indication of whether the join resulted in
>>>>> > any changes. A key step of the fixpoint algorithm is code like j =
>>>>> > Join(x, y); if (j == x) …. This formulation of Joinallows us to
>>>>> > express this code instead as if (x.Join(y) == Unchanged).... For
>>>>> > lattices whose instances are expensive to create or compare, we
>>>>> expect
>>>>> > this formulation to improve the performance of the dataflow analysis.
>>>>> >
>>>>> >
>>>>> > Based on the lattice abstraction, our analysis is similarly
>>>>> traditional:
>>>>> >
>>>>> > classDataflowAnalysis{
>>>>> >
>>>>> > public:
>>>>> >
>>>>> > usingBoundedJoinSemiLattice=…;
>>>>> >
>>>>> > BoundedJoinSemiLatticeInitialElement();
>>>>> >
>>>>> > BoundedJoinSemiLatticeTransferStmt(
>>>>> >
>>>>> > constclang::Stmt*stmt,constBoundedJoinSemiLattice&element,
>>>>> >
>>>>> >       clang::dataflow::Environment&environment);
>>>>> >
>>>>> > };
>>>>> >
>>>>> >
>>>>> >     Environment: A Built-in Lattice
>>>>> >
>>>>> > All user-defined operations have access to an environment, which
>>>>> > encapsulates the program context of whatever program element is
>>>>> being
>>>>> > considered. It contains:
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     a path condition,
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     a storage model (which models both local variables and the heap).
>>>>> >
>>>>> >
>>>>> > The built-in transfer functions model the effects of C++ language
>>>>> > constructs (for example, variable declarations, assignments, pointer
>>>>> > dereferences, arithmetic etc.) by manipulating the environment,
>>>>> saving
>>>>> > the user from doing so in custom transfer functions.
>>>>> >
>>>>> >
>>>>> > The path condition accumulates the set of boolean conditions that
>>>>> are
>>>>> > known to be true on every path from function entry to the current
>>>>> > program point.
>>>>> >
>>>>> >
>>>>> > The environment maintains the maps from program declarations and
>>>>> > pointer values to storage locations. It also maintains the map of
>>>>> > storage locations to abstract values. Storage locations can be
>>>>> atomic
>>>>> > (“scalar”) or aggregate multiple sublocations:
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     ScalarStorageLocationis a storage location that is not subdivided
>>>>> >     further for the purposes of abstract interpretation, for example
>>>>> >     bool, int*.
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     AggregateStorageLocationis a storage location which can be
>>>>> >     subdivided into smaller storage locations that can be
>>>>> >     independently tracked by abstract interpretation. For example, a
>>>>> >     struct with public members.
>>>>> >
>>>>> >
>>>>> > In addition to this storage structure, our model tracks three kinds
>>>>> of
>>>>> > values: basic values (like integers and booleans), pointers (which
>>>>> > reify storage location into the value space) and records with named
>>>>> > fields, where each field is itself a value. For aggregate values, we
>>>>> > ensure coherence between the structure represented in storage
>>>>> > locations and that represented in the values: for a given aggregate
>>>>> > storage location Laggwith child f:
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     valueAt(Lagg) is a record, and
>>>>> >
>>>>> >  *
>>>>> >
>>>>> >     valueAt(child(Lagg, f)) = child(valueAt(Lagg), f),
>>>>> >
>>>>> > where valueAtmaps storage locations to values.
>>>>> >
>>>>> >
>>>>> > In our first iteration, the only basic values that we will track are
>>>>> > boolean values. Additional values and variables may be relevant, but
>>>>> > only in so far as they constitute part of a boolean expression that
>>>>> we
>>>>> > care about. For example, if (x > 5) { … }will result in our tracking
>>>>> x
>>>>> > > 5, but not any particular knowledge about the integer-valued
>>>>> variable x.
>>>>> >
>>>>> >
>>>>> > The path conditions and the model refer to the same sets of values
>>>>> and
>>>>> > locations.
>>>>> >
>>>>> >
>>>>> >       Accounting for infinite height
>>>>> >
>>>>> > Despite the simplicity of our storage model, its elements can grow
>>>>> > infinitely large: for example, path conditions on loop back edges,
>>>>> > records when modeling linked lists, and even just the number of
>>>>> > locations for a simple for-loop.  Therefore, we provide the user
>>>>> with
>>>>> > two opportunities to interpret the abstract values so as to bound
>>>>> it:
>>>>> > comparison of abstract values and a merge operation at control-flow
>>>>> joins.
>>>>> >
>>>>> >
>>>>> > // Compares values from two different environments for semantic
>>>>> > equivalence.
>>>>> >
>>>>> > boolCompare(
>>>>> >
>>>>> >
>>>>>       clang::dataflow::Value*value1,clang::dataflow::Environment&env1,
>>>>> >
>>>>> >
>>>>>       clang::dataflow::Value*value2,clang::dataflow::Environment&env2);
>>>>> >
>>>>> >
>>>>> > // Given `value1` (w.r.t. `env1`) and `value2` (w.r.t. `env2`),
>>>>> returns a
>>>>> >
>>>>> > // `Value` (w.r.t. `merged_env`) that approximates `value1` and
>>>>> > `value2`. This
>>>>> >
>>>>> > // could be a strict lattice join, or a more general widening
>>>>> operation.
>>>>> >
>>>>> >   clang::dataflow::Value*Merge(
>>>>> >
>>>>> >
>>>>>       clang::dataflow::Value*value1,clang::dataflow::Environment&env1,
>>>>> >
>>>>> >
>>>>>       clang::dataflow::Value*value2,clang::dataflow::Environment&env2,
>>>>> >
>>>>> >       clang::dataflow::Environment&merged_env);
>>>>> >
>>>>> >
>>>>> >     Verification by SAT Solver
>>>>> >
>>>>> > The path conditions are critical in refining our approximation of
>>>>> the
>>>>> > local variables and heap. At an abstract level, they don’t need to
>>>>> > directly influence the execution of the dataflow analysis. Instead,
>>>>> we
>>>>> > can imagine that, once concluded, our results are annotated
>>>>> > with/incorporate path conditions, which in turn allows us to make
>>>>> more
>>>>> > precise conclusions about our program from the analysis results.
>>>>> >
>>>>> >
>>>>> > For example, if the analysis concludes that a variable conditionally
>>>>> > holds a non-null pointer, and, for a given program point, the path
>>>>> > condition implies said condition, we can conclude that a dereference
>>>>> > of the pointer at that point is safe. In this sense, the
>>>>> verification
>>>>> > of the safety of a deference is separate from the analysis, which
>>>>> > models memory states with formulae. That said, invocation of the
>>>>> > solver need not wait until after the analysis has concluded. It can
>>>>> > also be used during the analysis, for example, in defining the
>>>>> > widening operation in Merge.
>>>>> >
>>>>> >
>>>>> > With that, we can see the role of an SAT solver: to solve questions
>>>>> of
>>>>> > the form “Does the path condition satisfy the pre-condition computed
>>>>> > for this pointer variable at this program point”? To that end, we
>>>>> > include an API for asking such questions of a SAT solver. The API is
>>>>> > compatible with Z3 and we also include our own (simple) SAT solver,
>>>>> > for users who may not want to integrate with Z3.
>>>>> >
>>>>> >
>>>>> >     Testing dataflow analysis
>>>>> >
>>>>> > We provide test infrastructure that allows users to easily write
>>>>> tests
>>>>> > that make assertions about a program state computed by dataflow
>>>>> > analysis at any code point. Code is annotated with labels on points
>>>>> of
>>>>> > interest and then test expectations are written with respect to the
>>>>> > labeled points.
>>>>> >
>>>>> >
>>>>> > For example, consider an analysis that computes the state of
>>>>> > optionalvalues.
>>>>> >
>>>>> > ExpectDataflowAnalysis<OptionalChecker>(R"(
>>>>> >
>>>>> > void f(std::optional<int> opt) {
>>>>> >
>>>>> >   if (opt.has_value()) {
>>>>> >
>>>>> >     // [[true-branch]]
>>>>> >
>>>>> >   } else {
>>>>> >
>>>>> >     // [[false-branch]]
>>>>> >
>>>>> >   }
>>>>> >
>>>>> > })",[](constauto&results){
>>>>> >
>>>>> >
>>>>>   EXPECT_LE(results.ValueAtCodePoint("true-branch"),OptionalLattice::Engaged());
>>>>> >
>>>>> >
>>>>>   EXPECT_LE(results.ValueAtCodePoint("false-branch"),OptionalLattice::NullOpt());
>>>>> >
>>>>> > });
>>>>> >
>>>>> >
>>>>> >   Timeline
>>>>> >
>>>>> > We have a fully working implementation of the framework as proposed
>>>>> > here. We also are interested in contributing a clang-tidy checker
>>>>> for
>>>>> > std::optionaland related types as soon as possible, given its
>>>>> > demonstrated efficacy and interest from third parties.  To that end,
>>>>> > as soon as we have consensus for starting this work in the clang
>>>>> > repository, we plan to start sending patches for review. Given that
>>>>> > the framework is relatively small, we expect it will fit within 5
>>>>> > (largish) patches, meaning we can hope to have it available by
>>>>> > December 2021.
>>>>> >
>>>>> >
>>>>> >   Future Work
>>>>> >
>>>>> > We consider our first version of the framework as experimental. It
>>>>> > suits the needs of a number of particular checks that we’ve been
>>>>> > writing, but it has some obvious limitations with respect to general
>>>>> > use.  After we commit our first version, here are some planned areas
>>>>> > of improvement and investigation.
>>>>> >
>>>>> >
>>>>> > Backwards analyses.The framework only solves forward dataflow
>>>>> > equations. This clearly prevents implementing some very familiar and
>>>>> > useful analyses, like variable liveness. We intend to expand the
>>>>> > framework to also support backwards dataflow equations. There’s
>>>>> > nothing inherent to our framework that limits it to forwards.
>>>>> >
>>>>> >
>>>>> > Reasoning about non-boolean values.Currently, we can symbolically
>>>>> > model equality of booleans, but nothing else. Yet, for locations on
>>>>> > the store that haven’t changed, we can understand which values are
>>>>> > identical (because the input value object is passed by the transfer
>>>>> > function to the output). For example,
>>>>> >
>>>>> > voidf(bool*a,bool*b){
>>>>> >
>>>>> >     std::optional<int>opt;
>>>>> >
>>>>> > if(a ==b){opt =42;}
>>>>> >
>>>>> > if(a ==b){opt.value();}
>>>>> >
>>>>> > }
>>>>> >
>>>>> > Since neither anor bhave changed, we can reason that the two
>>>>> > occurrences of the expression a == bevaluate to the same value (even
>>>>> > though we don't know the exact value). We intend to improve the
>>>>> > framework to support this increased precision, given that it will
>>>>> not
>>>>> > cost us any additional complexity.
>>>>> >
>>>>> >
>>>>> > Improved model of program state.A major obstacle to writing
>>>>> > high-quality checks for C++ code is the complexity of C++’s
>>>>> > semantics.  In writing our framework, we’ve come to see the value of
>>>>> > shielding the user from the need to handle these directly by
>>>>> modeling
>>>>> > program state and letting the user express their check in terms of
>>>>> > that model, after it has been solved for each program point. As is,
>>>>> > our model currently supports a number of checks, including the
>>>>> > std::optionalcheck and another that infers which pointer parameters
>>>>> > are out parameters. We intend to expand the set of such checks, with
>>>>> > improvements to the model and the API for expressing domain-specific
>>>>> > semantics relevant to a check. Additionally, we are considering
>>>>> > explicitly tracking other abstract program properties, like reads
>>>>> and
>>>>> > writes to local variables and memory, so that users need not account
>>>>> > for all the syntactic variants of these operations.
>>>>> >
>>>>> >
>>>>> > Refactoring existing checks.Once the framework is more mature and
>>>>> > stable, we would like to refactor existing checks that use custom,
>>>>> > ad-hoc implementations of dataflow to use this framework.
>>>>> >
>>>>> >
>>>>> >   Conclusion
>>>>> >
>>>>> > We have proposed an addition to the Clang repository of a new static
>>>>> > analysis framework targeting all-pathsanalyses. The intended clients
>>>>> > of such analyses are code-checking and transformation tools like
>>>>> > clang-tidy, where sound analyses can support sound transformations
>>>>> of
>>>>> > code. We discussed the key features and APIs of our framework and
>>>>> how
>>>>> > they fit together into a new analysis framework. Moreover, our
>>>>> > proposal is based on experience with a working implementation that
>>>>> > supports a number of practical, high-precision analyses, including
>>>>> one
>>>>> > for safe use of types like std::optionaland another that finds
>>>>> missed
>>>>> > opportunities for use of std::move.
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20211014/b44f347f/attachment-0001.html>