<div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Gabor, Artem,</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Thanks for your support, and, Artem, thank you for quoting as plain text -- you're right, the system shunted it to the moderator because of its size.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">To answer some of the questions/comments raised:</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">> <span style="font-family:Arial,Helvetica,sans-serif">There is a nice summary of some of the ad-hoc solutions we have in Clang in this survey: </span><a href="https://lists.llvm.org/pipermail/cfe-dev/2020-October/066937.html" target="_blank" style="font-family:Arial,Helvetica,sans-serif">[cfe-dev] A survey of dataflow analyses in Clang (llvm.org)</a></div><div><span class="gmail_default" style="font-family:arial,helvetica,sans-serif">> </span>Do you plan to replace the ad-hoc solutions in the long run?<span class="gmail_default" style="font-family:arial,helvetica,sans-serif"></span></div><div><span class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></span></div><div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">We mention this briefly in the RFC as motivation, but to reiterate here: we'd like the framework to at least be *suitable* to replace these ad-hoc solutions. Whether we would get to this ourselves or encourage others, review patches, etc. I can't say at this point. 
I suspect we'll refactor at least a few ourselves just to test out the robustness of the framework.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><span style="font-family:Arial,Helvetica,sans-serif">Just to reiterate, you want to abstract away both from the dataflow algorithm and the semantics of C++. E.g., the user might not need to handle the myriad of ways someone can assign a value to a variable (overloaded operator=, output argument, primitive assignment, etc), only a single abstract concept of assignment.</span></blockquote><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif"><br></span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif">Exactly! Our first version already does this to some extent, we just feel it's somewhat tailored to our current needs and could use some generalization. For example, it would be great to know when an lvalue is read for the last time and written to another lvalue, because that would suggest a potential `std::move`, among other things. Yet, that balance of detail and abstraction is not yet available. We only model the *effects* of reads and writes without explicitly abstracting them (say, tagging each CFG node w/ the set of reads and writes). 
So, we think there's room to explore the design space here, but also think what we have will be valuable to a number of analyses right away.</span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif"><br></span></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><span style="font-family:Arial,Helvetica,sans-serif">While most of the checks in CSA are using symbolic execution, there are also some syntactic rules. One dilemma of implementing a new check is where to put it (CSA? Tidy? Warnings?). While it would be great to get rid of this dilemma, I think we want to keep all of the options open. I just wanted to make it explicit what kinds of layering constraints we have. It should be possible to add CSA checks/features using the new dataflow framework.</span></blockquote><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif"><br></span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif">Great point! We hadn't thought about this possibility, but there's no good reason to rule out use in CSA. Our key layering constraint is that it be available in clang-tidy. So, if we put it somewhere central, like clang/include/clang/Analysis, would that be suitable for use in CSA? 
We're certainly open to other suggestions as well.</span></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><span style="font-family:Arial,Helvetica,sans-serif">Valeriy's -Wcompletion-handler being<br></span>one of the notable latest additions - a backwards analysis over finite<br>state lattice with a sophisticated system of notes attached to the<br>warning to explain what's happening (generalizing over the latter may<br>possibly be a goal for your framework as well).</blockquote><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Yes, we're definitely interested in being able to produce warnings that give the user a clear idea of what is wrong and (if possible) how to fix it. This is complicated once we involve the SAT solver, since it tends to give us simple yes/no answers. But, actionable warnings seem crucial for successful user adoption. 
</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote"><span style="font-family:Arial,Helvetica,sans-serif">I'm also very curious about your SAT solver implementation; I have a<br></span><span style="font-family:Arial,Helvetica,sans-serif">recent hobby of plugging various solvers into the static analyzer (cf.<br></span><a href="https://reviews.llvm.org/D110125" rel="noreferrer" target="_blank" style="font-family:Arial,Helvetica,sans-serif">https://reviews.llvm.org/D110125</a><span style="font-family:Arial,Helvetica,sans-serif">) because it obviously needs some -<br></span><span style="font-family:Arial,Helvetica,sans-serif">arguably as much as your endeavor.</span></blockquote><br></div><div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Yes, we'd love to tell you more, but I'll have to defer to sgatev@, since he's the author and most familiar with the details. </div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Cheers,</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Yitzhak</div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Oct 12, 2021 at 5:34 PM Artem Dergachev <<a href="mailto:noqnoqneo@gmail.com">noqnoqneo@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Sounds like the mail didn't hit the mailing list due to size <br>
restriction. I'm quoting in plain text now to make sure it definitely <br>
hits the list in some form.<br>
<br>
Like Gabor already said, it's amazing that this is happening. I <br>
completely agree with your goals and I think we've discussed a lot of <br>
this offline earlier. The amount of existing flow-sensitive warnings may <br>
be low but it's steadily growing, Valeriy's -Wcompletion-handler being <br>
one of the notable latest additions - a backwards analysis over finite <br>
state lattice with a sophisticated system of notes attached to the <br>
warning to explain what's happening (generalizing over the latter may <br>
possibly be a goal for your framework as well).<br>
<br>
Aside from the ability to cover *the other 50% of warnings* compared to <br>
what the static analyzer already covers (may-problems vs. must-problems <br>
aka all-paths problems), another strong benefit of flow-sensitive <br>
warnings is their performance, which allows us to use them as compiler <br>
warnings (as opposed to separate tools like clang-tidy or static <br>
analyzer). This in turn delivers the warning to every user of the <br>
compiler which increases the user coverage dramatically. This <br>
consideration may even apply to may-problems in which the static <br>
analyzer already excels technically.<br>
<br>
I'm also very curious about your SAT solver implementation; I have a <br>
recent hobby of plugging various solvers into the static analyzer (cf. <br>
<a href="https://reviews.llvm.org/D110125" rel="noreferrer" target="_blank">https://reviews.llvm.org/D110125</a>) because it obviously needs some - <br>
arguably as much as your endeavor.<br>
<br>
<br>
On 10/12/21 8:32 AM, Yitzhak Mandelbaum wrote:<br>
> Summary<br>
><br>
> We propose a new static analysis framework for implementing Clang <br>
> bug-finding and refactoring tools (including ClangTidy checks) based <br>
> on a dataflow algorithm.<br>
><br>
><br>
> Some ClangTidy checks and core compiler warnings are already <br>
> dataflow-based. However, due to a lack of a reusable framework, all <br>
> reimplement common dataflow-analysis and CFG-traversal algorithms. In <br>
> addition to supporting new checks, the framework should allow us to <br>
> clean up those parts of Clang and ClangTidy.<br>
><br>
><br>
> We intend that the new framework should have a learning curve similar <br>
> to AST matchers and not require prior expertise in static analysis. <br>
> Our goal is a simple API that isolates the user from the dataflow <br>
> algorithm itself. The initial version of the framework combines <br>
> support for classic dataflow analyses over the Clang CFG with symbolic <br>
> evaluation of the code and a simple model of local variables and the <br>
> heap. A number of useful checks can be expressed solely in terms of <br>
> this model, saving their implementers from having to reimplement <br>
> modeling of the semantics of C++ AST nodes. Ultimately, we want to <br>
> extend this convenience to a wide variety of checks, so that they can <br>
> be written in terms of the abstract semantics of the program (as <br>
> modeled by the framework), while only matching and modeling AST nodes <br>
> for key parts of their domain (e.g., custom data types or functions).<br>
><br>
><br>
> Motivation<br>
><br>
> Clang has a few subsystems that greatly help with implementing <br>
> bug-finding and refactoring tools:<br>
><br>
> *<br>
><br>
> AST matchers allow one to easily find program fragments whose AST<br>
> matches a given pattern. This works well for finding program<br>
> fragments whose desired semantics translates to a small number of<br>
> possible syntactic expressions (e.g., "find a call to<br>
> strlen() where the argument is a null pointer constant"). Note that<br>
> patterns expressed by AST matchers often check program properties<br>
> that hold for all execution paths.<br>
><br>
> *<br>
><br>
> Clang Static Analyzer can find program paths that end in a state<br>
> that satisfies a specific user-defined property, regardless of how<br>
> the code was expressed syntactically (e.g., "find a call to<br>
> strlen() where the argument is dynamically equal to null"). CSA<br>
> abstracts from the syntax of the code, letting analyses focus on<br>
> its semantics. While not completely eliding syntax, symbolic<br>
> evaluation abstracts over many syntactic details. Semantically<br>
> equivalent program fragments result in equivalent program states<br>
> computed by the CSA.<br>
><br>
><br>
> When implementing certain bug-finding and refactoring tools, it is <br>
> often necessary to find program fragments where a given property holds <br>
> on all execution paths, while placing as few demands as possible on <br>
> how the user writes their code. The proposed dataflow analysis <br>
> framework aims to satisfy both of these needs simultaneously.<br>
><br>
><br>
> Example: Ensuring that std::optional is used safely<br>
><br>
> Consider implementing a static analysis tool that proves that every <br>
> time a std::optional is unwrapped, the program has ensured that the <br>
> precondition of the unwrap operation (the optional has a value) is <br>
> satisfied.<br>
><br>
><br>
> void TestOK(std::optional<int> x) {<br>
>   if (x.has_value()) {<br>
>     use(x.value());<br>
>   }<br>
> }<br>
><br>
> void TestWithWarning(std::optional<int> x) {<br>
>   use(x.value());  // warning: unchecked access to optional value<br>
> }<br>
><br>
><br>
> If we implemented this check in CSA, it would likely find the bug in <br>
> TestWithWarning, and would not report anything in TestOK. However, CSA <br>
> did not prove the desired property (all optional unwraps are checked <br>
> on all paths) for TestOK; CSA just could not find a path through <br>
> TestOK where this property is violated.<br>
><br>
><br>
> A simple function like TestOK only has two paths and would be <br>
> completely explored by the CSA. However, when loops are involved, CSA <br>
> uses heuristics to decide when to stop exploring paths in a function. <br>
> As a result, CSA would miss this bug:<br>
><br>
><br>
> void MissedBugByCSA(std::optional<int> x) {<br>
>   for (int i = 0; i < 10; i++) {<br>
>     if (i < 5) {<br>
>       use(i);<br>
>     } else {<br>
>       use(x.value());  // CSA does not report a warning<br>
>     }<br>
>   }<br>
> }<br>
><br>
><br>
> If we implemented this check with AST matchers (as typically done in <br>
> ClangTidy), we would have to enumerate way too many syntactic <br>
> patterns. Consider just a couple coding variations that users expect <br>
> this analysis to understand to be safe:<br>
><br>
><br>
> void TestOK(std::optional<int> x) {<br>
>   if (x) {<br>
>     use(x.value());<br>
>   }<br>
> }<br>
><br>
> void TestOK(std::optional<int> x) {<br>
>   if (x.has_value() == true) {<br>
>     use(x.value());<br>
>   }<br>
> }<br>
><br>
> void TestOK(std::optional<int> x) {<br>
>   if (!x.has_value()) {<br>
>     use(0);<br>
>   } else {<br>
>     use(x.value());<br>
>   }<br>
> }<br>
><br>
> void TestOK(std::optional<int> x) {<br>
>   if (!x.has_value())<br>
>     return;<br>
>   use(x.value());<br>
> }<br>
><br>
><br>
> To summarize, CSA can find some optional-unwrapping bugs, but it does <br>
> not prove that the code performs a check on all paths. AST matchers can <br>
> prove it for a finite set of hard-coded syntactic cases. But, when the <br>
> set of safe patterns is infinite, as is often the case, AST matchers <br>
> can only capture a subset. Capturing the full set requires building a <br>
> framework like the one presented in this document.<br>
><br>
><br>
> Comparison<br>
><br>
><br>
> | | AST Matchers Library | Clang Static Analyzer | Dataflow Analysis Framework |<br>
> | Proves that a property holds on all execution paths | Possible, depends on the matcher | No | Yes |<br>
> | Checks can be written in terms of program semantics | No | Yes | Yes |<br>
><br>
><br>
> Further Examples<br>
><br>
> While we have started with a std::optional checker, we have a number of <br>
> further use cases where this framework can help. Note that some of <br>
> these are already implemented in complex, specialized checks, either <br>
> directly in the compiler or as clang-tidy checks. A key goal of our <br>
> framework is to simplify the implementation of such checks in the future.<br>
><br>
> *<br>
><br>
> “the pointer may be null when dereferenced”, “the pointer is<br>
> always null when dereferenced”,<br>
><br>
> *<br>
><br>
> "unnecessary null check on a pointer passed to delete/free",<br>
><br>
> *<br>
><br>
> “this raw pointer variable could be refactored into a<br>
> std::unique_ptr”,<br>
><br>
> *<br>
><br>
> “expression always evaluates to a constant”, “if condition always<br>
> evaluates to true/false”,<br>
><br>
> *<br>
><br>
> “basic block is never executed”, “loop is executed at most once”,<br>
><br>
> *<br>
><br>
> “assigned value is always overwritten”,<br>
><br>
> *<br>
><br>
> “value was used after having been moved”,<br>
><br>
> *<br>
><br>
> “value guarded by a lock was used without taking the lock”,<br>
><br>
> *<br>
><br>
> “unnecessary existence check before inserting a value into a<br>
> std::map”.<br>
><br>
><br>
> Some of these use cases are covered by existing ClangTidy checks that <br>
> are implemented with pattern matching. We believe there is an <br>
> opportunity to simplify their implementation by refactoring them to <br>
> use the dataflow framework, and simultaneously improve the coverage <br>
> and precision of how they model C++ semantics.<br>
><br>
><br>
> Overview<br>
><br>
> The initial version of our framework provides the following key features:<br>
><br>
> *<br>
><br>
> a forward dataflow algorithm, parameterized by a user-defined<br>
> value domain and a transfer function, which describes how program<br>
> statements modify domain values.<br>
><br>
> *<br>
><br>
> a built-in model of local variables and the heap. This model<br>
> provides two benefits to users: it tracks the flow of values,<br>
> accounting for a wide variety of C++ language constructs, in a way<br>
> that is reusable across many different analyses; it discriminates<br>
> value approximations based on path conditions, allowing for higher<br>
> precision than dataflow alone.<br>
><br>
> *<br>
><br>
> an interface to an SMT solver to verify safety of operations like<br>
> dereferencing a pointer,<br>
><br>
> *<br>
><br>
> test infrastructure for validating results of the dataflow algorithm.<br>
><br>
><br>
> Dataflow Algorithm<br>
><br>
> Our framework solves forward dataflow equations over a product of a <br>
> user-defined value domain and a built-in model of local variables and <br>
> the heap, called the environment. The user implements the transfer <br>
> function for their value domain, while the framework implements the <br>
> transfer function that operates on the environment, only requiring a <br>
> user-defined widening operator to avoid infinite growth of the <br>
> environment model. We intend to expand the framework to also support <br>
> backwards dataflow equations at some point in the future. The <br>
> limitation to the forward direction is not essential.<br>
><br>
><br>
> In more detail, the algorithm is parameterized by:<br>
><br>
><br>
> *<br>
><br>
> a user-defined, finite-height partial order, with a join function<br>
> and maximal element,<br>
><br>
> *<br>
><br>
> an initial element from the user-defined lattice,<br>
><br>
> *<br>
><br>
> a (monotonic) transfer function for statements, which maps between<br>
> lattice elements,<br>
><br>
> *<br>
><br>
> comparison and merge functions for environment values.<br>
><br>
><br>
> We expand on the first three elements here and return to the <br>
> comparison and merge functions in the next section.<br>
><br>
><br>
> The Lattice and Analysis Abstractions<br>
><br>
> We start with a representation of a lattice:<br>
><br>
> class BoundedJoinSemiLattice {<br>
>  public:<br>
>   friend bool operator==(const BoundedJoinSemiLattice& lhs,<br>
>                          const BoundedJoinSemiLattice& rhs);<br>
>   friend bool operator<=(const BoundedJoinSemiLattice& lhs,<br>
>                          const BoundedJoinSemiLattice& rhs);<br>
>   static BoundedJoinSemiLattice Top();<br>
>   enum class JoinEffect {<br>
>     Changed,<br>
>     Unchanged,<br>
>   };<br>
>   JoinEffect Join(const BoundedJoinSemiLattice& element);<br>
> };<br>
><br>
> Not all of these declarations are necessary. Specifically, <br>
> operator== and operator<= can be implemented in terms of Join, and <br>
> Top isn’t used by the dataflow algorithm. We include the operators <br>
> because direct implementations are likely more efficient, and <br>
> Top because it provides evidence of a well-founded bounded join <br>
> semi-lattice.<br>
><br>
><br>
> The Join operation is unusual in its choice of mutating the object <br>
> instance and returning an indication of whether the join resulted in <br>
> any changes. A key step of the fixpoint algorithm is code like j = <br>
> Join(x, y); if (j == x) …. This formulation of Join allows us to <br>
> express this code instead as if (x.Join(y) == Unchanged) …. For <br>
> lattices whose instances are expensive to create or compare, we expect <br>
> this formulation to improve the performance of the dataflow analysis.<br>
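><br>
> A minimal sketch of a mutating Join, assuming a hypothetical powerset <br>
> lattice over ints drawn from a finite universe (IntSetLattice and its <br>
> std::set representation are illustrative names, not part of the proposal). <br>
> Join is set union and reports whether the receiver grew, so the caller <br>
> never builds and compares a temporary element:<br>
<pre>
```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <utility>

// Hypothetical lattice for illustration only: sets of ints ordered by
// inclusion, with set union as the join.
class IntSetLattice {
 public:
  enum class JoinEffect { Changed, Unchanged };

  IntSetLattice() = default;
  explicit IntSetLattice(std::set<int> elems) : elems_(std::move(elems)) {}

  friend bool operator==(const IntSetLattice& lhs, const IntSetLattice& rhs) {
    return lhs.elems_ == rhs.elems_;
  }

  // lhs <= rhs iff lhs is a subset of rhs.
  friend bool operator<=(const IntSetLattice& lhs, const IntSetLattice& rhs) {
    return std::includes(rhs.elems_.begin(), rhs.elems_.end(),
                         lhs.elems_.begin(), lhs.elems_.end());
  }

  // Joins `other` into this element in place and reports whether this
  // element changed as a result.
  JoinEffect Join(const IntSetLattice& other) {
    const auto old_size = elems_.size();
    elems_.insert(other.elems_.begin(), other.elems_.end());
    return elems_.size() == old_size ? JoinEffect::Unchanged
                                     : JoinEffect::Changed;
  }

 private:
  std::set<int> elems_;
};
```
</pre>
> A fixpoint loop over such a lattice can stop revisiting a block as soon as <br>
> Join returns Unchanged for it.<br>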
><br>
><br>
> Based on the lattice abstraction, our analysis is similarly traditional:<br>
><br>
> class DataflowAnalysis {<br>
>  public:<br>
>   using BoundedJoinSemiLattice = …;<br>
>   BoundedJoinSemiLattice InitialElement();<br>
>   BoundedJoinSemiLattice TransferStmt(<br>
>       const clang::Stmt* stmt, const BoundedJoinSemiLattice& element,<br>
>       clang::dataflow::Environment& environment);<br>
> };<br>
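><br>
> To illustrate how an initial element and a transfer function drive a <br>
> forward analysis, here is a sketch of a classic worklist iteration. It is <br>
> an assumption-laden toy rather than the framework's algorithm: ToyCfg, <br>
> Element, JoinInto and SolveForward are hypothetical names, the CFG is a <br>
> plain adjacency list instead of the Clang CFG, the transfer function is <br>
> per block rather than per statement, and there is no environment.<br>
<pre>
```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <set>
#include <vector>

// Toy stand-in for a CFG: block i lists the indices of its successors.
struct ToyCfg {
  std::vector<std::vector<std::size_t>> successors;
};

// Lattice element: a set of ints; the join is set union.
using Element = std::set<int>;

// Joins `from` into `into`; returns true if `into` changed.
bool JoinInto(Element& into, const Element& from) {
  const std::size_t old_size = into.size();
  into.insert(from.begin(), from.end());
  return into.size() != old_size;
}

// Per-block transfer function: maps a block's input state to its output.
using Transfer = std::function<Element(std::size_t block, const Element& in)>;

// Forward worklist iteration. Returns the input state of every block at
// the fixpoint. Block 0 is assumed to be the entry block.
std::vector<Element> SolveForward(const ToyCfg& cfg, const Element& initial,
                                  const Transfer& transfer) {
  std::vector<Element> in(cfg.successors.size());
  if (in.empty()) return in;
  in[0] = initial;

  // Seed with every block so that each transfer function runs at least once.
  std::deque<std::size_t> worklist;
  for (std::size_t b = 0; b < in.size(); ++b) worklist.push_back(b);

  while (!worklist.empty()) {
    const std::size_t block = worklist.front();
    worklist.pop_front();
    const Element out = transfer(block, in[block]);
    for (std::size_t succ : cfg.successors[block]) {
      // Revisit a successor only when its input state actually grew.
      if (JoinInto(in[succ], out)) worklist.push_back(succ);
    }
  }
  return in;
}
```
</pre>
> Termination relies on the two properties the proposal asks for: a <br>
> finite-height lattice and a monotonic transfer function.<br>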
><br>
><br>
> Environment: A Built-in Lattice<br>
><br>
> All user-defined operations have access to an environment, which <br>
> encapsulates the program context of whatever program element is being <br>
> considered. It contains:<br>
><br>
> *<br>
><br>
> a path condition,<br>
><br>
> *<br>
><br>
> a storage model (which models both local variables and the heap).<br>
><br>
><br>
> The built-in transfer functions model the effects of C++ language <br>
> constructs (for example, variable declarations, assignments, pointer <br>
> dereferences, arithmetic etc.) by manipulating the environment, saving <br>
> the user from doing so in custom transfer functions.<br>
><br>
><br>
> The path condition accumulates the set of boolean conditions that are <br>
> known to be true on every path from function entry to the current <br>
> program point.<br>
><br>
><br>
> The environment maintains the maps from program declarations and <br>
> pointer values to storage locations. It also maintains the map of <br>
> storage locations to abstract values. Storage locations can be atomic <br>
> (“scalar”) or aggregate multiple sublocations:<br>
><br>
> *<br>
><br>
> ScalarStorageLocation is a storage location that is not subdivided<br>
> further for the purposes of abstract interpretation, for example<br>
> bool, int*.<br>
><br>
> *<br>
><br>
> AggregateStorageLocation is a storage location which can be<br>
> subdivided into smaller storage locations that can be<br>
> independently tracked by abstract interpretation. For example, a<br>
> struct with public members.<br>
><br>
><br>
> In addition to this storage structure, our model tracks three kinds of <br>
> values: basic values (like integers and booleans), pointers (which <br>
> reify storage location into the value space) and records with named <br>
> fields, where each field is itself a value. For aggregate values, we <br>
> ensure coherence between the structure represented in storage <br>
> locations and that represented in the values: for a given aggregate <br>
> storage location Lagg with child f:<br>
><br>
> *<br>
><br>
> valueAt(Lagg) is a record, and<br>
><br>
> *<br>
><br>
> valueAt(child(Lagg, f)) = child(valueAt(Lagg), f),<br>
><br>
> where valueAt maps storage locations to values.<br>
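><br>
> The coherence condition can be read as a check over a store. The sketch <br>
> below assumes a hypothetical, much simplified model (Value, RecordValue, <br>
> Location, AggregateLocation, Store and IsCoherent are illustrative names, <br>
> not the framework's API) in which valueAt is a plain map from locations <br>
> to values:<br>
<pre>
```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical value hierarchy, for illustration only.
struct Value {
  virtual ~Value() = default;
};
struct BoolValue : Value {};
struct RecordValue : Value {
  std::map<std::string, const Value*> fields;  // field name -> field value
};

// Hypothetical storage locations: atomic, or an aggregate of sublocations.
struct Location {
  virtual ~Location() = default;
};
struct ScalarLocation : Location {};
struct AggregateLocation : Location {
  std::map<std::string, const Location*> children;  // field name -> sublocation
};

// Plays the role of valueAt: maps storage locations to values.
using Store = std::map<const Location*, const Value*>;

// Returns true iff valueAt(agg) is a record and, for every child f of agg,
// valueAt(child(agg, f)) is the same value as child(valueAt(agg), f).
bool IsCoherent(const Store& store, const AggregateLocation& agg) {
  const auto agg_it = store.find(&agg);
  if (agg_it == store.end()) return false;
  const auto* record = dynamic_cast<const RecordValue*>(agg_it->second);
  if (record == nullptr) return false;
  for (const auto& child : agg.children) {
    const auto loc_it = store.find(child.second);
    const auto field_it = record->fields.find(child.first);
    if (loc_it == store.end() || field_it == record->fields.end()) return false;
    if (loc_it->second != field_it->second) return false;
  }
  return true;
}
```
</pre>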
><br>
><br>
> In our first iteration, the only basic values that we will track are <br>
> boolean values. Additional values and variables may be relevant, but <br>
> only in so far as they constitute part of a boolean expression that we <br>
> care about. For example, if (x > 5) { … } will result in our tracking x <br>
> > 5, but not any particular knowledge about the integer-valued variable x.<br>
><br>
><br>
> The path conditions and the model refer to the same sets of values and <br>
> locations.<br>
><br>
><br>
> Accounting for infinite height<br>
><br>
> Despite the simplicity of our storage model, its elements can grow <br>
> infinitely large: for example, path conditions on loop back edges, <br>
> records when modeling linked lists, and even just the number of <br>
> locations for a simple for-loop. Therefore, we provide the user with <br>
> two opportunities to interpret the abstract values so as to bound them: <br>
> comparison of abstract values and a merge operation at control-flow joins.<br>
><br>
><br>
> // Compares values from two different environments for semantic<br>
> // equivalence.<br>
> bool Compare(<br>
>     clang::dataflow::Value* value1, clang::dataflow::Environment& env1,<br>
>     clang::dataflow::Value* value2, clang::dataflow::Environment& env2);<br>
><br>
> // Given `value1` (w.r.t. `env1`) and `value2` (w.r.t. `env2`), returns a<br>
> // `Value` (w.r.t. `merged_env`) that approximates `value1` and `value2`.<br>
> // This could be a strict lattice join, or a more general widening operation.<br>
> clang::dataflow::Value* Merge(<br>
>     clang::dataflow::Value* value1, clang::dataflow::Environment& env1,<br>
>     clang::dataflow::Value* value2, clang::dataflow::Environment& env2,<br>
>     clang::dataflow::Environment& merged_env);<br>
><br>
><br>
> Verification by SAT Solver<br>
><br>
> The path conditions are critical in refining our approximation of the <br>
> local variables and heap. At an abstract level, they don’t need to <br>
> directly influence the execution of the dataflow analysis. Instead, we <br>
> can imagine that, once concluded, our results are annotated <br>
> with/incorporate path conditions, which in turn allows us to make more <br>
> precise conclusions about our program from the analysis results.<br>
><br>
><br>
> For example, if the analysis concludes that a variable conditionally <br>
> holds a non-null pointer, and, for a given program point, the path <br>
> condition implies said condition, we can conclude that a dereference <br>
> of the pointer at that point is safe. In this sense, the verification <br>
> of the safety of a dereference is separate from the analysis, which <br>
> models memory states with formulae. That said, invocation of the <br>
> solver need not wait until after the analysis has concluded. It can <br>
> also be used during the analysis, for example, in defining the <br>
> widening operation in Merge.<br>
><br>
><br>
> With that, we can see the role of a SAT solver: to answer questions of <br>
> the form “Does the path condition imply the precondition computed <br>
> for this pointer variable at this program point?” To that end, we <br>
> include an API for asking such questions of a SAT solver. The API is <br>
> compatible with Z3 and we also include our own (simple) SAT solver, <br>
> for users who may not want to integrate with Z3.<br>
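><br>
> The question "does the path condition imply the precondition?" is <br>
> equivalent to asking whether (path condition AND NOT precondition) is <br>
> unsatisfiable. The sketch below answers it by brute-force enumeration of <br>
> truth assignments; it is a stand-in to show the shape of the query, not <br>
> the proposal's solver or its API, and Formula, Atom and Implies are <br>
> hypothetical names.<br>
<pre>
```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// A propositional formula over boolean atoms, evaluated on an assignment
// packed into the low bits of a 32-bit word (bit i holds atom i).
using Formula = std::function<bool(std::uint32_t assignment)>;

// Reads atom `index` out of a packed assignment.
inline bool Atom(std::uint32_t assignment, unsigned index) {
  return ((assignment >> index) & 1u) != 0;
}

// Returns true iff `path_condition` implies `precondition` for every
// assignment of `num_atoms` atoms, i.e. iff the conjunction
// (path_condition AND NOT precondition) is unsatisfiable.
// Brute force: the running time is exponential in `num_atoms`.
bool Implies(const Formula& path_condition, const Formula& precondition,
             unsigned num_atoms) {
  assert(num_atoms < 32);
  const std::uint32_t limit = std::uint32_t{1} << num_atoms;
  for (std::uint32_t a = 0; a < limit; ++a) {
    // A satisfying assignment of the conjunction is a counterexample.
    if (path_condition(a) && !precondition(a)) return false;
  }
  return true;
}
```
</pre>
> For instance, with atom 0 standing for "opt has a value" and atom 1 for <br>
> an unrelated flag, a path condition of (atom 0 AND atom 1) implies the <br>
> precondition (atom 0), while a path condition of (atom 1) alone does not.<br>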
><br>
><br>
> Testing dataflow analysis<br>
><br>
> We provide test infrastructure that allows users to easily write tests <br>
> that make assertions about a program state computed by dataflow <br>
> analysis at any code point. Code is annotated with labels on points of <br>
> interest and then test expectations are written with respect to the <br>
> labeled points.<br>
><br>
><br>
> For example, consider an analysis that computes the state of <br>
> optional values.<br>
><br>
> ExpectDataflowAnalysis<OptionalChecker>(R"(<br>
>   void f(std::optional<int> opt) {<br>
>     if (opt.has_value()) {<br>
>       // [[true-branch]]<br>
>     } else {<br>
>       // [[false-branch]]<br>
>     }<br>
>   })", [](const auto& results) {<br>
>     EXPECT_LE(results.ValueAtCodePoint("true-branch"), OptionalLattice::Engaged());<br>
>     EXPECT_LE(results.ValueAtCodePoint("false-branch"), OptionalLattice::NullOpt());<br>
>   });<br>
><br>
><br>
> Timeline<br>
><br>
> We have a fully working implementation of the framework as proposed <br>
> here. We also are interested in contributing a clang-tidy checker for <br>
> std::optional and related types as soon as possible, given its <br>
> demonstrated efficacy and interest from third parties. To that end, <br>
> as soon as we have consensus for starting this work in the clang <br>
> repository, we plan to start sending patches for review. Given that <br>
> the framework is relatively small, we expect it will fit within 5 <br>
> (largish) patches, meaning we can hope to have it available by <br>
> December 2021.<br>
><br>
><br>
> Future Work<br>
><br>
> We consider our first version of the framework as experimental. It <br>
> suits the needs of a number of particular checks that we’ve been <br>
> writing, but it has some obvious limitations with respect to general <br>
> use. After we commit our first version, here are some planned areas <br>
> of improvement and investigation.<br>
><br>
><br>
> Backwards analyses. The framework only solves forward dataflow <br>
> equations. This clearly prevents implementing some very familiar and <br>
> useful analyses, like variable liveness. We intend to expand the <br>
> framework to also support backwards dataflow equations. There’s <br>
> nothing inherent to our framework that limits it to forwards.<br>
><br>
><br>
> Reasoning about non-boolean values. Currently, we can symbolically <br>
> model equality of booleans, but nothing else. Yet, for locations on <br>
> the store that haven’t changed, we can understand which values are <br>
> identical (because the input value object is passed by the transfer <br>
> function to the output). For example,<br>
><br>
> void f(bool* a, bool* b) {<br>
>   std::optional<int> opt;<br>
>   if (a == b) { opt = 42; }<br>
>   if (a == b) { opt.value(); }<br>
> }<br>
><br>
> Since neither a nor b has changed, we can reason that the two <br>
> occurrences of the expression a == b evaluate to the same value (even <br>
> though we don't know the exact value). We intend to improve the <br>
> framework to support this increased precision, given that it will not <br>
> cost us any additional complexity.<br>
><br>
><br>
> Improved model of program state. A major obstacle to writing <br>
> high-quality checks for C++ code is the complexity of C++’s <br>
> semantics. In writing our framework, we’ve come to see the value of <br>
> shielding the user from the need to handle these directly by modeling <br>
> program state and letting the user express their check in terms of <br>
> that model, after it has been solved for each program point. As is, <br>
> our model currently supports a number of checks, including the <br>
> std::optional check and another that infers which pointer parameters <br>
> are out parameters. We intend to expand the set of such checks, with <br>
> improvements to the model and the API for expressing domain-specific <br>
> semantics relevant to a check. Additionally, we are considering <br>
> explicitly tracking other abstract program properties, like reads and <br>
> writes to local variables and memory, so that users need not account <br>
> for all the syntactic variants of these operations.<br>
><br>
><br>
> Refactoring existing checks. Once the framework is more mature and <br>
> stable, we would like to refactor existing checks that use custom, <br>
> ad-hoc implementations of dataflow to use this framework.<br>
><br>
><br>
> Conclusion<br>
><br>
> We have proposed an addition to the Clang repository of a new static <br>
> analysis framework targeting all-paths analyses. The intended clients <br>
> of such analyses are code-checking and transformation tools like <br>
> clang-tidy, where sound analyses can support sound transformations of <br>
> code. We discussed the key features and APIs of our framework and how <br>
> they fit together into a new analysis framework. Moreover, our <br>
> proposal is based on experience with a working implementation that <br>
> supports a number of practical, high-precision analyses, including one <br>
> for safe use of types like std::optional and another that finds missed <br>
> opportunities for use of std::move.<br>
><br>
><br>
><br>
<br>
</blockquote></div>