[cfe-dev] [analyzer] Proof-of-concept for Matcher-like checker API (graph matchers)

Tue May 22 16:11:17 PDT 2018

21.05.2018 21:53, Artem Dergachev пишет:
>
>
> On 5/16/18 4:37 PM, Alexey Sidorin wrote:
>> Hello everyone,
>>
>> I'd like to share some results of my investigation targeted on 
>> improvement of Clang Static Analyzer checker API. You can find some 
>> previous conversation on this topic here: 
>> http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html. 
>> In my investigation, I tried to solve a particular problem of 
>> building a checker without generating new nodes.
>>
>> --------- Introduction and design goals ---------
>>
>> In brief, I tried to use matchers-like API to make CSA checkers look 
>> like this:
>>
>> StatementMatcher NotChdir =
>>     callExpr(unless(callee(functionDecl(hasName("::chdir")))));
>> Finder.addMatcher(
>>     hasSequence(
>>         postStmt(hasStatement(
>> callExpr(callee(functionDecl(hasName("::chroot")))))),
>>         unless(stmtPoint(hasStatement(callExpr(
>>             callee(functionDecl(hasName("::chdir"))),
>>             hasArgument(0, 
>> hasValue(stringRegion(refersString("/")))))))),
>>         explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
>>                             callEnter(hasCallExpr(NotChdir))))
>>             .bind("bug_node")),
>>     &Callback);
>> Finder.match(G);
>>
>> and I have managed to make some simple working examples.
>>
>> The entire diff can be found here: 
>> https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f
>> The code itself: https://github.com/a-sid/clang/tree/a4
>>
>> There are several reasons why I have tried this approach.
>>
>> 1. AST Matchers are already extensively used API for AST checking. 
>> They are available both in Clang-Tidy and CSA. And I wanted to use 
>> existing functionality as much as possible. So, I decided to extend 
>> an existing API to make its usage seamless across different kinds of 
>> checks: path-sensitive, AST-based and CFG-based.
>>
>> 2. AST matchers effectively help clients to avoid a lot of checking 
>> of dyn_cast results. This feature not only makes them more convenient 
>> but also more safe because, in my experience, forgetting a 
>> nullptr/None check is a pretty common mistake for checker writers.
>>
>> 3. Testing of AST matchers don't require writing C++ code - it can be 
>> done interactively with clang-query tool. And I believe that we need 
>> similar functionality for CSA as well.
>>
>> I didn't want to replace existing checker API. Instead, I tried to 
>> make new possibilities usable independently or together with existing.
>>
>> --------- Design and implementation ---------
>>
>> The implementation consisted of a number of steps.
>>
>> 1. Allow matching nodes of path-sensitive graph like usual AST nodes. 
>> To make this possible, three actions were needed:
>>  1.1. ASTTypeTraits and DynTypedNode were extended to support 
>> path-sensitive nodes: ExplodedNode, ProgramState, SVal, SymExpr, etc. 
>> The implementation with graph node support is moved into a separate 
>> class (ASTGraphTypeTraits) in ento namespace to resolve cyclic 
>> dependencies (they are still exist, unfortunately, but are 
>> header-only, so we can build the PoC).
>>  1.2. Some additions to AST matchers were made to add support for new 
>> kinds of nodes.
>>  1.3. To make MatchFinder able to query specific options not 
>> supported by pure AST, it was augmented with "Contexts". A matcher 
>> that needs to query the path-sensitive engine asks the Finder for the 
>> required Context which provides specific helper functions.
>>
>> As the result of this step, we are able to write matchers like 
>> expr(hasArgument(0, hasValue(stringRegion(refersString("/"))))).
>>
>> 2. Create an engine for graph exploration and matching.
>>   Unlike normal AST matchers, hasSequence is a path-sensitive 
>> matcher. It is launched against ExplodedGraph. These matchers are 
>> handled by GraphMatchFinder object. While searching a graph, it 
>> collects matches. Each match contains a pointer to the corresponding 
>> matcher and State ID of this match. The way how matches are 
>> translated from one state ID to another is determined by matcher 
>> operators.
>>
>>   For example, SequenceVariadicOperator, which is the base of 
>> hasSequence() matcher, has "positive" and "negative" sub-matches. 
>> Each positive matcher has its corresponding State ID. In order to 
>> advance to the next State ID, a node being matched should match all 
>> "negative" matchers before the next "positive" matchers and the next 
>> "positive" matcher itself. Failure in matching "negative" matcher 
>> discards the match.
>>
>>   The role of GraphMatchFinder is similar to MatchFinder: it is only 
>> responsible for graph exploration and keeping bound nodes and matchers.
>>
>> 3. Manage bound nodes.
>>   While matching a single graph node, BoundNodes from AST MatchFinder 
>> are used. AST matchers for path-sensitive nodes support bindings 
>> out-of-the-box. However, the situation with graph matching is a bit 
>> different. For graph matching, we have another system of bound nodes. 
>> Each graph node has a related map of bounds aka GDMTy (yes, the name 
>> is not a coincidence). GDMTy is a mapping from match ID to 
>> BoundNodesMap which, in part, is effectively a map from std::string 
>> to DynTypedNodes. This look pretty much like how GDM is organized in 
>> ExplodedGraph, just without one level of indirection (it can be 
>> added, though).
>>
>>   MatchFinder contexts should allow support of their own bindings. 
>> This is how equalsBoundNode() matcher works for path-sensitive nodes: 
>> it just queries all available contexts for the binding.
>>
>> Finally, I have managed to make two checkers work in this way: 
>> ChrootChecker (which is present in the introduction) and 
>> TestAfterDivZeroChecker. Both them can be found in 
>> ChrootCheckerV2.cpp and TestAfterDivZeroCheckerV2.cpp 
>> correspondingly. They act like normal checkers: produce warnings and 
>> use visitors. The main difference is that they cannot generate sinks 
>> and use checkEndAnalysis callback. The code of these checkers can be 
>> found here:
>>
>> https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp 
>>
>> https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp 
>>
>>
>>
>> -------- Some features --------
>>
>> 1.The design of bindings has an interesting consequence: it enables 
>> the dynamic introspection of GDM which was pretty hard before. (Hello 
>> alpha renaming?)
>> 2. Nothing prevents matchers to be used with existing checker API for 
>> simplifying conditional checking inside callbacks. The matchers are 
>> not planned as the replacement for the current API, but look like a 
>> nice complementary part.
>> 3. Implicit conversion of Matcher<ProgramPoint> to 
>> Matcher<ExplodedNode> and Matcher<SymExpr || MemRegion> to 
>> Matcher<SVal> for writing shorter code.
>>
>> -------- Not implemented/not checked yet --------
>> I tried to keep the PoC as minimal as possible. As the result, some 
>> features are not implemented yet, but I believe they can be added.
>> 1. Usage of matchers inside checker callbacks
>>   This is not exactly unimplemented, but is still untested.
>> 2. Dynamic matchers (clang-query interface)
>>   In addition to work for supporting dynamic nodes (I don't know how 
>> many efforts it can take), we need to enable matching against AST 
>> nodes, not graph. But it doesn't look as a problem, we can just make 
>> matchers polymorphic.
>> 3. Binding to successfully matched paths is not implemented yet - 
>> only bindings to separate nodes. I wonder if we need path bindings at 
>> all.
>> 4. Some corner cases are still FIXMEs: single-node sequences, for 
>> example.
>> 5. GDM is not based on immutable data structures like it is done in 
>> CSA - it is just an STL map. Its performance can be poor (full copy 
>> on every new node), but I don't think that changing it to use 
>> immutable structures is hard.
>> 6. Matching on-the-fly
>>   GraphMatchFinder should support on-the-fly matching during 
>> ExplodedGraph building. For this support, we should just call its 
>> advance() method each time a new node is created. However, this 
>> possibility wasn't checked yet.
>> 7. Matching CFG and CallGraph isn't implemented because it was 
>> considered to be far out of simple PoC.
>> 8. Only sequential matching is available now, and I didn't try to 
>> implement other operators yet. So, implementing a checker similar to 
>> PthreadLock can be tricky or even impossible for now.
>>
>> -------- Known and potential issues --------
>> From matchers' side:
>> 1. Some tests don't pass because they rely on the checker ability to 
>> generate sink nodes. Our matchers cannot do it by design so tests 
>> don't pass.
>> 2. Representing checker bindings as a map can be less effective than 
>> storing data in structures. I didn't measure the overhead, so I 
>> cannot give any numbers.
>> 3. Matchers are called on every node independently of its type. This 
>> is not what CheckerManager does. I wonder if this detail can affect 
>> performance as well.
>> 4. Problems with cyclic dependencies. clangStaticAnalyzer has a 
>> dependency on clangASTMatchers, therefore, clangASTMatchers cannot 
>> depend on clangStaticAnalyzer. However, if we want ASTMatchers to 
>> support static analyzer data structures, there should be a backward 
>> dependency. Now this dependency is header-only.
>> 5. Code duplication. This is mostly fine for a sandbox but needs to 
>> be reduced.
>>
>> From analyzer's side:
>> 1. Many events are not reflected in the final ExplodedGraph. For 
>> example, checkers can receive PointerEscape event
>
> Hmm, what if we "invert" this event, i.e. make it opt-out rather than 
> opt-in, i.e., pointers are escaped by the matcher engine 
> automatically, but if the checker wants a certain event to not escape 
> the pointer, it adds extra matching code, such as "the sequence may 
> include 0 or more passes of the symbol to a certain set of functions 
> in a certain manner".
i'm not sure this will work for other kinds of "flaky" nodes - for 
PostCondition, for example. PostCondition nodes can only be generated by 
checkers and cannot be matched without dirty hacks (like force 
generation of such nodes).
>
>> , but the event itself will not be recorded in the ExplodedGraph - 
>> only changes made by checkers will. That's also true for Stmt nodes: 
>> I noticed same issues with PostCondition. This makes matching a bit 
>> harder. Should we add some information into ExplodedGraph?
>> 2. (Minor) Some static analyzer data structures don't support 
>> traditional LLVM casting methods (dyn_cast, isa) because they lack 
>> classof() method. I have fixed this internally and will put a patch 
>> on review soon.
>>
>> -------- Conclusion --------
>> Finally, there is a graph matching engine supporting basic 
>> functionality and two checkers using it. I apologize for not starting 
>> the discussion earlier, but I just wasn't sure that the approach will 
>> work. Is anybody interested to have this stuff in upstream (maybe, 
>> changed or improved)? If yes, I'll be happy to contribute this work 
>> patch-by-patch and continue its improvement. If no - well, I had 
>> enough fun playing with it.
>>
>