[cfe-dev] [analyzer] Proof-of-concept for Matcher-like checker API (graph matchers)

Mon May 21 18:38:00 PDT 2018

> On May 17, 2018, at 5:28 AM, Gábor Horváth via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> 
> Hi Alexey,
> 
> In general, I like the idea of having a more declarative way to define new
> checks.
> 
> On Thu, 17 May 2018 at 01:37, Alexey Sidorin <alexey.v.sidorin at ya.ru <mailto:alexey.v.sidorin at ya.ru>> wrote:
> Hello everyone,
> 
> I'd like to share some results of my investigation targeted on 
> improvement of Clang Static Analyzer checker API. You can find some 
> previous conversation on this topic here: 
> http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html <http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html>. 
> In my investigation, I tried to solve a particular problem of building a 
> checker without generating new nodes.
> 
> Why do you focus on such checks that does not have traits, does not generate new nodes. 
> Do you find this to be the majority of the checks you need to implement? 
>  
> 
> --------- Introduction and design goals ---------
> 
> In brief, I tried to use matchers-like API to make CSA checkers look 
> like this:
> 
> StatementMatcher NotChdir =
>      callExpr(unless(callee(functionDecl(hasName("::chdir")))));
> Finder.addMatcher(
>      hasSequence(
>          postStmt(hasStatement(
>              callExpr(callee(functionDecl(hasName("::chroot")))))),
>          unless(stmtPoint(hasStatement(callExpr(
>              callee(functionDecl(hasName("::chdir"))),
>              hasArgument(0, hasValue(stringRegion(refersString("/")))))))),
>          explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
>                              callEnter(hasCallExpr(NotChdir))))
>              .bind("bug_node")),
>      &Callback);
> Finder.match(G);
> 
> I think, a common (design) pitfall of writing checks is to try to match against
> the AST when a symbolic execution callback should be used instead.
> I am a bit afraid having an API like this would make this pitfall more common.
> Maybe a separation between the path sensitive, flow sensitive and
> AST based checks is good for the checker writers and new users and
> I am not sure that the same abstraction fits all of these cases.
> 
> In case of path sensitive checks I wonder if a state machine like abstraction would
> fit the use cases better (and also cover checks that are using traits)
> where the guarding conditions of the state transitions can include AST matcher like checks. 
>  
> 
> and I have managed to make some simple working examples.
> 
> The entire diff can be found here: 
> https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f <https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f>
> The code itself: https://github.com/a-sid/clang/tree/a4 <https://github.com/a-sid/clang/tree/a4>
> 
> There are several reasons why I have tried this approach.
> 
> 1. AST Matchers are already extensively used API for AST checking. They 
> are available both in Clang-Tidy and CSA. And I wanted to use existing 
> functionality as much as possible. So, I decided to extend an existing 
> API to make its usage seamless across different kinds of checks: 
> path-sensitive, AST-based and CFG-based.
> 
> I am not sure if this is a good idea. I think the checker author supposed to
> be aware if the check is AST based, flow sensitive, or path sensitive. 
> And this should also be easy to tell by reading the code. I am also not
> sure whether there is an abstraction that fits all.
> 
> I think the reason why this idea works well for checks that only inspect the
> exploded graph, because in this case we are still doing pattern matching on
> graphs in a similar manner as doing pattern matching on the AST.
> 
> But does this generalize later on to stateful checks? 

From my understanding it would only allow matching properties representable as finite automatons,
so e.g. no retain/release matchers.

>  
> 
> 2. AST matchers effectively help clients to avoid a lot of checking of 
> dyn_cast results. This feature not only makes them more convenient but 
> also more safe because, in my experience, forgetting a nullptr/None 
> check is a pretty common mistake for checker writers.
> 
> I think if something cannot be null we should return references, otherwise
> the client need to check for the pointer beeing null (unless there are some
> dynamic invariant that would make the check redundant). If an API does
> not match this philosophy, maybe we should fix the API first. 
> 
> Regardless of fixing the API, I agree that it would be great to have higher
> level APIs for checker authors.
>  
> 
> 3. Testing of AST matchers don't require writing C++ code - it can be 
> done interactively with clang-query tool. And I believe that we need 
> similar functionality for CSA as well.
> 
> Having a query tool for exploded graphs could be very helpful, I agree.
> These graphs tend to be very large and sometimes it is not trivial to
> find the relevant parts during debugging.
>  
> 
> I didn't want to replace existing checker API. Instead, I tried to make 
> new possibilities usable independently or together with existing.
> 
> --------- Design and implementation ---------
> 
> The implementation consisted of a number of steps.
> 
> 1. Allow matching nodes of path-sensitive graph like usual AST nodes. To 
> make this possible, three actions were needed:
>   1.1. ASTTypeTraits and DynTypedNode were extended to support 
> path-sensitive nodes: ExplodedNode, ProgramState, SVal, SymExpr, etc. 
> 
> I wonder, if in general it is a good idea to pattern match in checks on SymExpr.
> A more general approach here would be to use the constraint solver for queries
> in a similar manner how ProgramState::assume works.
>  
> The implementation with graph node support is moved into a separate 
> class (ASTGraphTypeTraits) in ento namespace to resolve cyclic 
> dependencies (they are still exist, unfortunately, but are header-only, 
> so we can build the PoC).
>   1.2. Some additions to AST matchers were made to add support for new 
> kinds of nodes.
>   1.3. To make MatchFinder able to query specific options not supported 
> by pure AST, it was augmented with "Contexts". A matcher that needs to 
> query the path-sensitive engine asks the Finder for the required Context 
> which provides specific helper functions.
> 
> As the result of this step, we are able to write matchers like 
> expr(hasArgument(0, hasValue(stringRegion(refersString("/"))))).
> 
> I think this is the part of the proposal that I like the most, it would be
> a very concise way to write down guarding conditions.
>  
> 
> 2. Create an engine for graph exploration and matching.
>    Unlike normal AST matchers, hasSequence is a path-sensitive matcher. 
> It is launched against ExplodedGraph. These matchers are handled by 
> GraphMatchFinder object. While searching a graph, it collects matches. 
> Each match contains a pointer to the corresponding matcher and State ID 
> of this match. The way how matches are translated from one state ID to 
> another is determined by matcher operators.
> 
> Is this matching done after the exploded graph already built? If so,
> these checks will be unable to generate sink nodes. I think having sink
> nodes sometimes desirable even if the check itself does not have a trait.
>  
> 
>    For example, SequenceVariadicOperator, which is the base of 
> hasSequence() matcher, has "positive" and "negative" sub-matches. Each 
> positive matcher has its corresponding State ID. In order to advance to 
> the next State ID, a node being matched should match all "negative" 
> matchers before the next "positive" matchers and the next "positive" 
> matcher itself. Failure in matching "negative" matcher discards the match.
> 
>    The role of GraphMatchFinder is similar to MatchFinder: it is only 
> responsible for graph exploration and keeping bound nodes and matchers.
> 
> 3. Manage bound nodes.
>    While matching a single graph node, BoundNodes from AST MatchFinder 
> are used. AST matchers for path-sensitive nodes support bindings 
> out-of-the-box. However, the situation with graph matching is a bit 
> different. For graph matching, we have another system of bound nodes. 
> Each graph node has a related map of bounds aka GDMTy (yes, the name is 
> not a coincidence). GDMTy is a mapping from match ID to BoundNodesMap 
> which, in part, is effectively a map from std::string to DynTypedNodes. 
> This look pretty much like how GDM is organized in ExplodedGraph, just 
> without one level of indirection (it can be added, though).
> 
>    MatchFinder contexts should allow support of their own bindings. This 
> is how equalsBoundNode() matcher works for path-sensitive nodes: it just 
> queries all available contexts for the binding.
> 
> Finally, I have managed to make two checkers work in this way: 
> ChrootChecker (which is present in the introduction) and 
> TestAfterDivZeroChecker. Both them can be found in ChrootCheckerV2.cpp 
> and TestAfterDivZeroCheckerV2.cpp correspondingly. They act like normal 
> checkers: produce warnings and use visitors. The main difference is that 
> they cannot generate sinks and use checkEndAnalysis callback. The code 
> of these checkers can be found here:
> 
> https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp <https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp>
> https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp <https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp>
> 
> 
> -------- Some features --------
> 
> 1.The design of bindings has an interesting consequence: it enables the 
> dynamic introspection of GDM which was pretty hard before. (Hello alpha 
> renaming?)
> 2. Nothing prevents matchers to be used with existing checker API for 
> simplifying conditional checking inside callbacks. The matchers are not 
> planned as the replacement for the current API, but look like a nice 
> complementary part.
> 3. Implicit conversion of Matcher<ProgramPoint> to Matcher<ExplodedNode> 
> and Matcher<SymExpr || MemRegion> to Matcher<SVal> for writing shorter code.
> 
> -------- Not implemented/not checked yet --------
> I tried to keep the PoC as minimal as possible. As the result, some 
> features are not implemented yet, but I believe they can be added.
> 1. Usage of matchers inside checker callbacks
>    This is not exactly unimplemented, but is still untested.
> 2. Dynamic matchers (clang-query interface)
>    In addition to work for supporting dynamic nodes (I don't know how 
> many efforts it can take), we need to enable matching against AST nodes, 
> not graph. But it doesn't look as a problem, we can just make matchers 
> polymorphic.
> 3. Binding to successfully matched paths is not implemented yet - only 
> bindings to separate nodes. I wonder if we need path bindings at all.
> 4. Some corner cases are still FIXMEs: single-node sequences, for example.
> 5. GDM is not based on immutable data structures like it is done in CSA 
> - it is just an STL map. Its performance can be poor (full copy on every 
> new node), but I don't think that changing it to use immutable 
> structures is hard.
> 6. Matching on-the-fly
>    GraphMatchFinder should support on-the-fly matching during 
> ExplodedGraph building. For this support, we should just call its 
> advance() method each time a new node is created. However, this 
> possibility wasn't checked yet.
> 7. Matching CFG and CallGraph isn't implemented because it was 
> considered to be far out of simple PoC.
> 8. Only sequential matching is available now, and I didn't try to 
> implement other operators yet. So, implementing a checker similar to 
> PthreadLock can be tricky or even impossible for now.
> 
> -------- Known and potential issues --------
>  From matchers' side:
> 1. Some tests don't pass because they rely on the checker ability to 
> generate sink nodes. Our matchers cannot do it by design so tests don't 
> pass.
> 2. Representing checker bindings as a map can be less effective than 
> storing data in structures. I didn't measure the overhead, so I cannot 
> give any numbers.
> 3. Matchers are called on every node independently of its type. This is 
> not what CheckerManager does. I wonder if this detail can affect 
> performance as well.
> 4. Problems with cyclic dependencies. clangStaticAnalyzer has a 
> dependency on clangASTMatchers, therefore, clangASTMatchers cannot 
> depend on clangStaticAnalyzer. However, if we want ASTMatchers to 
> support static analyzer data structures, there should be a backward 
> dependency. Now this dependency is header-only.
> 5. Code duplication. This is mostly fine for a sandbox but needs to be 
> reduced.
> 
>  From analyzer's side:
> 1. Many events are not reflected in the final ExplodedGraph. For 
> example, checkers can receive PointerEscape event, but the event itself 
> will not be recorded in the ExplodedGraph - only changes made by 
> checkers will. That's also true for Stmt nodes: I noticed same issues 
> with PostCondition. This makes matching a bit harder. Should we add some 
> information into ExplodedGraph?
> 2. (Minor) Some static analyzer data structures don't support 
> traditional LLVM casting methods (dyn_cast, isa) because they lack 
> classof() method. I have fixed this internally and will put a patch on 
> review soon.
> 
> -------- Conclusion --------
> Finally, there is a graph matching engine supporting basic functionality 
> and two checkers using it. I apologize for not starting the discussion 
> earlier, but I just wasn't sure that the approach will work. Is anybody 
> interested to have this stuff in upstream (maybe, changed or improved)? 
> If yes, I'll be happy to contribute this work patch-by-patch and 
> continue its improvement. If no - well, I had enough fun playing with it.
> 
> Thanks for sharing this results. Regardless of being upstreamed as is or
> in a separate form or not at all, this is an interesting experiment for the
> community. 
>  
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180521/c46f7566/attachment.html>