[cfe-dev] [analyzer] Proof-of-concept for Matcher-like checker API (graph matchers)

Wed May 16 16:37:11 PDT 2018

Hello everyone,

I'd like to share some results of my investigation targeted on 
improvement of Clang Static Analyzer checker API. You can find some 
previous conversation on this topic here: 
http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html. 
In my investigation, I tried to solve a particular problem of building a 
checker without generating new nodes.

--------- Introduction and design goals ---------

In brief, I tried to use matchers-like API to make CSA checkers look 
like this:

StatementMatcher NotChdir =
     callExpr(unless(callee(functionDecl(hasName("::chdir")))));
Finder.addMatcher(
     hasSequence(
         postStmt(hasStatement(
             callExpr(callee(functionDecl(hasName("::chroot")))))),
         unless(stmtPoint(hasStatement(callExpr(
             callee(functionDecl(hasName("::chdir"))),
             hasArgument(0, hasValue(stringRegion(refersString("/")))))))),
         explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
                             callEnter(hasCallExpr(NotChdir))))
             .bind("bug_node")),
     &Callback);
Finder.match(G);

and I have managed to make some simple working examples.

The entire diff can be found here: 
https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f
The code itself: https://github.com/a-sid/clang/tree/a4

There are several reasons why I have tried this approach.

1. AST Matchers are already extensively used API for AST checking. They 
are available both in Clang-Tidy and CSA. And I wanted to use existing 
functionality as much as possible. So, I decided to extend an existing 
API to make its usage seamless across different kinds of checks: 
path-sensitive, AST-based and CFG-based.

2. AST matchers effectively help clients to avoid a lot of checking of 
dyn_cast results. This feature not only makes them more convenient but 
also more safe because, in my experience, forgetting a nullptr/None 
check is a pretty common mistake for checker writers.

3. Testing of AST matchers don't require writing C++ code - it can be 
done interactively with clang-query tool. And I believe that we need 
similar functionality for CSA as well.

I didn't want to replace existing checker API. Instead, I tried to make 
new possibilities usable independently or together with existing.

--------- Design and implementation ---------

The implementation consisted of a number of steps.

1. Allow matching nodes of path-sensitive graph like usual AST nodes. To 
make this possible, three actions were needed:
  1.1. ASTTypeTraits and DynTypedNode were extended to support 
path-sensitive nodes: ExplodedNode, ProgramState, SVal, SymExpr, etc. 
The implementation with graph node support is moved into a separate 
class (ASTGraphTypeTraits) in ento namespace to resolve cyclic 
dependencies (they are still exist, unfortunately, but are header-only, 
so we can build the PoC).
  1.2. Some additions to AST matchers were made to add support for new 
kinds of nodes.
  1.3. To make MatchFinder able to query specific options not supported 
by pure AST, it was augmented with "Contexts". A matcher that needs to 
query the path-sensitive engine asks the Finder for the required Context 
which provides specific helper functions.

As the result of this step, we are able to write matchers like 
expr(hasArgument(0, hasValue(stringRegion(refersString("/"))))).

2. Create an engine for graph exploration and matching.
   Unlike normal AST matchers, hasSequence is a path-sensitive matcher. 
It is launched against ExplodedGraph. These matchers are handled by 
GraphMatchFinder object. While searching a graph, it collects matches. 
Each match contains a pointer to the corresponding matcher and State ID 
of this match. The way how matches are translated from one state ID to 
another is determined by matcher operators.

   For example, SequenceVariadicOperator, which is the base of 
hasSequence() matcher, has "positive" and "negative" sub-matches. Each 
positive matcher has its corresponding State ID. In order to advance to 
the next State ID, a node being matched should match all "negative" 
matchers before the next "positive" matchers and the next "positive" 
matcher itself. Failure in matching "negative" matcher discards the match.

   The role of GraphMatchFinder is similar to MatchFinder: it is only 
responsible for graph exploration and keeping bound nodes and matchers.

3. Manage bound nodes.
   While matching a single graph node, BoundNodes from AST MatchFinder 
are used. AST matchers for path-sensitive nodes support bindings 
out-of-the-box. However, the situation with graph matching is a bit 
different. For graph matching, we have another system of bound nodes. 
Each graph node has a related map of bounds aka GDMTy (yes, the name is 
not a coincidence). GDMTy is a mapping from match ID to BoundNodesMap 
which, in part, is effectively a map from std::string to DynTypedNodes. 
This look pretty much like how GDM is organized in ExplodedGraph, just 
without one level of indirection (it can be added, though).

   MatchFinder contexts should allow support of their own bindings. This 
is how equalsBoundNode() matcher works for path-sensitive nodes: it just 
queries all available contexts for the binding.

Finally, I have managed to make two checkers work in this way: 
ChrootChecker (which is present in the introduction) and 
TestAfterDivZeroChecker. Both them can be found in ChrootCheckerV2.cpp 
and TestAfterDivZeroCheckerV2.cpp correspondingly. They act like normal 
checkers: produce warnings and use visitors. The main difference is that 
they cannot generate sinks and use checkEndAnalysis callback. The code 
of these checkers can be found here:

https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp
https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp

-------- Some features --------

1.The design of bindings has an interesting consequence: it enables the 
dynamic introspection of GDM which was pretty hard before. (Hello alpha 
renaming?)
2. Nothing prevents matchers to be used with existing checker API for 
simplifying conditional checking inside callbacks. The matchers are not 
planned as the replacement for the current API, but look like a nice 
complementary part.
3. Implicit conversion of Matcher<ProgramPoint> to Matcher<ExplodedNode> 
and Matcher<SymExpr || MemRegion> to Matcher<SVal> for writing shorter code.

-------- Not implemented/not checked yet --------
I tried to keep the PoC as minimal as possible. As the result, some 
features are not implemented yet, but I believe they can be added.
1. Usage of matchers inside checker callbacks
   This is not exactly unimplemented, but is still untested.
2. Dynamic matchers (clang-query interface)
   In addition to work for supporting dynamic nodes (I don't know how 
many efforts it can take), we need to enable matching against AST nodes, 
not graph. But it doesn't look as a problem, we can just make matchers 
polymorphic.
3. Binding to successfully matched paths is not implemented yet - only 
bindings to separate nodes. I wonder if we need path bindings at all.
4. Some corner cases are still FIXMEs: single-node sequences, for example.
5. GDM is not based on immutable data structures like it is done in CSA 
- it is just an STL map. Its performance can be poor (full copy on every 
new node), but I don't think that changing it to use immutable 
structures is hard.
6. Matching on-the-fly
   GraphMatchFinder should support on-the-fly matching during 
ExplodedGraph building. For this support, we should just call its 
advance() method each time a new node is created. However, this 
possibility wasn't checked yet.
7. Matching CFG and CallGraph isn't implemented because it was 
considered to be far out of simple PoC.
8. Only sequential matching is available now, and I didn't try to 
implement other operators yet. So, implementing a checker similar to 
PthreadLock can be tricky or even impossible for now.

-------- Known and potential issues --------
 From matchers' side:
1. Some tests don't pass because they rely on the checker ability to 
generate sink nodes. Our matchers cannot do it by design so tests don't 
pass.
2. Representing checker bindings as a map can be less effective than 
storing data in structures. I didn't measure the overhead, so I cannot 
give any numbers.
3. Matchers are called on every node independently of its type. This is 
not what CheckerManager does. I wonder if this detail can affect 
performance as well.
4. Problems with cyclic dependencies. clangStaticAnalyzer has a 
dependency on clangASTMatchers, therefore, clangASTMatchers cannot 
depend on clangStaticAnalyzer. However, if we want ASTMatchers to 
support static analyzer data structures, there should be a backward 
dependency. Now this dependency is header-only.
5. Code duplication. This is mostly fine for a sandbox but needs to be 
reduced.

 From analyzer's side:
1. Many events are not reflected in the final ExplodedGraph. For 
example, checkers can receive PointerEscape event, but the event itself 
will not be recorded in the ExplodedGraph - only changes made by 
checkers will. That's also true for Stmt nodes: I noticed same issues 
with PostCondition. This makes matching a bit harder. Should we add some 
information into ExplodedGraph?
2. (Minor) Some static analyzer data structures don't support 
traditional LLVM casting methods (dyn_cast, isa) because they lack 
classof() method. I have fixed this internally and will put a patch on 
review soon.

-------- Conclusion --------
Finally, there is a graph matching engine supporting basic functionality 
and two checkers using it. I apologize for not starting the discussion 
earlier, but I just wasn't sure that the approach will work. Is anybody 
interested to have this stuff in upstream (maybe, changed or improved)? 
If yes, I'll be happy to contribute this work patch-by-patch and 
continue its improvement. If no - well, I had enough fun playing with it.