<div dir="ltr"><div>Hi Alexey,</div><div><br></div><div>In general, I like the idea of having a more declarative way to define new</div><div>checks.<br></div><div><br></div><div><div class="gmail_quote"><div dir="ltr">On Thu, 17 May 2018 at 01:37, Alexey Sidorin <<a href="mailto:alexey.v.sidorin@ya.ru">alexey.v.sidorin@ya.ru</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello everyone,<br>

<br>

I'd like to share some results of my investigation targeted on <br>

improvement of Clang Static Analyzer checker API. You can find some <br>

previous conversation on this topic here: <br>

<a href="http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html" rel="noreferrer" target="_blank">http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html</a>. <br>

In my investigation, I tried to solve a particular problem of building a <br>

checker without generating new nodes.<br></blockquote><div><br></div><div>Why do you focus on such checks that does not have traits, does not generate new nodes. <br></div><div>Do you find this to be the majority of the checks you need to implement? <br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

--------- Introduction and design goals ---------<br>

<br>

In brief, I tried to use matchers-like API to make CSA checkers look <br>

like this:<br>

<br>

StatementMatcher NotChdir =<br>

     callExpr(unless(callee(functionDecl(hasName("::chdir")))));<br>

Finder.addMatcher(<br>

     hasSequence(<br>

         postStmt(hasStatement(<br>

             callExpr(callee(functionDecl(hasName("::chroot")))))),<br>

         unless(stmtPoint(hasStatement(callExpr(<br>

             callee(functionDecl(hasName("::chdir"))),<br>

             hasArgument(0, hasValue(stringRegion(refersString("/")))))))),<br>

         explodedNode(anyOf(postStmt(hasStatement(NotChdir)),<br>

                             callEnter(hasCallExpr(NotChdir))))<br>

             .bind("bug_node")),<br>

     &Callback);<br>

Finder.match(G);<br></blockquote><div><br></div><div>I think, a common (design) pitfall of writing checks is to try to match against</div><div>the AST when a symbolic execution callback should be used instead.</div><div>I am a bit afraid having an API like this would make this pitfall more common.</div><div>Maybe a separation between the path sensitive, flow sensitive and</div><div>AST based checks is good for the checker writers and new users and</div><div>I am not sure that the same abstraction fits all of these cases.</div><div><br></div><div>In case of path sensitive checks I wonder if a state machine like abstraction would</div><div>fit the use cases better (and also cover checks that are using traits)</div><div>where the guarding conditions of the state transitions can include AST matcher like checks. <br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

and I have managed to make some simple working examples.<br>

<br>

The entire diff can be found here: <br>

<a href="https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f" rel="noreferrer" target="_blank">https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f</a><br>

The code itself: <a href="https://github.com/a-sid/clang/tree/a4" rel="noreferrer" target="_blank">https://github.com/a-sid/clang/tree/a4</a><br>

<br>

There are several reasons why I have tried this approach.<br>

<br>

1. AST Matchers are already extensively used API for AST checking. They <br>

are available both in Clang-Tidy and CSA. And I wanted to use existing <br>

functionality as much as possible. So, I decided to extend an existing <br>

API to make its usage seamless across different kinds of checks: <br>

path-sensitive, AST-based and CFG-based.<br></blockquote><div><br></div><div>I am not sure if this is a good idea. I think the checker author supposed to</div><div>be aware if the check is AST based, flow sensitive, or path sensitive. <br></div><div>And this should also be easy to tell by reading the code. I am also not</div><div>sure whether there is an abstraction that fits all.</div><div><br></div><div>I think the reason why this idea works well for checks that only inspect the</div><div>exploded graph, because in this case we are still doing pattern matching on</div><div>graphs in a similar manner as doing pattern matching on the AST.</div><div><br></div><div>But does this generalize later on to stateful checks? <br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

2. AST matchers effectively help clients to avoid a lot of checking of <br>

dyn_cast results. This feature not only makes them more convenient but <br>

also more safe because, in my experience, forgetting a nullptr/None <br>

check is a pretty common mistake for checker writers.<br></blockquote><div><br></div><div><div>I think if something cannot be null we should return references, otherwise</div><div>the client need to check for the pointer beeing null (unless there are some</div><div>dynamic invariant that would make the check redundant). If an API does</div><div>not match this philosophy, maybe we should fix the API first. </div><div><br></div><div>Regardless of fixing the API, I agree that it would be great to have higher</div><div>level APIs for checker authors.</div></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

3. Testing of AST matchers don't require writing C++ code - it can be <br>

done interactively with clang-query tool. And I believe that we need <br>

similar functionality for CSA as well.<br></blockquote><div><br></div>Having a query tool for exploded graphs could be very helpful, I agree.</div><div class="gmail_quote">These graphs tend to be very large and sometimes it is not trivial to</div><div class="gmail_quote">find the relevant parts during debugging.<br></div><div class="gmail_quote"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

I didn't want to replace existing checker API. Instead, I tried to make <br>

new possibilities usable independently or together with existing.<br>

<br>

--------- Design and implementation ---------<br>

<br>

The implementation consisted of a number of steps.<br>

<br>

1. Allow matching nodes of path-sensitive graph like usual AST nodes. To <br>

make this possible, three actions were needed:<br>

  1.1. ASTTypeTraits and DynTypedNode were extended to support <br>

path-sensitive nodes: ExplodedNode, ProgramState, SVal, SymExpr, etc. <br></blockquote><div><br></div><div>I wonder, if in general it is a good idea to pattern match in checks on SymExpr.</div><div>A more general approach here would be to use the constraint solver for queries</div><div>in a similar manner how ProgramState::assume works.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

The implementation with graph node support is moved into a separate <br>

class (ASTGraphTypeTraits) in ento namespace to resolve cyclic <br>

dependencies (they are still exist, unfortunately, but are header-only, <br>

so we can build the PoC).<br>

  1.2. Some additions to AST matchers were made to add support for new <br>

kinds of nodes.<br>

  1.3. To make MatchFinder able to query specific options not supported <br>

by pure AST, it was augmented with "Contexts". A matcher that needs to <br>

query the path-sensitive engine asks the Finder for the required Context <br>

which provides specific helper functions.<br>

<br>

As the result of this step, we are able to write matchers like <br>

expr(hasArgument(0, hasValue(stringRegion(refersString("/"))))).<br></blockquote><div><br></div><div>I think this is the part of the proposal that I like the most, it would be</div><div>a very concise way to write down guarding conditions.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

2. Create an engine for graph exploration and matching.<br>

   Unlike normal AST matchers, hasSequence is a path-sensitive matcher. <br>

It is launched against ExplodedGraph. These matchers are handled by <br>

GraphMatchFinder object. While searching a graph, it collects matches. <br>

Each match contains a pointer to the corresponding matcher and State ID <br>

of this match. The way how matches are translated from one state ID to <br>

another is determined by matcher operators.<br></blockquote><div><br></div><div>Is this matching done after the exploded graph already built? If so,</div><div>these checks will be unable to generate sink nodes. I think having sink</div><div>nodes sometimes desirable even if the check itself does not have a trait.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

   For example, SequenceVariadicOperator, which is the base of <br>

hasSequence() matcher, has "positive" and "negative" sub-matches. Each <br>

positive matcher has its corresponding State ID. In order to advance to <br>

the next State ID, a node being matched should match all "negative" <br>

matchers before the next "positive" matchers and the next "positive" <br>

matcher itself. Failure in matching "negative" matcher discards the match.<br>

<br>

   The role of GraphMatchFinder is similar to MatchFinder: it is only <br>

responsible for graph exploration and keeping bound nodes and matchers.<br>

<br>

3. Manage bound nodes.<br>

   While matching a single graph node, BoundNodes from AST MatchFinder <br>

are used. AST matchers for path-sensitive nodes support bindings <br>

out-of-the-box. However, the situation with graph matching is a bit <br>

different. For graph matching, we have another system of bound nodes. <br>

Each graph node has a related map of bounds aka GDMTy (yes, the name is <br>

not a coincidence). GDMTy is a mapping from match ID to BoundNodesMap <br>

which, in part, is effectively a map from std::string to DynTypedNodes. <br>

This look pretty much like how GDM is organized in ExplodedGraph, just <br>

without one level of indirection (it can be added, though).<br>

<br>

   MatchFinder contexts should allow support of their own bindings. This <br>

is how equalsBoundNode() matcher works for path-sensitive nodes: it just <br>

queries all available contexts for the binding.<br>

<br>

Finally, I have managed to make two checkers work in this way: <br>

ChrootChecker (which is present in the introduction) and <br>

TestAfterDivZeroChecker. Both them can be found in ChrootCheckerV2.cpp <br>

and TestAfterDivZeroCheckerV2.cpp correspondingly. They act like normal <br>

checkers: produce warnings and use visitors. The main difference is that <br>

they cannot generate sinks and use checkEndAnalysis callback. The code <br>

of these checkers can be found here:<br>

<br>

<a href="https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp" rel="noreferrer" target="_blank">https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp</a><br>

<a href="https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp" rel="noreferrer" target="_blank">https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp</a><br>

<br>

<br>

-------- Some features --------<br>

<br>

1.The design of bindings has an interesting consequence: it enables the <br>

dynamic introspection of GDM which was pretty hard before. (Hello alpha <br>

renaming?)<br>

2. Nothing prevents matchers to be used with existing checker API for <br>

simplifying conditional checking inside callbacks. The matchers are not <br>

planned as the replacement for the current API, but look like a nice <br>

complementary part.<br>

3. Implicit conversion of Matcher<ProgramPoint> to Matcher<ExplodedNode> <br>

and Matcher<SymExpr || MemRegion> to Matcher<SVal> for writing shorter code.<br>

<br>

-------- Not implemented/not checked yet --------<br>

I tried to keep the PoC as minimal as possible. As the result, some <br>

features are not implemented yet, but I believe they can be added.<br>

1. Usage of matchers inside checker callbacks<br>

   This is not exactly unimplemented, but is still untested.<br>

2. Dynamic matchers (clang-query interface)<br>

   In addition to work for supporting dynamic nodes (I don't know how <br>

many efforts it can take), we need to enable matching against AST nodes, <br>

not graph. But it doesn't look as a problem, we can just make matchers <br>

polymorphic.<br>

3. Binding to successfully matched paths is not implemented yet - only <br>

bindings to separate nodes. I wonder if we need path bindings at all.<br>

4. Some corner cases are still FIXMEs: single-node sequences, for example.<br>

5. GDM is not based on immutable data structures like it is done in CSA <br>

- it is just an STL map. Its performance can be poor (full copy on every <br>

new node), but I don't think that changing it to use immutable <br>

structures is hard.<br>

6. Matching on-the-fly<br>

   GraphMatchFinder should support on-the-fly matching during <br>

ExplodedGraph building. For this support, we should just call its <br>

advance() method each time a new node is created. However, this <br>

possibility wasn't checked yet.<br>

7. Matching CFG and CallGraph isn't implemented because it was <br>

considered to be far out of simple PoC.<br>

8. Only sequential matching is available now, and I didn't try to <br>

implement other operators yet. So, implementing a checker similar to <br>

PthreadLock can be tricky or even impossible for now.<br>

<br>

-------- Known and potential issues --------<br>

 From matchers' side:<br>

1. Some tests don't pass because they rely on the checker ability to <br>

generate sink nodes. Our matchers cannot do it by design so tests don't <br>

pass.<br>

2. Representing checker bindings as a map can be less effective than <br>

storing data in structures. I didn't measure the overhead, so I cannot <br>

give any numbers.<br>

3. Matchers are called on every node independently of its type. This is <br>

not what CheckerManager does. I wonder if this detail can affect <br>

performance as well.<br>

4. Problems with cyclic dependencies. clangStaticAnalyzer has a <br>

dependency on clangASTMatchers, therefore, clangASTMatchers cannot <br>

depend on clangStaticAnalyzer. However, if we want ASTMatchers to <br>

support static analyzer data structures, there should be a backward <br>

dependency. Now this dependency is header-only.<br>

5. Code duplication. This is mostly fine for a sandbox but needs to be <br>

reduced.<br>

<br>

 From analyzer's side:<br>

1. Many events are not reflected in the final ExplodedGraph. For <br>

example, checkers can receive PointerEscape event, but the event itself <br>

will not be recorded in the ExplodedGraph - only changes made by <br>

checkers will. That's also true for Stmt nodes: I noticed same issues <br>

with PostCondition. This makes matching a bit harder. Should we add some <br>

information into ExplodedGraph?<br>

2. (Minor) Some static analyzer data structures don't support <br>

traditional LLVM casting methods (dyn_cast, isa) because they lack <br>

classof() method. I have fixed this internally and will put a patch on <br>

review soon.<br>

<br>

-------- Conclusion --------<br>

Finally, there is a graph matching engine supporting basic functionality <br>

and two checkers using it. I apologize for not starting the discussion <br>

earlier, but I just wasn't sure that the approach will work. Is anybody <br>

interested to have this stuff in upstream (maybe, changed or improved)? <br>

If yes, I'll be happy to contribute this work patch-by-patch and <br>

continue its improvement. If no - well, I had enough fun playing with it.<br></blockquote><div><br></div><div>Thanks for sharing this results. Regardless of being upstreamed as is or</div><div>in a separate form or not at all, this is an interesting experiment for the</div><div>community. <br></div><div> </div></div></div></div>