[cfe-dev] [analyzer] Proof-of-concept for Matcher-like checker API (graph matchers)

Tue May 22 16:07:51 PDT 2018

18.05.2018 04:47, Artem Dergachev пишет:
> On 5/17/18 5:28 AM, Gábor Horváth wrote:
>> Hi Alexey,
>>
>> In general, I like the idea of having a more declarative way to 
>> define new
>> checks.
>>
>> On Thu, 17 May 2018 at 01:37, Alexey Sidorin <alexey.v.sidorin at ya.ru 
>> <mailto:alexey.v.sidorin at ya.ru>> wrote:
>>
>>     Hello everyone,
>>
>>     I'd like to share some results of my investigation targeted on
>>     improvement of Clang Static Analyzer checker API. You can find some
>>     previous conversation on this topic here:
>>     http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html.
>>
>>     In my investigation, I tried to solve a particular problem of
>>     building a
>>     checker without generating new nodes.
>>
>>
>> Why do you focus on such checks that does not have traits, does not 
>> generate new nodes.
>> Do you find this to be the majority of the checks you need to implement?
>
> Ok, so as far as i understand, these new checkers don't support traits 
> because they don't *need* them. Instead, they plan to use 
> "backreferences" in matchers, i.e. double file descriptor close can be 
> described as:
>
>   hasSequence(
>     call(hasName("fclose"), hasArg(0, sym().bind("fd"))),
>     call(hasName("fclose"), hasArg(0, sym(equalsBoundNode("fd"))))
>   );
>
> Right?
Yes, exactly.

>
> So we can't eliminate the path on which the pointer is null after we 
> dereference it, like we normally do, we can't introduce state splits, 
> we can't produce fatal error nodes, but we can still use finite state 
> machines.
As I told before, I don't like state splitting and generating sinks in 
checker. However, it is not _exactly_ true that we cannot do it with 
matchers - because there are not only matchers, but are also 
_callbacks_. If we perform a match on-the-fly (I didn't implement this, 
unfortunately, but it doesn't look hard), we can do almost anything in 
the callback after a match was found.

>
> The real question is, how do we implement RetainCountChecker that 
> needs to count the potentially infinite number of retains and releases 
> before emitting the warning, and is therefore not a finite state 
> machine?^^
I'm not sure that I understood you correctly, but if we want to use 
something different from sequential model, we can implement our custom 
path matchers - like it is done for AST matchers. It took some time to 
implement a working (but still poorly designed) example:
https://github.com/a-sid/clang/commit/7b94a5e648eed32b9fdc7cb7797b9450b0f3d1ed#diff-0aec754c14839c59eb296a889a7940dbR180. 

The idea was to generalize some most common search patterns into 
matchers. Counting can be implemented pretty easily because counter 
states are still integers. However, if our state cannot be represented 
as an integer, there can be some troubles.
What I have also realized is that some way(s) of using different 
matchers together is needed. Reference counting is pretty useless on its 
own, it needs some actions to match if we are searching for a bug.

>>
>>     --------- Introduction and design goals ---------
>>
>>     In brief, I tried to use matchers-like API to make CSA checkers look
>>     like this:
>>
>>     StatementMatcher NotChdir =
>>     callExpr(unless(callee(functionDecl(hasName("::chdir")))));
>>     Finder.addMatcher(
>>          hasSequence(
>>              postStmt(hasStatement(
>>     callExpr(callee(functionDecl(hasName("::chroot")))))),
>>              unless(stmtPoint(hasStatement(callExpr(
>>                  callee(functionDecl(hasName("::chdir"))),
>>                  hasArgument(0,
>>     hasValue(stringRegion(refersString("/")))))))),
>>     explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
>>     callEnter(hasCallExpr(NotChdir))))
>>                  .bind("bug_node")),
>>          &Callback);
>>     Finder.match(G);
>>
>>
>
> Aha, so am i understanding correctly that unless() here is not your 
> normal unless()? The normal unless() clause would have meant
>
>     "somewhere between chroot() and the offending call there's a node 
> that's not chdir()"
>
> but what we're trying to say is
>
>     "nowhere between chroot() and the offending call there's a node 
> that's chdir()"
>
> ?
Yes. I should probably create a normal matcher and don't reuse unless() 
for this to make it less confusive.

>> I think, a common (design) pitfall of writing checks is to try to 
>> match against
>> the AST when a symbolic execution callback should be used instead.
>> I am a bit afraid having an API like this would make this pitfall 
>> more common.
>> Maybe a separation between the path sensitive, flow sensitive and
>> AST based checks is good for the checker writers and new users and
>> I am not sure that the same abstraction fits all of these cases.
>>
>> In case of path sensitive checks I wonder if a state machine like 
>> abstraction would
>> fit the use cases better (and also cover checks that are using traits)
>> where the guarding conditions of the state transitions can include 
>> AST matcher like checks.
>>
>>
>>     and I have managed to make some simple working examples.
>>
>>     The entire diff can be found here:
>>     https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f
>>     The code itself: https://github.com/a-sid/clang/tree/a4
>>
>>     There are several reasons why I have tried this approach.
>>
>>     1. AST Matchers are already extensively used API for AST
>>     checking. They
>>     are available both in Clang-Tidy and CSA. And I wanted to use
>>     existing
>>     functionality as much as possible. So, I decided to extend an
>>     existing
>>     API to make its usage seamless across different kinds of checks:
>>     path-sensitive, AST-based and CFG-based.
>>
>
> When we say CFG-based checks, we usually mean "all paths" checks that 
> find bugs indicated by a certain invariant holding on all paths 
> through a certain program point(s). Such as TestAfterDivideZero or 
> DeadStore.
>
> The problem with CFG-based "all paths" checks that i don't really know 
> how to solve and that's probably not going to be solved by giving them 
> a good API is error reporting. Imagine a conversation with a user:
>
>   A n a l y z e r :  Hey this assignment is not used.
>
>   U s e r :  But hey, it's used here *points to the code*.
>
>   A n a l y z e r :  But this code is dead.
>
>   U s e r :  Why so? Suppose 'x' is equal to true, we jump straight 
> into this code.
>
>   A n a l y z e r :  That's because the initializer for 'x' is in fact 
> always evaluated to false.
>
>   U s e r :  WHY t___t
>
> Etc. The first piece of information is produced by a DeadStore 
> checker. The second piece of information is produced by some sort of 
> DeadCode checker (currently alpha). The third piece of information is 
> produced by some sort of SameResLogicalExpr checker (currently in the 
> list of potential checkers). All three checkers are all-paths 
> checkers. And unless all three checker's diagnostics are presented to 
> the user at the same time, the user will be unable to understand the 
> bug report.
>
> I believe that we should really solve this problem somehow (or explain 
> to me why it isn't a problem) before we try to introduce a massive 
> increase in the number of CFG-based checkers we have (currently 1).
Looks like I have used wrong words to describe what I wanted to do with 
CFG. Initially, I wanted to support CFG search as well but quicky 
realized nobody just needs it; however, some rudiments are still 
remaining in the code. Instead, I meant exposing CFG analyses in a 
user-friendly way, like it is done in TestAfterDivZeroV2. I think it 
makes much more sense because CFG analyses like reachability or 
postdomination can be used for FP suppression in the checkers and in the 
analyzer engine.

>
>> I am not sure if this is a good idea. I think the checker author 
>> supposed to
>> be aware if the check is AST based, flow sensitive, or path sensitive.
>> And this should also be easy to tell by reading the code. I am also not
>> sure whether there is an abstraction that fits all.
>>
>> I think the reason why this idea works well for checks that only 
>> inspect the
>> exploded graph, because in this case we are still doing pattern 
>> matching on
>> graphs in a similar manner as doing pattern matching on the AST.
>>
>> But does this generalize later on to stateful checks?
>>
>>
>>     2. AST matchers effectively help clients to avoid a lot of
>>     checking of
>>     dyn_cast results. This feature not only makes them more
>>     convenient but
>>     also more safe because, in my experience, forgetting a nullptr/None
>>     check is a pretty common mistake for checker writers.
>>
>>
>> I think if something cannot be null we should return references, 
>> otherwise
>> the client need to check for the pointer beeing null (unless there 
>> are some
>> dynamic invariant that would make the check redundant). If an API does
>> not match this philosophy, maybe we should fix the API first.
>>
>> Regardless of fixing the API, I agree that it would be great to have 
>> higher
>> level APIs for checker authors.
>>
>>
>>     3. Testing of AST matchers don't require writing C++ code - it
>>     can be
>>     done interactively with clang-query tool. And I believe that we need
>>     similar functionality for CSA as well.
>>
>>
>> Having a query tool for exploded graphs could be very helpful, I agree.
>> These graphs tend to be very large and sometimes it is not trivial to
>> find the relevant parts during debugging.
>>
>>
>>     I didn't want to replace existing checker API. Instead, I tried
>>     to make
>>     new possibilities usable independently or together with existing.
>>
>>     --------- Design and implementation ---------
>>
>>     The implementation consisted of a number of steps.
>>
>>     1. Allow matching nodes of path-sensitive graph like usual AST
>>     nodes. To
>>     make this possible, three actions were needed:
>>       1.1. ASTTypeTraits and DynTypedNode were extended to support
>>     path-sensitive nodes: ExplodedNode, ProgramState, SVal, SymExpr,
>>     etc.
>>
>>
>> I wonder, if in general it is a good idea to pattern match in checks 
>> on SymExpr.
>> A more general approach here would be to use the constraint solver 
>> for queries
>> in a similar manner how ProgramState::assume works.
>>
>>     The implementation with graph node support is moved into a separate
>>     class (ASTGraphTypeTraits) in ento namespace to resolve cyclic
>>     dependencies (they are still exist, unfortunately, but are
>>     header-only,
>>     so we can build the PoC).
>>       1.2. Some additions to AST matchers were made to add support
>>     for new
>>     kinds of nodes.
>>       1.3. To make MatchFinder able to query specific options not
>>     supported
>>     by pure AST, it was augmented with "Contexts". A matcher that
>>     needs to
>>     query the path-sensitive engine asks the Finder for the required
>>     Context
>>     which provides specific helper functions.
>>
>>     As the result of this step, we are able to write matchers like
>>     expr(hasArgument(0, hasValue(stringRegion(refersString("/"))))).
>>
>
> Yeah, we'll also eventually need operators over symbols, eg.:
>
>     arraySubscriptExpr(hasSubscript(expr(hasSVal(sym(isLessThan(
> sym(equalsBoundNode("array_size")))).bind("array_subscript")))))
There is nothing impossible here - such expressions should be easy to 
write with augmented AST matchers used in this PoC.
>
> Still, SVal matchers are something i always wanted. I even made SVal 
> visitors. The latter turned out to be pretty useless though :/
>
>> I think this is the part of the proposal that I like the most, it 
>> would be
>> a very concise way to write down guarding conditions.
>>
>>
>>     2. Create an engine for graph exploration and matching.
>>        Unlike normal AST matchers, hasSequence is a path-sensitive
>>     matcher.
>>     It is launched against ExplodedGraph. These matchers are handled by
>>     GraphMatchFinder object. While searching a graph, it collects
>>     matches.
>>     Each match contains a pointer to the corresponding matcher and
>>     State ID
>>     of this match. The way how matches are translated from one state
>>     ID to
>>     another is determined by matcher operators.
>>
>>
>> Is this matching done after the exploded graph already built? If so,
>> these checks will be unable to generate sink nodes. I think having sink
>> nodes sometimes desirable even if the check itself does not have a trait.
>
> I'm really curious about how the performance scales with the number of 
> checkers. Is it going to be significantly more expensive to run 100 
> DSL-based checkers on a fully built graph than to run 100 normal 
> checkers as the graph is being built?
No benchmarks => no answer, sorry. My wild guess won't work here :(
>
> This boils down to: by looking at the sub-matcher, will you be able to 
> filter out checkers that don't need to act upon the node as quickly as 
> you do that when calling checker callbacks?

I think it's a hard goal to achieve.
1. To separate matcher callbacks, we need for matchers to expose 
meta-information. Some single matchers also can participate in multiple 
callbacks (anyOf()).
2. To make matchers callable with a matcher which is a part of it, we 
need to re-design the path matchers. They should have a mapping between 
sub-matchers and actions required if a sub-matcher matches.

>
> And this immediately leads to: how difficult would it be to simply 
> auto-generate an old-style checker (with state traits and callbacks) 
> from the new-style matcher? Maybe that's something we should focus on 
> instead of trying to match a fully constructed graph? Maybe it'd also 
> be more powerful, i.e. allow certain imperative side effects whenever 
> partial match is encountered?
We can use matchers inside checker callbacks if we need extended 
functionality. The auto-generation for matchers with side effects is 
non-trivial and needs supporting partial match actions. However, since 
we have BindableMatchers, we can also have ActingMatchers (unsupported 
dynamically) that have `.call()` method.

>>
>>        For example, SequenceVariadicOperator, which is the base of
>>     hasSequence() matcher, has "positive" and "negative" sub-matches.
>>     Each
>>     positive matcher has its corresponding State ID. In order to
>>     advance to
>>     the next State ID, a node being matched should match all "negative"
>>     matchers before the next "positive" matchers and the next "positive"
>>     matcher itself. Failure in matching "negative" matcher discards
>>     the match.
>>
>>        The role of GraphMatchFinder is similar to MatchFinder: it is
>>     only
>>     responsible for graph exploration and keeping bound nodes and
>>     matchers.
>>
>>     3. Manage bound nodes.
>>        While matching a single graph node, BoundNodes from AST
>>     MatchFinder
>>     are used. AST matchers for path-sensitive nodes support bindings
>>     out-of-the-box. However, the situation with graph matching is a bit
>>     different. For graph matching, we have another system of bound
>>     nodes.
>>     Each graph node has a related map of bounds aka GDMTy (yes, the
>>     name is
>>     not a coincidence). GDMTy is a mapping from match ID to
>>     BoundNodesMap
>>     which, in part, is effectively a map from std::string to
>>     DynTypedNodes.
>>     This look pretty much like how GDM is organized in ExplodedGraph,
>>     just
>>     without one level of indirection (it can be added, though).
>>
>>        MatchFinder contexts should allow support of their own
>>     bindings. This
>>     is how equalsBoundNode() matcher works for path-sensitive nodes:
>>     it just
>>     queries all available contexts for the binding.
>>
>>     Finally, I have managed to make two checkers work in this way:
>>     ChrootChecker (which is present in the introduction) and
>>     TestAfterDivZeroChecker. Both them can be found in
>>     ChrootCheckerV2.cpp
>>     and TestAfterDivZeroCheckerV2.cpp correspondingly. They act like
>>     normal
>>     checkers: produce warnings and use visitors.
>>
>
> I wonder if our false positive suppression visitors (or maybe even 
> regular checker bugvisitors) can be rewritten with these matchers. 
> trackNullOrUndefValue() and its system of visitors doesn't look like a 
> matcher at all to me, unfortunately - it can potentially track an 
> infinite amount of events. But every individual visitor within that 
> system probably is.
Hm, I didn't even think about it. Do you think it can give us some 
advantages for BugVisitors?

>
>>     The main difference is that
>>     they cannot generate sinks and use checkEndAnalysis callback. The
>>     code
>>     of these checkers can be found here:
>>
>>     https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp
>>     https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp
>>
>>
>>     -------- Some features --------
>>
>>     1.The design of bindings has an interesting consequence: it
>>     enables the
>>     dynamic introspection of GDM which was pretty hard before. (Hello
>>     alpha
>>     renaming?)
>>     2. Nothing prevents matchers to be used with existing checker API
>>     for
>>     simplifying conditional checking inside callbacks. The matchers
>>     are not
>>     planned as the replacement for the current API, but look like a nice
>>     complementary part.
>>     3. Implicit conversion of Matcher<ProgramPoint> to
>>     Matcher<ExplodedNode>
>>     and Matcher<SymExpr || MemRegion> to Matcher<SVal> for writing
>>     shorter code.
>>
>>     -------- Not implemented/not checked yet --------
>>     I tried to keep the PoC as minimal as possible. As the result, some
>>     features are not implemented yet, but I believe they can be added.
>>     1. Usage of matchers inside checker callbacks
>>        This is not exactly unimplemented, but is still untested.
>>     2. Dynamic matchers (clang-query interface)
>>        In addition to work for supporting dynamic nodes (I don't know
>>     how
>>     many efforts it can take), we need to enable matching against AST
>>     nodes,
>>     not graph. But it doesn't look as a problem, we can just make
>>     matchers
>>     polymorphic.
>>     3. Binding to successfully matched paths is not implemented yet -
>>     only
>>     bindings to separate nodes. I wonder if we need path bindings at all.
>>     4. Some corner cases are still FIXMEs: single-node sequences, for
>>     example.
>>     5. GDM is not based on immutable data structures like it is done
>>     in CSA
>>     - it is just an STL map. Its performance can be poor (full copy
>>     on every
>>     new node), but I don't think that changing it to use immutable
>>     structures is hard.
>>     6. Matching on-the-fly
>>        GraphMatchFinder should support on-the-fly matching during
>>     ExplodedGraph building. For this support, we should just call its
>>     advance() method each time a new node is created. However, this
>>     possibility wasn't checked yet.
>>     7. Matching CFG and CallGraph isn't implemented because it was
>>     considered to be far out of simple PoC.
>>     8. Only sequential matching is available now, and I didn't try to
>>     implement other operators yet. So, implementing a checker similar to
>>     PthreadLock can be tricky or even impossible for now.
>>
>>     -------- Known and potential issues --------
>>      From matchers' side:
>>     1. Some tests don't pass because they rely on the checker ability to
>>     generate sink nodes. Our matchers cannot do it by design so tests
>>     don't
>>     pass.
>>     2. Representing checker bindings as a map can be less effective than
>>     storing data in structures. I didn't measure the overhead, so I
>>     cannot
>>     give any numbers.
>>     3. Matchers are called on every node independently of its type.
>>     This is
>>     not what CheckerManager does. I wonder if this detail can affect
>>     performance as well.
>>     4. Problems with cyclic dependencies. clangStaticAnalyzer has a
>>     dependency on clangASTMatchers, therefore, clangASTMatchers cannot
>>     depend on clangStaticAnalyzer. However, if we want ASTMatchers to
>>     support static analyzer data structures, there should be a backward
>>     dependency. Now this dependency is header-only.
>>     5. Code duplication. This is mostly fine for a sandbox but needs
>>     to be
>>     reduced.
>>
>>      From analyzer's side:
>>     1. Many events are not reflected in the final ExplodedGraph. For
>>     example, checkers can receive PointerEscape event, but the event
>>     itself
>>     will not be recorded in the ExplodedGraph - only changes made by
>>     checkers will.
>>
>
> Weeeell maaaaaybe. This would essentially mean passing a 
> CheckerContext into checkPointerEscape and checkRegionChanges. With a 
> predecessor node having a new sort of ProgramPoint such as 
> "PointerEscapePoint". And we can put the list of escaped stuff 
> directly into the program point instead of passing it separately.
>
> Which means allowing state splits in these callbacks. Which of course 
> can be disallowed later (to keep the code that calls the callbacks 
> simple), and graph matchers won't need them any time soon anyway.
>
> But it'd still be a lot of refactoring because currently these 
> callbacks are called from code that doesn't even know anything about 
> nodes, eg. from within State->invalidateRegions().
Internally, I created a dummy node in CheckerManager methods and then 
launched checker callbacks with it. But this doesn't look like a right way.

>
>>     That's also true for Stmt nodes: I noticed same issues
>>     with PostCondition. This makes matching a bit harder. Should we
>>     add some
>>     information into ExplodedGraph?
>>     2. (Minor) Some static analyzer data structures don't support
>>     traditional LLVM casting methods (dyn_cast, isa) because they lack
>>     classof() method. I have fixed this internally and will put a
>>     patch on
>>     review soon.
>>
>>     -------- Conclusion --------
>>     Finally, there is a graph matching engine supporting basic
>>     functionality
>>     and two checkers using it. I apologize for not starting the
>>     discussion
>>     earlier, but I just wasn't sure that the approach will work. Is
>>     anybody
>>     interested to have this stuff in upstream (maybe, changed or
>>     improved)?
>>     If yes, I'll be happy to contribute this work patch-by-patch and
>>     continue its improvement. If no - well, I had enough fun playing
>>     with it.
>>
>>
>> Thanks for sharing this results. Regardless of being upstreamed as is or
>> in a separate form or not at all, this is an interesting experiment 
>> for the
>> community.
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180523/737ef97d/attachment.html>