[cfe-dev] [analyzer] Proof-of-concept for Matcher-like checker API (graph matchers)

Wed Jul 18 22:32:44 PDT 2018

Hi George,

Unfortunately, I had to suspend the work on this. I hope I return to it 
later, but for now it is not completed yet. Do you want to see a 
work-in-progress patch on Phabricator?
If you just want to take a quick look, you can see the progress on 
Github: https://github.com/a-sid/clang/commits/a4-counting-matches

02.07.2018 21:34, George Karpenkov пишет:
>
>
>> On Jul 2, 2018, at 1:39 AM, Aleksei Sidorin <a.sidorin at samsung.com 
>> <mailto:a.sidorin at samsung.com>> wrote:
>>
>> Hi George,
>>
>> Sorry for late response.
>>
>> 16.06.2018 05:14, George Karpenkov via cfe-dev пишет:
>>>
>>>
>>>> On May 22, 2018, at 4:37 PM, Alexey Sidorin <alexey.v.sidorin at ya.ru 
>>>> <mailto:alexey.v.sidorin at ya.ru>> wrote:
>>>>
>>>> 22.05.2018 04:35, George Karpenkov пишет:
>>>>> Hi Alexey,
>>>>>
>>>>> As classics say, "Any sufficiently 
>>>>> complicated C or Fortran program contains an ad-hoc, 
>>>>> informally-specified, bug-ridden, slow implementation of half 
>>>>> of Common Lisp.”
>>>> Yes,https://xkcd.com/297/is what I was always thinking while 
>>>> writing new AST matchers. Time-honored tradition :)
>>>>>
>>>>> Jokes aside, I think the concept and what you are doing is great,
>>>>> and we could certainly benefit from declarative matchers.
>>>>> However, I think the actual implementation and the set of matchers 
>>>>> and predicates would require quite a bit of bikeshedding.
>>>> Sure. If we need this stuff - the design details are subject for 
>>>> discussion and change.
>>>>
>>>>>
>>>>> Are you familiar with similar works in this area?
>>>>> E.g. Oracle has a Soufflé project doing a similar task: 
>>>>> http://souffle-lang.org <http://souffle-lang.org/>.
>>>>> They have found achieving high performance very challenging, and 
>>>>> IIRC they resort to generating C++ code from the declaratively 
>>>>> described matcher.
>>>> My goal was not to achieve extreme performance but not to hurt 
>>>> analyzer's performance too much. Don't know how far I am from this 
>>>> target, unfortunately.
>>>>
>>>>> Maybe we would have to do the same in tablegen spirit.
>>>>>
>>>>> I’m sure there are more such works in the literature.
>>>> Yes, I know about some researches in this area, but thank you anyway!
>>>>>
>>>>>> On May 16, 2018, at 4:37 PM, Alexey Sidorin via cfe-dev 
>>>>>> <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
>>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> I'd like to share some results of my investigation targeted on 
>>>>>> improvement of Clang Static Analyzer checker API. You can find 
>>>>>> some previous conversation on this topic 
>>>>>> here:http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html. 
>>>>>> In my investigation, I tried to solve a particular problem of 
>>>>>> building a checker without generating new nodes.
>>>>>
>>>>> I agree that this is a worthy goal.
>>>>>
>>>>>>
>>>>>> --------- Introduction and design goals ---------
>>>>>>
>>>>>> In brief, I tried to use matchers-like API to make CSA checkers 
>>>>>> look like this:
>>>>>>
>>>>>> StatementMatcher NotChdir =
>>>>>> callExpr(unless(callee(functionDecl(hasName("::chdir")))));
>>>>>> Finder.addMatcher(
>>>>>> hasSequence(
>>>>>> postStmt(hasStatement(
>>>>>> callExpr(callee(functionDecl(hasName("::chroot")))))),
>>>>>> unless(stmtPoint(hasStatement(callExpr(
>>>>>> callee(functionDecl(hasName("::chdir"))),
>>>>>> hasArgument(0, hasValue(stringRegion(refersString("/")))))))),
>>>>>> explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
>>>>>> callEnter(hasCallExpr(NotChdir))))
>>>>>> .bind("bug_node")),
>>>>>> &Callback);
>>>>>> Finder.match(G);
>>>>>
>>>>> Bikeshedding: I think it’s much better for the sequence matcher to 
>>>>> run from start to end,
>>>>> since that’s how we think about the execution of a program.
>>>> Not sure that I understand completely what do you mean. The nodes 
>>>> are written in the same sequence as they should be on the path in 
>>>> ExplodedGraph.
>>>
>>> Yeah, but that doesn’t matter, right? We need to get a good API for 
>>> checker writers, and the events are generally described in their 
>>> progression as going forward.
>> Sorry, I still didn't get you. The events in the matcher are already 
>> described in the order they should happen to report a bug:
>> 1. Call to `chroot`.
>> 2. Do not call `chdir("/")`.
>> 3. Call anything but `chdir("/").
>> Or do you mean something else?
>
> The description is helpful.
> I’ve meant that when designing a DSL for describing checkers we should 
> think about the most convenient way to describe checkers:
> and checks are normally described going forward (as you just did), not 
> going backward.
>
> Overall, sounds very interesting, do you plan about posting a patch on 
> Phabricator?
>
> What makes me really excited about this is that with such a DSL we 
> could then load checkers from text files at runtime,
> which would give our users a much easier way to write checkers.
>
>>
>>> Also “explodedNode” is probably an unfortunate name, and “node” 
>>> would be better.
>> Naming is always a subject to change.
>>>
>>> I am also confused by the semantics of your list, since it mixes AST 
>>> and “node” statements.
>> In this example, statement matchers are implicitly converted to 
>> exploded node matchers and are applied only to ExplodedNodes that 
>> have StmtPoint as their ProgramPoint. Initially, I was thinking it is 
>> a good idea but soon discovered that I was wrong because it causes a 
>> mess between all kinds of StmtPoint. This implicit conversions is 
>> likely to be removed. There is also another implicit conversion from 
>> ProgramPoint matcher to ExplodedNode matcher and I found it to be 
>> much more useful.
>>> Why is only the last matcher wrapped in “node”?
>> All sub-matchers of `hasSequence()` are matchers on ExplodedNode, but 
>> we need to emit a warning on the last node, so we need something to 
>> bind against. The bound node is used to emit a warning on it.
>>
>>>
>>>>>
>>>>>>
>>>>>> and I have managed to make some simple working examples.
>>>>>>
>>>>>> The entire diff can be found 
>>>>>> here:https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f
>>>>>> The code itself:https://github.com/a-sid/clang/tree/a4
>>>>>>
>>>>>> There are several reasons why I have tried this approach.
>>>>>>
>>>>>> 1. AST Matchers are already extensively used API for AST 
>>>>>> checking. They are available both in Clang-Tidy and CSA. And I 
>>>>>> wanted to use existing functionality as much as possible. So, I 
>>>>>> decided to extend an existing API to make its usage seamless 
>>>>>> across different kinds of checks: path-sensitive, AST-based and 
>>>>>> CFG-based.
>>>>>>
>>>>>> 2. AST matchers effectively help clients to avoid a lot of 
>>>>>> checking of dyn_cast results. This feature not only makes them 
>>>>>> more convenient but also more safe because, in my experience, 
>>>>>> forgetting a nullptr/None check is a pretty common mistake for 
>>>>>> checker writers.
>>>>>>
>>>>>> 3. Testing of AST matchers don't require writing C++ code - it 
>>>>>> can be done interactively with clang-query tool. And I believe 
>>>>>> that we need similar functionality for CSA as well.
>>>>>
>>>>> Moreover, new matchers could be written without modifying clang. I 
>>>>> wonder if we should support some kind of a plugin infrastructure 
>>>>> which support matchers
>>>>> defined in a text file, e.g. something along the lines of:
>>>>>
>>>>> clang -cc1 —analyze -analyzer-checker=core,mymatcher 
>>>>> -analyzer-load-declarative-checker=mymatcher.txt
>>>> I planned to reuse dynamic matchers infrastructure and clang-query 
>>>> for this. clang-query is a command-line tool but there shouldn't be 
>>>> too much difference where the matcher string we parse comes from. 
>>>> The problem here is that, AFAIR, using callbacks with dynamic 
>>>> matchers is impossible. Still it is a very cool tool for quick 
>>>> matcher prototyping.
>>>>
>>>>>
>>>>>>
>>>>>> I didn't want to replace existing checker API. Instead, I tried 
>>>>>> to make new possibilities usable independently or together with 
>>>>>> existing.
>>>>>>
>>>>>> --------- Design and implementation ---------
>>>>>>
>>>>>> The implementation consisted of a number of steps.
>>>>>>
>>>>>> 1. Allow matching nodes of path-sensitive graph like usual AST 
>>>>>> nodes. To make this possible, three actions were needed:
>>>>>>  1.1. ASTTypeTraits and DynTypedNode were extended to support 
>>>>>> path-sensitive nodes: ExplodedNode, ProgramState, SVal, SymExpr, 
>>>>>> etc. The implementation with graph node support is moved into a 
>>>>>> separate class (ASTGraphTypeTraits) in ento namespace to resolve 
>>>>>> cyclic dependencies (they are still exist, unfortunately, but are 
>>>>>> header-only, so we can build the PoC).
>>>>>>  1.2. Some additions to AST matchers were made to add support for 
>>>>>> new kinds of nodes.
>>>>>>  1.3. To make MatchFinder able to query specific options not 
>>>>>> supported by pure AST, it was augmented with "Contexts". A 
>>>>>> matcher that needs to query the path-sensitive engine asks the 
>>>>>> Finder for the required Context which provides specific helper 
>>>>>> functions.
>>>>>>
>>>>>> As the result of this step, we are able to write matchers like 
>>>>>> expr(hasArgument(0, hasValue(stringRegion(refersString("/"))))).
>>>>>>
>>>>>> 2. Create an engine for graph exploration and matching.
>>>>>> Unlike normal AST matchers, hasSequence is a path-sensitive 
>>>>>> matcher. It is launched against ExplodedGraph. These matchers are 
>>>>>> handled by GraphMatchFinder object. While searching a graph, it 
>>>>>> collects matches. Each match contains a pointer to the 
>>>>>> corresponding matcher and State ID of this match. The way how 
>>>>>> matches are translated from one state ID to another is determined 
>>>>>> by matcher operators.
>>>>>>
>>>>>> For example, SequenceVariadicOperator, which is the base of 
>>>>>> hasSequence() matcher, has "positive" and "negative" sub-matches. 
>>>>>> Each positive matcher has its corresponding State ID. In order to 
>>>>>> advance to the next State ID, a node being matched should match 
>>>>>> all "negative" matchers before the next "positive" matchers and 
>>>>>> the next "positive" matcher itself. Failure in matching 
>>>>>> "negative" matcher discards the match.
>>>>>>
>>>>>> The role of GraphMatchFinder is similar to MatchFinder: it is 
>>>>>> only responsible for graph exploration and keeping bound nodes 
>>>>>> and matchers.
>>>>>>
>>>>>> 3. Manage bound nodes.
>>>>>> While matching a single graph node, BoundNodes from AST 
>>>>>> MatchFinder are used. AST matchers for path-sensitive nodes 
>>>>>> support bindings out-of-the-box. However, the situation with 
>>>>>> graph matching is a bit different. For graph matching, we have 
>>>>>> another system of bound nodes. Each graph node has a related map 
>>>>>> of bounds aka GDMTy (yes, the name is not a coincidence). GDMTy 
>>>>>> is a mapping from match ID to BoundNodesMap which, in part, is 
>>>>>> effectively a map from std::string to DynTypedNodes. This look 
>>>>>> pretty much like how GDM is organized in ExplodedGraph, just 
>>>>>> without one level of indirection (it can be added, though).
>>>>>>
>>>>>> MatchFinder contexts should allow support of their own bindings. 
>>>>>> This is how equalsBoundNode() matcher works for path-sensitive 
>>>>>> nodes: it just queries all available contexts for the binding.
>>>>>>
>>>>>> Finally, I have managed to make two checkers work in this way: 
>>>>>> ChrootChecker (which is present in the introduction) and 
>>>>>> TestAfterDivZeroChecker. Both them can be found in 
>>>>>> ChrootCheckerV2.cpp and TestAfterDivZeroCheckerV2.cpp 
>>>>>> correspondingly. They act like normal checkers: produce warnings 
>>>>>> and use visitors. The main difference is that they cannot 
>>>>>> generate sinks and use checkEndAnalysis callback. The code of 
>>>>>> these checkers can be found here:
>>>>>>
>>>>>> https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp
>>>>>> https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp
>>>>>>
>>>>>>
>>>>>> -------- Some features --------
>>>>>>
>>>>>> 1.The design of bindings has an interesting consequence: it 
>>>>>> enables the dynamic introspection of GDM which was pretty hard 
>>>>>> before. (Hello alpha renaming?)
>>>>>> 2. Nothing prevents matchers to be used with existing checker API 
>>>>>> for simplifying conditional checking inside callbacks. The 
>>>>>> matchers are not planned as the replacement for the current API, 
>>>>>> but look like a nice complementary part.
>>>>>> 3. Implicit conversion of Matcher<ProgramPoint> to 
>>>>>> Matcher<ExplodedNode> and Matcher<SymExpr || MemRegion> to 
>>>>>> Matcher<SVal> for writing shorter code.
>>>>>>
>>>>>> -------- Not implemented/not checked yet --------
>>>>>> I tried to keep the PoC as minimal as possible. As the result, 
>>>>>> some features are not implemented yet, but I believe they can be 
>>>>>> added.
>>>>>> 1. Usage of matchers inside checker callbacks
>>>>>> This is not exactly unimplemented, but is still untested.
>>>>>> 2. Dynamic matchers (clang-query interface)
>>>>>> In addition to work for supporting dynamic nodes (I don't know 
>>>>>> how many efforts it can take), we need to enable matching against 
>>>>>> AST nodes, not graph. But it doesn't look as a problem, we can 
>>>>>> just make matchers polymorphic.
>>>>>> 3. Binding to successfully matched paths is not implemented yet - 
>>>>>> only bindings to separate nodes. I wonder if we need path 
>>>>>> bindings at all.
>>>>>> 4. Some corner cases are still FIXMEs: single-node sequences, for 
>>>>>> example.
>>>>>> 5. GDM is not based on immutable data structures like it is done 
>>>>>> in CSA - it is just an STL map. Its performance can be poor (full 
>>>>>> copy on every new node), but I don't think that changing it to 
>>>>>> use immutable structures is hard.
>>>>>> 6. Matching on-the-fly
>>>>>> GraphMatchFinder should support on-the-fly matching during 
>>>>>> ExplodedGraph building. For this support, we should just call its 
>>>>>> advance() method each time a new node is created. However, this 
>>>>>> possibility wasn't checked yet.
>>>>>> 7. Matching CFG and CallGraph isn't implemented because it was 
>>>>>> considered to be far out of simple PoC.
>>>>>> 8. Only sequential matching is available now, and I didn't try to 
>>>>>> implement other operators yet. So, implementing a checker similar 
>>>>>> to PthreadLock can be tricky or even impossible for now.
>>>>>>
>>>>>> -------- Known and potential issues --------
>>>>>> From matchers' side:
>>>>>> 1. Some tests don't pass because they rely on the checker ability 
>>>>>> to generate sink nodes. Our matchers cannot do it by design so 
>>>>>> tests don't pass.
>>>>>
>>>>> Actually I would disagree with this one: I think it would be 
>>>>> better to support that, since many errors are considered fatal.
>>>>> (but that of course would require running the checkers on a stream 
>>>>> of events, rather than on the already constructed graph)
>>>> As I answered to Artem, this is technically possible (but still 
>>>> undesirable, I think) because matchers fire their given callbacks 
>>>> on match, and callbacks are much less limited in their possibilities.
>>>>
>>>>>
>>>>>> 2. Representing checker bindings as a map can be less effective 
>>>>>> than storing data in structures. I didn't measure the overhead, 
>>>>>> so I cannot give any numbers.
>>>>>> 3. Matchers are called on every node independently of its type. 
>>>>>> This is not what CheckerManager does. I wonder if this detail can 
>>>>>> affect performance as well.
>>>>>
>>>>> I would guess it would affect it hugely, and getting the 
>>>>> performance right would be the biggest challenge for declarative 
>>>>> matchers.
>>>> That's sad. I'll try to measure the overhead but I'm not sure I'll 
>>>> be able to do it soon.
>>>>
>>>>>
>>>>>> 4. Problems with cyclic dependencies. clangStaticAnalyzer has a 
>>>>>> dependency on clangASTMatchers, therefore, clangASTMatchers 
>>>>>> cannot depend on clangStaticAnalyzer. However, if we want 
>>>>>> ASTMatchers to support static analyzer data structures, there 
>>>>>> should be a backward dependency. Now this dependency is header-only.
>>>>>> 5. Code duplication. This is mostly fine for a sandbox but needs 
>>>>>> to be reduced.
>>>>>>
>>>>>> From analyzer's side:
>>>>>> 1. Many events are not reflected in the final ExplodedGraph. For 
>>>>>> example, checkers can receive PointerEscape event, but the event 
>>>>>> itself will not be recorded in the ExplodedGraph - only changes 
>>>>>> made by checkers will. That's also true for Stmt nodes: I noticed 
>>>>>> same issues with PostCondition. This makes matching a bit harder. 
>>>>>> Should we add some information into ExplodedGraph?
>>>>>> 2. (Minor) Some static analyzer data structures don't support 
>>>>>> traditional LLVM casting methods (dyn_cast, isa) because they 
>>>>>> lack classof() method. I have fixed this internally and will put 
>>>>>> a patch on review soon.
>>>>>>
>>>>>> -------- Conclusion --------
>>>>>> Finally, there is a graph matching engine supporting basic 
>>>>>> functionality and two checkers using it. I apologize for not 
>>>>>> starting the discussion earlier, but I just wasn't sure that the 
>>>>>> approach will work. Is anybody interested to have this stuff in 
>>>>>> upstream (maybe, changed or improved)? If yes, I'll be happy to 
>>>>>> contribute this work patch-by-patch and continue its improvement. 
>>>>>> If no - well, I had enough fun playing with it.
>>>>>>
>>>>>> _______________________________________________
>>>>>> cfe-dev mailing list
>>>>>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> cfe-dev mailing list
>>> cfe-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>>
>> -- 
>> Best regards,
>> Aleksei Sidorin,
>> SRR, Samsung Electronics
>>
>> <Mail Attachment.gif>
>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180719/d9fdae6f/attachment.html>