[cfe-dev] [analyzer] Proof-of-concept for Matcher-like checker API (graph matchers)

Mon Jul 2 01:39:30 PDT 2018

Hi George,

Sorry for late response.

16.06.2018 05:14, George Karpenkov via cfe-dev пишет:
>
>
>> On May 22, 2018, at 4:37 PM, Alexey Sidorin <alexey.v.sidorin at ya.ru 
>> <mailto:alexey.v.sidorin at ya.ru>> wrote:
>>
>> 22.05.2018 04:35, George Karpenkov пишет:
>>> Hi Alexey,
>>>
>>> As classics say, "Any sufficiently complicated C or Fortran program 
>>> contains an ad-hoc, informally-specified, bug-ridden, slow 
>>> implementation of half of Common Lisp.”
>> Yes,https://xkcd.com/297/is what I was always thinking while writing 
>> new AST matchers. Time-honored tradition :)
>>>
>>> Jokes aside, I think the concept and what you are doing is great,
>>> and we could certainly benefit from declarative matchers.
>>> However, I think the actual implementation and the set of matchers 
>>> and predicates would require quite a bit of bikeshedding.
>> Sure. If we need this stuff - the design details are subject for 
>> discussion and change.
>>
>>>
>>> Are you familiar with similar works in this area?
>>> E.g. Oracle has a Soufflé project doing a similar task: 
>>> http://souffle-lang.org <http://souffle-lang.org/>.
>>> They have found achieving high performance very challenging, and 
>>> IIRC they resort to generating C++ code from the declaratively 
>>> described matcher.
>> My goal was not to achieve extreme performance but not to hurt 
>> analyzer's performance too much. Don't know how far I am from this 
>> target, unfortunately.
>>
>>> Maybe we would have to do the same in tablegen spirit.
>>>
>>> I’m sure there are more such works in the literature.
>> Yes, I know about some researches in this area, but thank you anyway!
>>>
>>>> On May 16, 2018, at 4:37 PM, Alexey Sidorin via cfe-dev 
>>>> <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
>>>>
>>>> Hello everyone,
>>>>
>>>> I'd like to share some results of my investigation targeted on 
>>>> improvement of Clang Static Analyzer checker API. You can find some 
>>>> previous conversation on this topic 
>>>> here:http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html. 
>>>> In my investigation, I tried to solve a particular problem of 
>>>> building a checker without generating new nodes.
>>>
>>> I agree that this is a worthy goal.
>>>
>>>>
>>>> --------- Introduction and design goals ---------
>>>>
>>>> In brief, I tried to use matchers-like API to make CSA checkers 
>>>> look like this:
>>>>
>>>> StatementMatcher NotChdir =
>>>> callExpr(unless(callee(functionDecl(hasName("::chdir")))));
>>>> Finder.addMatcher(
>>>> hasSequence(
>>>> postStmt(hasStatement(
>>>> callExpr(callee(functionDecl(hasName("::chroot")))))),
>>>> unless(stmtPoint(hasStatement(callExpr(
>>>> callee(functionDecl(hasName("::chdir"))),
>>>> hasArgument(0, hasValue(stringRegion(refersString("/")))))))),
>>>> explodedNode(anyOf(postStmt(hasStatement(NotChdir)),
>>>> callEnter(hasCallExpr(NotChdir))))
>>>> .bind("bug_node")),
>>>> &Callback);
>>>> Finder.match(G);
>>>
>>> Bikeshedding: I think it’s much better for the sequence matcher to 
>>> run from start to end,
>>> since that’s how we think about the execution of a program.
>> Not sure that I understand completely what do you mean. The nodes are 
>> written in the same sequence as they should be on the path in 
>> ExplodedGraph.
>
> Yeah, but that doesn’t matter, right? We need to get a good API for 
> checker writers, and the events are generally described in their 
> progression as going forward.
Sorry, I still didn't get you. The events in the matcher are already 
described in the order they should happen to report a bug:
1. Call to `chroot`.
2. Do not call `chdir("/")`.
3. Call anything but `chdir("/").
Or do you mean something else?

> Also “explodedNode” is probably an unfortunate name, and “node” would 
> be better.
Naming is always a subject to change.
>
> I am also confused by the semantics of your list, since it mixes AST 
> and “node” statements.
In this example, statement matchers are implicitly converted to exploded 
node matchers and are applied only to ExplodedNodes that have StmtPoint 
as their ProgramPoint. Initially, I was thinking it is a good idea but 
soon discovered that I was wrong because it causes a mess between all 
kinds of StmtPoint. This implicit conversions is likely to be removed. 
There is also another implicit conversion from ProgramPoint matcher to 
ExplodedNode matcher and I found it to be much more useful.
> Why is only the last matcher wrapped in “node”?
All sub-matchers of `hasSequence()` are matchers on ExplodedNode, but we 
need to emit a warning on the last node, so we need something to bind 
against. The bound node is used to emit a warning on it.

>
>>>
>>>>
>>>> and I have managed to make some simple working examples.
>>>>
>>>> The entire diff can be found 
>>>> here:https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f
>>>> The code itself:https://github.com/a-sid/clang/tree/a4
>>>>
>>>> There are several reasons why I have tried this approach.
>>>>
>>>> 1. AST Matchers are already extensively used API for AST checking. 
>>>> They are available both in Clang-Tidy and CSA. And I wanted to use 
>>>> existing functionality as much as possible. So, I decided to extend 
>>>> an existing API to make its usage seamless across different kinds 
>>>> of checks: path-sensitive, AST-based and CFG-based.
>>>>
>>>> 2. AST matchers effectively help clients to avoid a lot of checking 
>>>> of dyn_cast results. This feature not only makes them more 
>>>> convenient but also more safe because, in my experience, forgetting 
>>>> a nullptr/None check is a pretty common mistake for checker writers.
>>>>
>>>> 3. Testing of AST matchers don't require writing C++ code - it can 
>>>> be done interactively with clang-query tool. And I believe that we 
>>>> need similar functionality for CSA as well.
>>>
>>> Moreover, new matchers could be written without modifying clang. I 
>>> wonder if we should support some kind of a plugin infrastructure 
>>> which support matchers
>>> defined in a text file, e.g. something along the lines of:
>>>
>>> clang -cc1 —analyze -analyzer-checker=core,mymatcher 
>>> -analyzer-load-declarative-checker=mymatcher.txt
>> I planned to reuse dynamic matchers infrastructure and clang-query 
>> for this. clang-query is a command-line tool but there shouldn't be 
>> too much difference where the matcher string we parse comes from. The 
>> problem here is that, AFAIR, using callbacks with dynamic matchers is 
>> impossible. Still it is a very cool tool for quick matcher prototyping.
>>
>>>
>>>>
>>>> I didn't want to replace existing checker API. Instead, I tried to 
>>>> make new possibilities usable independently or together with existing.
>>>>
>>>> --------- Design and implementation ---------
>>>>
>>>> The implementation consisted of a number of steps.
>>>>
>>>> 1. Allow matching nodes of path-sensitive graph like usual AST 
>>>> nodes. To make this possible, three actions were needed:
>>>>  1.1. ASTTypeTraits and DynTypedNode were extended to support 
>>>> path-sensitive nodes: ExplodedNode, ProgramState, SVal, SymExpr, 
>>>> etc. The implementation with graph node support is moved into a 
>>>> separate class (ASTGraphTypeTraits) in ento namespace to resolve 
>>>> cyclic dependencies (they are still exist, unfortunately, but are 
>>>> header-only, so we can build the PoC).
>>>>  1.2. Some additions to AST matchers were made to add support for 
>>>> new kinds of nodes.
>>>>  1.3. To make MatchFinder able to query specific options not 
>>>> supported by pure AST, it was augmented with "Contexts". A matcher 
>>>> that needs to query the path-sensitive engine asks the Finder for 
>>>> the required Context which provides specific helper functions.
>>>>
>>>> As the result of this step, we are able to write matchers like 
>>>> expr(hasArgument(0, hasValue(stringRegion(refersString("/"))))).
>>>>
>>>> 2. Create an engine for graph exploration and matching.
>>>> Unlike normal AST matchers, hasSequence is a path-sensitive 
>>>> matcher. It is launched against ExplodedGraph. These matchers are 
>>>> handled by GraphMatchFinder object. While searching a graph, it 
>>>> collects matches. Each match contains a pointer to the 
>>>> corresponding matcher and State ID of this match. The way how 
>>>> matches are translated from one state ID to another is determined 
>>>> by matcher operators.
>>>>
>>>> For example, SequenceVariadicOperator, which is the base of 
>>>> hasSequence() matcher, has "positive" and "negative" sub-matches. 
>>>> Each positive matcher has its corresponding State ID. In order to 
>>>> advance to the next State ID, a node being matched should match all 
>>>> "negative" matchers before the next "positive" matchers and the 
>>>> next "positive" matcher itself. Failure in matching "negative" 
>>>> matcher discards the match.
>>>>
>>>> The role of GraphMatchFinder is similar to MatchFinder: it is only 
>>>> responsible for graph exploration and keeping bound nodes and matchers.
>>>>
>>>> 3. Manage bound nodes.
>>>> While matching a single graph node, BoundNodes from AST MatchFinder 
>>>> are used. AST matchers for path-sensitive nodes support bindings 
>>>> out-of-the-box. However, the situation with graph matching is a bit 
>>>> different. For graph matching, we have another system of bound 
>>>> nodes. Each graph node has a related map of bounds aka GDMTy (yes, 
>>>> the name is not a coincidence). GDMTy is a mapping from match ID to 
>>>> BoundNodesMap which, in part, is effectively a map from std::string 
>>>> to DynTypedNodes. This look pretty much like how GDM is organized 
>>>> in ExplodedGraph, just without one level of indirection (it can be 
>>>> added, though).
>>>>
>>>> MatchFinder contexts should allow support of their own bindings. 
>>>> This is how equalsBoundNode() matcher works for path-sensitive 
>>>> nodes: it just queries all available contexts for the binding.
>>>>
>>>> Finally, I have managed to make two checkers work in this way: 
>>>> ChrootChecker (which is present in the introduction) and 
>>>> TestAfterDivZeroChecker. Both them can be found in 
>>>> ChrootCheckerV2.cpp and TestAfterDivZeroCheckerV2.cpp 
>>>> correspondingly. They act like normal checkers: produce warnings 
>>>> and use visitors. The main difference is that they cannot generate 
>>>> sinks and use checkEndAnalysis callback. The code of these checkers 
>>>> can be found here:
>>>>
>>>> https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp
>>>> https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp
>>>>
>>>>
>>>> -------- Some features --------
>>>>
>>>> 1.The design of bindings has an interesting consequence: it enables 
>>>> the dynamic introspection of GDM which was pretty hard before. 
>>>> (Hello alpha renaming?)
>>>> 2. Nothing prevents matchers to be used with existing checker API 
>>>> for simplifying conditional checking inside callbacks. The matchers 
>>>> are not planned as the replacement for the current API, but look 
>>>> like a nice complementary part.
>>>> 3. Implicit conversion of Matcher<ProgramPoint> to 
>>>> Matcher<ExplodedNode> and Matcher<SymExpr || MemRegion> to 
>>>> Matcher<SVal> for writing shorter code.
>>>>
>>>> -------- Not implemented/not checked yet --------
>>>> I tried to keep the PoC as minimal as possible. As the result, some 
>>>> features are not implemented yet, but I believe they can be added.
>>>> 1. Usage of matchers inside checker callbacks
>>>> This is not exactly unimplemented, but is still untested.
>>>> 2. Dynamic matchers (clang-query interface)
>>>> In addition to work for supporting dynamic nodes (I don't know how 
>>>> many efforts it can take), we need to enable matching against AST 
>>>> nodes, not graph. But it doesn't look as a problem, we can just 
>>>> make matchers polymorphic.
>>>> 3. Binding to successfully matched paths is not implemented yet - 
>>>> only bindings to separate nodes. I wonder if we need path bindings 
>>>> at all.
>>>> 4. Some corner cases are still FIXMEs: single-node sequences, for 
>>>> example.
>>>> 5. GDM is not based on immutable data structures like it is done in 
>>>> CSA - it is just an STL map. Its performance can be poor (full copy 
>>>> on every new node), but I don't think that changing it to use 
>>>> immutable structures is hard.
>>>> 6. Matching on-the-fly
>>>> GraphMatchFinder should support on-the-fly matching during 
>>>> ExplodedGraph building. For this support, we should just call its 
>>>> advance() method each time a new node is created. However, this 
>>>> possibility wasn't checked yet.
>>>> 7. Matching CFG and CallGraph isn't implemented because it was 
>>>> considered to be far out of simple PoC.
>>>> 8. Only sequential matching is available now, and I didn't try to 
>>>> implement other operators yet. So, implementing a checker similar 
>>>> to PthreadLock can be tricky or even impossible for now.
>>>>
>>>> -------- Known and potential issues --------
>>>> From matchers' side:
>>>> 1. Some tests don't pass because they rely on the checker ability 
>>>> to generate sink nodes. Our matchers cannot do it by design so 
>>>> tests don't pass.
>>>
>>> Actually I would disagree with this one: I think it would be better 
>>> to support that, since many errors are considered fatal.
>>> (but that of course would require running the checkers on a stream 
>>> of events, rather than on the already constructed graph)
>> As I answered to Artem, this is technically possible (but still 
>> undesirable, I think) because matchers fire their given callbacks on 
>> match, and callbacks are much less limited in their possibilities.
>>
>>>
>>>> 2. Representing checker bindings as a map can be less effective 
>>>> than storing data in structures. I didn't measure the overhead, so 
>>>> I cannot give any numbers.
>>>> 3. Matchers are called on every node independently of its type. 
>>>> This is not what CheckerManager does. I wonder if this detail can 
>>>> affect performance as well.
>>>
>>> I would guess it would affect it hugely, and getting the performance 
>>> right would be the biggest challenge for declarative matchers.
>> That's sad. I'll try to measure the overhead but I'm not sure I'll be 
>> able to do it soon.
>>
>>>
>>>> 4. Problems with cyclic dependencies. clangStaticAnalyzer has a 
>>>> dependency on clangASTMatchers, therefore, clangASTMatchers cannot 
>>>> depend on clangStaticAnalyzer. However, if we want ASTMatchers to 
>>>> support static analyzer data structures, there should be a backward 
>>>> dependency. Now this dependency is header-only.
>>>> 5. Code duplication. This is mostly fine for a sandbox but needs to 
>>>> be reduced.
>>>>
>>>> From analyzer's side:
>>>> 1. Many events are not reflected in the final ExplodedGraph. For 
>>>> example, checkers can receive PointerEscape event, but the event 
>>>> itself will not be recorded in the ExplodedGraph - only changes 
>>>> made by checkers will. That's also true for Stmt nodes: I noticed 
>>>> same issues with PostCondition. This makes matching a bit harder. 
>>>> Should we add some information into ExplodedGraph?
>>>> 2. (Minor) Some static analyzer data structures don't support 
>>>> traditional LLVM casting methods (dyn_cast, isa) because they lack 
>>>> classof() method. I have fixed this internally and will put a patch 
>>>> on review soon.
>>>>
>>>> -------- Conclusion --------
>>>> Finally, there is a graph matching engine supporting basic 
>>>> functionality and two checkers using it. I apologize for not 
>>>> starting the discussion earlier, but I just wasn't sure that the 
>>>> approach will work. Is anybody interested to have this stuff in 
>>>> upstream (maybe, changed or improved)? If yes, I'll be happy to 
>>>> contribute this work patch-by-patch and continue its improvement. 
>>>> If no - well, I had enough fun playing with it.
>>>>
>>>> _______________________________________________
>>>> cfe-dev mailing list
>>>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>
>>
>>
>
>
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

-- 
Best regards,
Aleksei Sidorin,
SRR, Samsung Electronics

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180702/6a7df2a5/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/gif
Size: 13168 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180702/6a7df2a5/attachment.gif>