<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
On 5/17/18 5:28 AM, Gábor Horváth wrote:<br>
<blockquote type="cite"
cite="mid:CAPRL4a0WncdqW394JFDZ7SdA0+bF0jgFrH7-wnfsN03mFADH9g@mail.gmail.com">
<div dir="ltr">
<div>Hi Alexey,</div>
<div><br>
</div>
<div>In general, I like the idea of having a more declarative
way to define new</div>
<div>checks.<br>
</div>
<div><br>
</div>
<div>
<div class="gmail_quote">
<div dir="ltr">On Thu, 17 May 2018 at 01:37, Alexey Sidorin
<<a href="mailto:alexey.v.sidorin@ya.ru"
moz-do-not-send="true">alexey.v.sidorin@ya.ru</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hello
everyone,<br>
<br>
I'd like to share some results of my investigation
targeted on <br>
improvement of Clang Static Analyzer checker API. You can
find some <br>
previous conversation on this topic here: <br>
<a
href="http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html"
rel="noreferrer" target="_blank" moz-do-not-send="true">http://clang-developers.42468.n3.nabble.com/analyzer-RFC-Design-idea-separate-modelling-from-checking-td4059122.html</a>.
<br>
In my investigation, I tried to solve a particular problem
of building a <br>
checker without generating new nodes.<br>
</blockquote>
<div><br>
</div>
<div>Why do you focus on such checks that does not have
traits, does not generate new nodes. <br>
</div>
<div>Do you find this to be the majority of the checks you
need to implement? <br>
</div>
</div>
</div>
</div>
</blockquote>
<br>
Ok, so as far as i understand, these new checkers don't support
traits because they don't *need* them. Instead, they plan to use
"backreferences" in matchers, i.e. double file descriptor close can
be described as:<br>
<br>
hasSequence(<br>
call(hasName("fclose"), hasArg(0, sym().bind("fd"))),<br>
call(hasName("fclose"), hasArg(0, sym(equalsBoundNode("fd"))))<br>
);<br>
<br>
Right?<br>
<br>
So we can't eliminate the path on which the pointer is null after we
dereference it, like we normally do, we can't introduce state
splits, we can't produce fatal error nodes, but we can still use
finite state machines.<br>
<br>
The real question is, how do we implement RetainCountChecker that
needs to count the potentially infinite number of retains and
releases before emitting the warning, and is therefore not a finite
state machine?^^<br>
<blockquote type="cite"
cite="mid:CAPRL4a0WncdqW394JFDZ7SdA0+bF0jgFrH7-wnfsN03mFADH9g@mail.gmail.com">
<div dir="ltr">
<div>
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
--------- Introduction and design goals ---------<br>
<br>
In brief, I tried to use matchers-like API to make CSA
checkers look <br>
like this:<br>
<br>
StatementMatcher NotChdir =<br>
callExpr(unless(callee(functionDecl(hasName("::chdir")))));<br>
Finder.addMatcher(<br>
hasSequence(<br>
postStmt(hasStatement(<br>
callExpr(callee(functionDecl(hasName("::chroot")))))),<br>
unless(stmtPoint(hasStatement(callExpr(<br>
callee(functionDecl(hasName("::chdir"))),<br>
hasArgument(0,
hasValue(stringRegion(refersString("/")))))))),<br>
explodedNode(anyOf(postStmt(hasStatement(NotChdir)),<br>
callEnter(hasCallExpr(NotChdir))))<br>
.bind("bug_node")),<br>
&Callback);<br>
Finder.match(G);<br>
</blockquote>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
<br>
Aha, so am i understanding correctly that unless() here is not your
normal unless()? The normal unless() clause would have meant<br>
<br>
"somewhere between chroot() and the offending call there's a
node that's not chdir()"<br>
<br>
but what we're trying to say is<br>
<br>
"nowhere between chroot() and the offending call there's a node
that's chdir()"<br>
<br>
?<br>
<blockquote type="cite"
cite="mid:CAPRL4a0WncdqW394JFDZ7SdA0+bF0jgFrH7-wnfsN03mFADH9g@mail.gmail.com">
<div dir="ltr">
<div>
<div class="gmail_quote">
<div>I think, a common (design) pitfall of writing checks is
to try to match against</div>
<div>the AST when a symbolic execution callback should be
used instead.</div>
<div>I am a bit afraid having an API like this would make
this pitfall more common.</div>
<div>Maybe a separation between the path sensitive, flow
sensitive and</div>
<div>AST based checks is good for the checker writers and
new users and</div>
<div>I am not sure that the same abstraction fits all of
these cases.</div>
<div><br>
</div>
<div>In case of path sensitive checks I wonder if a state
machine like abstraction would</div>
<div>fit the use cases better (and also cover checks that
are using traits)</div>
<div>where the guarding conditions of the state transitions
can include AST matcher like checks. <br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
and I have managed to make some simple working examples.<br>
<br>
The entire diff can be found here: <br>
<a
href="https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://github.com/a-sid/clang/commit/9a0b1d1d9b3cf41b551a663f041f54d5427aa72f</a><br>
The code itself: <a
href="https://github.com/a-sid/clang/tree/a4"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://github.com/a-sid/clang/tree/a4</a><br>
<br>
There are several reasons why I have tried this approach.<br>
<br>
1. AST Matchers are already extensively used API for AST
checking. They <br>
are available both in Clang-Tidy and CSA. And I wanted to
use existing <br>
functionality as much as possible. So, I decided to extend
an existing <br>
API to make its usage seamless across different kinds of
checks: <br>
path-sensitive, AST-based and CFG-based.<br>
</blockquote>
</div>
</div>
</div>
</blockquote>
<br>
When we say CFG-based checks, we usually mean "all paths" checks
that find bugs indicated by a certain invariant holding on all paths
through a certain program point(s). Such as TestAfterDivideZero or
DeadStore.<br>
<br>
The problem with CFG-based "all paths" checks that i don't really
know how to solve and that's probably not going to be solved by
giving them a good API is error reporting. Imagine a conversation
with a user:<br>
<br>
A n a l y z e r : Hey this assignment is not used.<br>
<br>
U s e r : But hey, it's used here *points to the code*.<br>
<br>
A n a l y z e r : But this code is dead.<br>
<br>
U s e r : Why so? Suppose 'x' is equal to true, we jump straight
into this code.<br>
<br>
A n a l y z e r : That's because the initializer for 'x' is in
fact always evaluated to false.<br>
<br>
U s e r : WHY t___t<br>
<br>
Etc. The first piece of information is produced by a DeadStore
checker. The second piece of information is produced by some sort of
DeadCode checker (currently alpha). The third piece of information
is produced by some sort of SameResLogicalExpr checker (currently in
the list of potential checkers). All three checkers are all-paths
checkers. And unless all three checker's diagnostics are presented
to the user at the same time, the user will be unable to understand
the bug report.<br>
<br>
I believe that we should really solve this problem somehow (or
explain to me why it isn't a problem) before we try to introduce a
massive increase in the number of CFG-based checkers we have
(currently 1).<br>
<br>
<blockquote type="cite"
cite="mid:CAPRL4a0WncdqW394JFDZ7SdA0+bF0jgFrH7-wnfsN03mFADH9g@mail.gmail.com">
<div dir="ltr">
<div>
<div class="gmail_quote">
<div>I am not sure if this is a good idea. I think the
checker author supposed to</div>
<div>be aware if the check is AST based, flow sensitive, or
path sensitive. <br>
</div>
<div>And this should also be easy to tell by reading the
code. I am also not</div>
<div>sure whether there is an abstraction that fits all.</div>
<div><br>
</div>
<div>I think the reason why this idea works well for checks
that only inspect the</div>
<div>exploded graph, because in this case we are still doing
pattern matching on</div>
<div>graphs in a similar manner as doing pattern matching on
the AST.</div>
<div><br>
</div>
<div>But does this generalize later on to stateful checks? <br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
2. AST matchers effectively help clients to avoid a lot of
checking of <br>
dyn_cast results. This feature not only makes them more
convenient but <br>
also more safe because, in my experience, forgetting a
nullptr/None <br>
check is a pretty common mistake for checker writers.<br>
</blockquote>
<div><br>
</div>
<div>
<div>I think if something cannot be null we should return
references, otherwise</div>
<div>the client need to check for the pointer beeing null
(unless there are some</div>
<div>dynamic invariant that would make the check
redundant). If an API does</div>
<div>not match this philosophy, maybe we should fix the
API first. </div>
<div><br>
</div>
<div>Regardless of fixing the API, I agree that it would
be great to have higher</div>
<div>level APIs for checker authors.</div>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
3. Testing of AST matchers don't require writing C++ code
- it can be <br>
done interactively with clang-query tool. And I believe
that we need <br>
similar functionality for CSA as well.<br>
</blockquote>
<div><br>
</div>
Having a query tool for exploded graphs could be very
helpful, I agree.</div>
<div class="gmail_quote">These graphs tend to be very large
and sometimes it is not trivial to</div>
<div class="gmail_quote">find the relevant parts during
debugging.<br>
</div>
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
I didn't want to replace existing checker API. Instead, I
tried to make <br>
new possibilities usable independently or together with
existing.<br>
<br>
--------- Design and implementation ---------<br>
<br>
The implementation consisted of a number of steps.<br>
<br>
1. Allow matching nodes of path-sensitive graph like usual
AST nodes. To <br>
make this possible, three actions were needed:<br>
1.1. ASTTypeTraits and DynTypedNode were extended to
support <br>
path-sensitive nodes: ExplodedNode, ProgramState, SVal,
SymExpr, etc. <br>
</blockquote>
<div><br>
</div>
<div>I wonder, if in general it is a good idea to pattern
match in checks on SymExpr.</div>
<div>A more general approach here would be to use the
constraint solver for queries</div>
<div>in a similar manner how ProgramState::assume works.<br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
The implementation with graph node support is moved into a
separate <br>
class (ASTGraphTypeTraits) in ento namespace to resolve
cyclic <br>
dependencies (they are still exist, unfortunately, but are
header-only, <br>
so we can build the PoC).<br>
1.2. Some additions to AST matchers were made to add
support for new <br>
kinds of nodes.<br>
1.3. To make MatchFinder able to query specific options
not supported <br>
by pure AST, it was augmented with "Contexts". A matcher
that needs to <br>
query the path-sensitive engine asks the Finder for the
required Context <br>
which provides specific helper functions.<br>
<br>
As the result of this step, we are able to write matchers
like <br>
expr(hasArgument(0,
hasValue(stringRegion(refersString("/"))))).<br>
</blockquote>
</div>
</div>
</div>
</blockquote>
<br>
Yeah, we'll also eventually need operators over symbols, eg.:<br>
<br>
arraySubscriptExpr(hasSubscript(expr(hasSVal(sym(isLessThan(<br>
sym(equalsBoundNode("array_size")))).bind("array_subscript")))))<br>
<br>
Still, SVal matchers are something i always wanted. I even made SVal
visitors. The latter turned out to be pretty useless though :/<br>
<br>
<blockquote type="cite"
cite="mid:CAPRL4a0WncdqW394JFDZ7SdA0+bF0jgFrH7-wnfsN03mFADH9g@mail.gmail.com">
<div dir="ltr">
<div>
<div class="gmail_quote">
<div>I think this is the part of the proposal that I like
the most, it would be</div>
<div>a very concise way to write down guarding conditions.<br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
2. Create an engine for graph exploration and matching.<br>
Unlike normal AST matchers, hasSequence is a
path-sensitive matcher. <br>
It is launched against ExplodedGraph. These matchers are
handled by <br>
GraphMatchFinder object. While searching a graph, it
collects matches. <br>
Each match contains a pointer to the corresponding matcher
and State ID <br>
of this match. The way how matches are translated from one
state ID to <br>
another is determined by matcher operators.<br>
</blockquote>
<div><br>
</div>
<div>Is this matching done after the exploded graph already
built? If so,</div>
<div>these checks will be unable to generate sink nodes. I
think having sink</div>
<div>nodes sometimes desirable even if the check itself does
not have a trait.</div>
</div>
</div>
</div>
</blockquote>
<br>
I'm really curious about how the performance scales with the number
of checkers. Is it going to be significantly more expensive to run
100 DSL-based checkers on a fully built graph than to run 100 normal
checkers as the graph is being built?<br>
<br>
This boils down to: by looking at the sub-matcher, will you be able
to filter out checkers that don't need to act upon the node as
quickly as you do that when calling checker callbacks?<br>
<br>
And this immediately leads to: how difficult would it be to simply
auto-generate an old-style checker (with state traits and callbacks)
from the new-style matcher? Maybe that's something we should focus
on instead of trying to match a fully constructed graph? Maybe it'd
also be more powerful, i.e. allow certain imperative side effects
whenever partial match is encountered?<br>
<blockquote type="cite"
cite="mid:CAPRL4a0WncdqW394JFDZ7SdA0+bF0jgFrH7-wnfsN03mFADH9g@mail.gmail.com">
<div dir="ltr">
<div>
<div class="gmail_quote">
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
For example, SequenceVariadicOperator, which is the
base of <br>
hasSequence() matcher, has "positive" and "negative"
sub-matches. Each <br>
positive matcher has its corresponding State ID. In order
to advance to <br>
the next State ID, a node being matched should match all
"negative" <br>
matchers before the next "positive" matchers and the next
"positive" <br>
matcher itself. Failure in matching "negative" matcher
discards the match.<br>
<br>
The role of GraphMatchFinder is similar to MatchFinder:
it is only <br>
responsible for graph exploration and keeping bound nodes
and matchers.<br>
<br>
3. Manage bound nodes.<br>
While matching a single graph node, BoundNodes from AST
MatchFinder <br>
are used. AST matchers for path-sensitive nodes support
bindings <br>
out-of-the-box. However, the situation with graph matching
is a bit <br>
different. For graph matching, we have another system of
bound nodes. <br>
Each graph node has a related map of bounds aka GDMTy
(yes, the name is <br>
not a coincidence). GDMTy is a mapping from match ID to
BoundNodesMap <br>
which, in part, is effectively a map from std::string to
DynTypedNodes. <br>
This look pretty much like how GDM is organized in
ExplodedGraph, just <br>
without one level of indirection (it can be added,
though).<br>
<br>
MatchFinder contexts should allow support of their own
bindings. This <br>
is how equalsBoundNode() matcher works for path-sensitive
nodes: it just <br>
queries all available contexts for the binding.<br>
<br>
Finally, I have managed to make two checkers work in this
way: <br>
ChrootChecker (which is present in the introduction) and <br>
TestAfterDivZeroChecker. Both them can be found in
ChrootCheckerV2.cpp <br>
and TestAfterDivZeroCheckerV2.cpp correspondingly. They
act like normal <br>
checkers: produce warnings and use visitors.</blockquote>
</div>
</div>
</div>
</blockquote>
<br>
I wonder if our false positive suppression visitors (or maybe even
regular checker bugvisitors) can be rewritten with these matchers.
trackNullOrUndefValue() and its system of visitors doesn't look like
a matcher at all to me, unfortunately - it can potentially track an
infinite amount of events. But every individual visitor within that
system probably is.<br>
<br>
<blockquote type="cite"
cite="mid:CAPRL4a0WncdqW394JFDZ7SdA0+bF0jgFrH7-wnfsN03mFADH9g@mail.gmail.com">
<div dir="ltr">
<div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">The main
difference is that <br>
they cannot generate sinks and use checkEndAnalysis
callback. The code <br>
of these checkers can be found here:<br>
<br>
<a
href="https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/ChrootCheckerV2.cpp</a><br>
<a
href="https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp"
rel="noreferrer" target="_blank" moz-do-not-send="true">https://github.com/a-sid/clang/blob/a4/lib/StaticAnalyzer/Checkers/TestAfterDivZeroCheckerV2.cpp</a><br>
<br>
<br>
-------- Some features --------<br>
<br>
1.The design of bindings has an interesting consequence:
it enables the <br>
dynamic introspection of GDM which was pretty hard before.
(Hello alpha <br>
renaming?)<br>
2. Nothing prevents matchers to be used with existing
checker API for <br>
simplifying conditional checking inside callbacks. The
matchers are not <br>
planned as the replacement for the current API, but look
like a nice <br>
complementary part.<br>
3. Implicit conversion of Matcher<ProgramPoint> to
Matcher<ExplodedNode> <br>
and Matcher<SymExpr || MemRegion> to
Matcher<SVal> for writing shorter code.<br>
<br>
-------- Not implemented/not checked yet --------<br>
I tried to keep the PoC as minimal as possible. As the
result, some <br>
features are not implemented yet, but I believe they can
be added.<br>
1. Usage of matchers inside checker callbacks<br>
This is not exactly unimplemented, but is still
untested.<br>
2. Dynamic matchers (clang-query interface)<br>
In addition to work for supporting dynamic nodes (I
don't know how <br>
many efforts it can take), we need to enable matching
against AST nodes, <br>
not graph. But it doesn't look as a problem, we can just
make matchers <br>
polymorphic.<br>
3. Binding to successfully matched paths is not
implemented yet - only <br>
bindings to separate nodes. I wonder if we need path
bindings at all.<br>
4. Some corner cases are still FIXMEs: single-node
sequences, for example.<br>
5. GDM is not based on immutable data structures like it
is done in CSA <br>
- it is just an STL map. Its performance can be poor (full
copy on every <br>
new node), but I don't think that changing it to use
immutable <br>
structures is hard.<br>
6. Matching on-the-fly<br>
GraphMatchFinder should support on-the-fly matching
during <br>
ExplodedGraph building. For this support, we should just
call its <br>
advance() method each time a new node is created. However,
this <br>
possibility wasn't checked yet.<br>
7. Matching CFG and CallGraph isn't implemented because it
was <br>
considered to be far out of simple PoC.<br>
8. Only sequential matching is available now, and I didn't
try to <br>
implement other operators yet. So, implementing a checker
similar to <br>
PthreadLock can be tricky or even impossible for now.<br>
<br>
-------- Known and potential issues --------<br>
From matchers' side:<br>
1. Some tests don't pass because they rely on the checker
ability to <br>
generate sink nodes. Our matchers cannot do it by design
so tests don't <br>
pass.<br>
2. Representing checker bindings as a map can be less
effective than <br>
storing data in structures. I didn't measure the overhead,
so I cannot <br>
give any numbers.<br>
3. Matchers are called on every node independently of its
type. This is <br>
not what CheckerManager does. I wonder if this detail can
affect <br>
performance as well.<br>
4. Problems with cyclic dependencies. clangStaticAnalyzer
has a <br>
dependency on clangASTMatchers, therefore,
clangASTMatchers cannot <br>
depend on clangStaticAnalyzer. However, if we want
ASTMatchers to <br>
support static analyzer data structures, there should be a
backward <br>
dependency. Now this dependency is header-only.<br>
5. Code duplication. This is mostly fine for a sandbox but
needs to be <br>
reduced.<br>
<br>
From analyzer's side:<br>
1. Many events are not reflected in the final
ExplodedGraph. For <br>
example, checkers can receive PointerEscape event, but the
event itself <br>
will not be recorded in the ExplodedGraph - only changes
made by <br>
checkers will.</blockquote>
</div>
</div>
</div>
</blockquote>
<br>
Weeeell maaaaaybe. This would essentially mean passing a
CheckerContext into checkPointerEscape and checkRegionChanges. With
a predecessor node having a new sort of ProgramPoint such as
"PointerEscapePoint". And we can put the list of escaped stuff
directly into the program point instead of passing it separately.<br>
<br>
Which means allowing state splits in these callbacks. Which of
course can be disallowed later (to keep the code that calls the
callbacks simple), and graph matchers won't need them any time soon
anyway.<br>
<br>
But it'd still be a lot of refactoring because currently these
callbacks are called from code that doesn't even know anything about
nodes, eg. from within State->invalidateRegions().<br>
<br>
<blockquote type="cite"
cite="mid:CAPRL4a0WncdqW394JFDZ7SdA0+bF0jgFrH7-wnfsN03mFADH9g@mail.gmail.com">
<div dir="ltr">
<div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">That's
also true for Stmt nodes: I noticed same issues <br>
with PostCondition. This makes matching a bit harder.
Should we add some <br>
information into ExplodedGraph?<br>
2. (Minor) Some static analyzer data structures don't
support <br>
traditional LLVM casting methods (dyn_cast, isa) because
they lack <br>
classof() method. I have fixed this internally and will
put a patch on <br>
review soon.<br>
<br>
-------- Conclusion --------<br>
Finally, there is a graph matching engine supporting basic
functionality <br>
and two checkers using it. I apologize for not starting
the discussion <br>
earlier, but I just wasn't sure that the approach will
work. Is anybody <br>
interested to have this stuff in upstream (maybe, changed
or improved)? <br>
If yes, I'll be happy to contribute this work
patch-by-patch and <br>
continue its improvement. If no - well, I had enough fun
playing with it.<br>
</blockquote>
<div><br>
</div>
<div>Thanks for sharing this results. Regardless of being
upstreamed as is or</div>
<div>in a separate form or not at all, this is an
interesting experiment for the</div>
<div>community. <br>
</div>
<div> </div>
</div>
</div>
</div>
</blockquote>
<br>
</body>
</html>