[cfe-dev] Clang analyzer Google Summer of Code ideas/proposals

Samuel Harrington samuel.harrington at mines.sdsmt.edu
Wed Apr 7 15:40:05 PDT 2010


On Thu, Apr 1, 2010 at 4:18 PM, Ted Kremenek <kremenek at apple.com> wrote:
> Hi Sam,
>
> I think these are all great ideas.  Comments inline.
>
> On Mar 25, 2010, at 9:18 AM, Samuel Harrington wrote:
>>
...
>> B) User-made checkers
>>
>> This would provide some sort of easy extension mechanism to the
>> analyzer to allow simple domain-specific checks. I have a couple of
>> ideas of how this would look.
>
> I think having more ways to specify domain-specific checkers would be fantastic.
>
>>
>>
>> 1) The first would be to read and use mygcc [1] rules to detect bugs.
>> I believe this would would only provide simple flow-sensitive
>> analysis, but it looks useful nonetheless. This would require making a
>> pattern matcher to match ast nodes based on a parsed text expression.
>
> This would be extremely useful, and this has been requested a couple times.  It is also a well-scoped project, and I think it would make a great GSoC project.  Part of the work would also involve relaying useful diagnostics to user as well as having acceptable performance.
...
> For work done on the analyzer, I'd prefer GSoC projects that brings a new feature reasonably close to being usable by others.  If your project contains a set of milestones that deliver pieces of great functionality (e.g., mygcc support) on the way towards implementing some bigger feature then the project work is always a net win.  I'd be happy mentoring GSoC work on any of these project ideas as long as it had this kind of trajectory.



Thanks for your comments!

I've submitted a formal proposal to implement mygcc rules, visible at:
http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/samlh/t127053978725

I have also attached it in text form.

Any further comments or questions would be appreciated. My primary
concern is whether the proposed matching method is acceptable. I hope
this proposal is clear, complete, and interesting!

Thanks,
Samuel Harrington
-------------- next part --------------

Samuel Harrington


MyGCC rules for the Clang Analyzer
==================================


Me
--

I am currently a sophomore at the South Dakota School of Mines and Technology studying Mechanical Engineering and Computer Science. I have some C/C++, Perl, Javascript, and Prolog programming experience.

I have been interested in Clang and LLVM since I read an LWN article comment mentioning LLVM, and I've submitted a minor patch to Clang. I have read the Clang mailing list since 2008, and more recently subscibed to the commits list as well. I am most interested in the analyzer component.


Project Summary
---------------

This project would provide an easy extension mechanism to the Clang Analyzer to allow simple domain-specific checks. 

The analyzer would gain the ability to read and use mygcc [1] rules. Mygcc rules are "constrained reachability queries" of the form "from A to B avoid C". Mygcc rules were first implemented as a patch to gcc. The matching rules are in the form of code with %A style wildcards. These rules can be checked in linear time, and provide sufficient flexibility to create useful checks.


Details
-------

Mygcc implements its matching by unparsing the Abstract Syntax Tree of a statement, and matching it textually against the matching rules. The pros of this approach is that you do not need to parse the pattern in order to match against it. This is very useful because the addition of wildcards adds additional ambiguities to an already ambiguous language. The cons are that it depends on the exact pretty-printing of the AST, and requires the user to know what the pretty-printing is. In addition, it does not allow isomorphisms to be detected.

An alternative is to parse the pattern as C/C++ code (with wildcards) and create a bytecode/tree representation to match against. The pros of this approach are that the patterns can be independent of the ast format, and it is easier to implement isomorphisms in this form. The con is primarily that parsing C/C++ with the addition of wildcards is significantly harder than without, and Clang has already devoted significant time to the simpler approach.

Given the alternatives, I have decided it would be best to implement the approach mygcc uses. Rules originally implemented for mygcc may not be exactly interoperable due to differences in ASTs between GCC and Clang, but I believe this approach is the most pragmatic and likely to be effective.

The checker must:

 * find all the "from" matches
 * find all paths from the "from" matches to "to" matches, avoiding matching the "avoid" matches
 * if an avoid rule is an assertion, ensure that the assertion does not match any conditionals along the path

The required components are the Matcher and the Finder. The Matcher will be a class that, given a statement, rules to check, and current wildcard values, finds all the matches. For each match, it also stores what values the wildcards must take on. When matching a rule with already bound wildcards, the Matcher must ensure that if the wildcard references a variable, it has not changed since the last time it was matched.

The Matcher uses a technique called 'unparsed patterns' [2] for the user patterns. Unparsed patterns work by incrementally unparsing the AST. This technique is known to have linear match time, but needs parenthesis in rare situations. Despite this limitation, it is substantially easier than other approaches, both to use (no need to limit syntax to a subset), and to implement (no need to parse the pattern).

The Finder uses the Matcher and the following algorithm to find the bugs:
 1. Find all "from" matches
 2. For each "from" match, find all paths that end by matching the "to" rule, without matching the "avoid" rules.

The Matcher will use the already existing pretty-printer. Further details on the matching algorithm can be found in [2]. The Finder will use the existing flow-sensitive analysis infrastructure. Further details on its algorithm can be found at [3].


Milestones
----------

Dates TBD

1. Be able to match expressions with no wildcards

2. Be able to match rules with only "from" parts and no wildcards

3. Be able to match rules with only "from" parts with anonymous wildcards or named wildcards only appearing once

4. Be able to match rules with only "from" parts with named wildcards appearing multiple times

5. Be able to match rules with only "from", "to", and "avoid" parts with named wildcards appearing multiple times, but not including positive or negative assertions

6. Be able to match rules with only "from", "to", and "avoid" parts with named wildcards appearing multiple times, with assertions in the "avoid" parts


MyGCC Rule example
------------------

condate double_lock {
  from "lock(%L)" to "lock(%L)" avoid "unlock(%L)"
} warning("Lock L is locked twice");


References
----------

[1] http://mygcc.free.fr/

[2] http://mypatterns.free.fr/unparsed/unparsed-pepm08-extended.pdf

[3] http://mygcc.free.fr/condate-asej07.pdf


More information about the cfe-dev mailing list