[cfe-dev] Clang analyzer Google Summer of Code ideas/proposals

Thu Apr 1 15:18:39 PDT 2010

Hi Sam,

I think these are all great ideas.  Comments inline.

On Mar 25, 2010, at 9:18 AM, Samuel Harrington wrote:

> Hello,
> 
> I am interested in doing a project with Clang in the upcoming Google
> Summer of Code. I am currently a sophomore at the South Dakota School
> of Mines and Technology, and I have some C++, Perl, and Javascript
> programming experience. I have been interested in Clang and LLVM for a
> while, and I've looked through some of the code before. I am most
> interested in the analyzer component though.
> 
> 
> I have two possible project ideas I am interested in:
> 
> 
> A) Bug database
> 
> Create a tool to store bugs and track changes over time.
> 
> This tool would use the XML analyzer output and the CIndex library to
> correlate bugs over multiple runs. The tool would provide, at a
> minimum, a diff-like output given a pair of runs. Ideally, this would
> create and update a database with all the runs, and statuses for all
> the bugs (uninspected, false positive, verified, fixed). The tool
> would provide reports with chosen subsets of the bugs and annotations
> such as first run present and current status. The reports could be
> html output, reusing the existing infrastructure, or be viewable in a
> gui application.
> 
> The database could be XML, SQLite, or some plain-text format. I am
> unsure whether this tool should be integrated into the clang binary,
> be a separate executable, or even use a scripting language like
> Python. However it is implemented, it would be integrated into
> scan-build/scan-view.
> 
> I am interested in this project because it would make using the
> analyzer easier for larger projects. The diff output could be used as
> a regression finder or fix checker. The database would allow users to
> keep track of bugs better, and to provide statistics of bugs over
> time.

I think a bug database would be extremely useful, and the ability to correlate analysis results across runs would be really powerful.

Ideally the infrastructure for a bug database would be split into "backend" and "frontend" pieces, where the backend would be the core logic for processing results across runs and the frontend would be the integration into something like scan-view.  This decoupling allows the database to potentially be reused in other contexts, e.g. a Trac plugin.

One tricky aspect is correlating analysis results across an evolving codebase.  The code surrounding a bug may change while the bug itself stays the same.  This is an arbitrarily complicated problem, but correlation across runs should at least not be overly sensitive to line number changes, etc.
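
As a rough illustration of what "not overly sensitive to line number changes" could mean, here is a minimal Python sketch.  It assumes each analyzer report has already been reduced to a small record; the Report type, its fields, and the fingerprint/diff_runs helpers are all hypothetical and are not part of Clang or scan-build.  The idea is to key each report on things that tend to survive unrelated edits (bug type, file, enclosing function, whitespace-normalized text of the flagged line) and to leave the line number out of the key.

```python
import hashlib
from typing import List, NamedTuple, Tuple


class Report(NamedTuple):
    """Hypothetical, simplified view of one analyzer report."""
    bug_type: str     # e.g. "Dead store"
    file: str         # path of the file containing the bug
    function: str     # name of the enclosing function
    line: int         # kept for display, deliberately NOT fingerprinted
    source_text: str  # text of the flagged source line


def fingerprint(report: Report) -> str:
    """Hash the fields expected to survive unrelated edits to the file."""
    normalized = " ".join(report.source_text.split())  # collapse whitespace
    key = "\x00".join(
        [report.bug_type, report.file, report.function, normalized]
    )
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def diff_runs(
    old: List[Report], new: List[Report]
) -> Tuple[List[Report], List[Report], List[Report]]:
    """Return (fixed, introduced, persisting) reports between two runs."""
    old_by_fp = {fingerprint(r): r for r in old}
    new_by_fp = {fingerprint(r): r for r in new}
    fixed = [r for fp, r in old_by_fp.items() if fp not in new_by_fp]
    introduced = [r for fp, r in new_by_fp.items() if fp not in old_by_fp]
    persisting = [r for fp, r in new_by_fp.items() if fp in old_by_fp]
    return fixed, introduced, persisting
```

This is only a starting point: two identical reports on identical lines within the same function collapse into one fingerprint, and renaming a function or file makes an old bug look new.  A real backend would need something smarter for those cases.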

> B) User-made checkers
> 
> This would provide some sort of easy extension mechanism to the
> analyzer to allow simple domain-specific checks. I have a couple of
> ideas of how this would look.

I think having more ways to specify domain-specific checkers would be fantastic.

> 
> 
> 1) The first would be to read and use mygcc [1] rules to detect bugs.
> I believe this would only provide simple flow-sensitive
> analysis, but it looks useful nonetheless. This would require making a
> pattern matcher to match ast nodes based on a parsed text expression.

This would be extremely useful, and it has been requested a couple of times.  It is also a well-scoped project, and I think it would make a great GSoC project.  Part of the work would also involve relaying useful diagnostics to the user as well as achieving acceptable performance.

> 2) Second, would be an interface to the analysis engines from a
> scripting language, perhaps python. This would be more complicated to
> use than mygcc, but likely more useful. For example, a check to make
> sure open has a third parameter if the CREATE flag is present is very
> simple given a scripting language, but impossible using mygcc rules
> [2].
> 
> If I was to do this project, I would likely try to do the second idea
> first, and if time permits, write a mygcc matcher in the scripting
> language. Implementing mygcc rules in the scripting language would
> provide a good test of the interface completeness.
> 
> I am interested in this because the clang analyzer could be easily
> extended with domain specific checks. For example, specialized locking
> rules could be checked using mygcc rules. A trickier example [3] would
> be to make sure a llvm::StringRef is not assigned a std::string that
> goes out of scope before it. This would be possible using a scripting
> language binding, and easier than modifying the Clang source. These
> types of checks are already being implemented in Clang, but it is
> infeasible for specialized checks for arbitrary given projects to be
> embedded. This project would be a way around the problem.

This is a far more ambitious project than the mygcc support.  As you say, this has the potential to have a lot of impact, but a couple of concerns come to mind that might make this much bigger than a GSoC project:

1) The internal interface between the analyzer and the external plugin support would need to be well-defined.

2) What do you expose at the higher level?  There is both syntactic information (the ASTs) and semantic information (analysis state) that can be exposed to a checker.  Both sets of information are currently available to C++ checks that derive from the Checker class, and to build great checks both would need to be exposed at a higher level.  There is a lot of information to expose just for the llvm::StringRef check.

3) Performance.  The analyzer is very compute-intensive; will path-sensitive checks written in an external scripting language be too slow in practice when analyzing moderate to large codebases?  (this isn't a conclusion, just an open question)

4) Lots of infrastructure details including data management, etc., between the analysis core and the external checker.
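
To make the syntactic half of concern 2 a little more concrete, here is a minimal Python sketch of the open() check mentioned in the quoted proposal.  It assumes the analyzer could hand a script a flattened description of each call expression; the CallSite type and check_open_mode function are hypothetical and do not correspond to any existing Clang interface.  The only real-world fact relied on is that the POSIX flag is spelled O_CREAT.  A purely textual match like this one needs no analysis state, whereas something like the llvm::StringRef lifetime check would need far more to be exposed.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CallSite:
    """Hypothetical flattened view of a call expression."""
    callee: str       # name of the called function
    args: List[str]   # source text of each argument, in order
    file: str
    line: int


def check_open_mode(call: CallSite) -> Optional[str]:
    """Warn when open() is given O_CREAT but no third (mode) argument."""
    if call.callee != "open" or len(call.args) < 2:
        return None
    # Purely syntactic: look for the O_CREAT token in the flags expression.
    uses_creat = re.search(r"\bO_CREAT\b", call.args[1]) is not None
    if uses_creat and len(call.args) < 3:
        return (
            f"{call.file}:{call.line}: "
            "open() with O_CREAT is missing the mode argument"
        )
    return None
```

Because the match is on source text, it misses flags passed through a variable or macro; catching those would require the semantic information discussed above.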

My feeling is that this is a big project.  I think the work on the mygcc support would be a great starting point, as the bulk of the logic would be on the Clang side, and then as you get experience working with the analysis engine you can gradually "move out" of Clang's interior and have plugins that interface with the analysis core.  The nice thing about tackling the smaller piece first is that (a) you would make steady progress instead of waiting for the "big feature" to get completed and (b) you will likely finish the GSoC project with a set of very usable pieces that can be used by users (even though a few big pieces might not be finished).

> 3) The closest tool I have seen to #2 is Dehydra [4], which also has a
> goal of allowing user-defined bug finding scripts. A complicating
> factor is that the scripting language is Javascript, and it may be
> infeasible to provide a compatible interface. Nevertheless, I am
> including replicating the interface here as a third possibility.

Replicating Dehydra's interface might be very useful for leveraging some of its checks.  One big caveat I see is that this has the challenges of (2) plus the additional burden of taking both a language *and* a checker API that someone else has already defined and then trying to match them to Clang's way of doing things.  I think this would be more feasible if the base infrastructure for (2) was already in place, but without it you are more at risk of not having time to finish the project.

For work done on the analyzer, I'd prefer GSoC projects that bring a new feature reasonably close to being usable by others.  If your project contains a set of milestones that deliver pieces of great functionality (e.g., mygcc support) on the way towards implementing some bigger feature, then the project work is always a net win.  I'd be happy to mentor GSoC work on any of these project ideas as long as it has this kind of trajectory.