[PATCH] [Clang Static Analyzer] Bug identification

Mon Jun 15 14:07:30 PDT 2015

Thank you for working on this; this is a long needed improvement! Here are some high level comments.

> 1. File name

Should we include redundancy in the hash? The file name is already part of the report. The current approach avoids redundancy but has a post processing step, where the true identifier is computed. See "getIssueIdentifier" of clang/utils/analyzer/CmpRuns.py.

> 2. Content of the line where the bug is

> 

> •	So if anything changes in the close environment of the bug, it changes the ID. We think that it is likely that the changes in the same line will semantically >affect the bug.

Close environment of the bug could potentially include other nodes on the diagnostic path. Have you considered including those as well? (This is food for thought, not a blocker for this patch.)

Have you seen PGOHash (./lib/CodeGen/CodeGenPGO.cpp), which creates a hash of the AST? Having a more semantic approach would be more robust. For example, it could protect from variable rename and other refactoring.

> 4. Unique name of the checker

> 5. Optional Extra field

I believe these should be replaced with a bug identifier. The need for #6 highlights that the checker name is not sufficient here. A bug identifier is the most logical match for this info.

> Due to overloaded functions and copy pasted implementations, it is likely that the same fault is found in two different overloaded functions.

What happens for bugs reported in template code?

> In the current code there exists a similar identifier generator to the one suggested above. That implementation takes into consideration only

> •	the name of the enclosing scope 

> •	and the relative line number within the enclosing scope.

> This source of information is insufficient as the base of the hash for the reasons described above.

> We included a version of the hash in the name of the key (<key>bug_id_1</key>) in the PLIST output to identify the hash generator algorithm. This way it will >be possible to introduce a new hash calculation algorithm if needed.

I would be interested in either replacing "issue_hash" or adding "issue_hash_bug_line_content" (or something like it) instead of adding another completely differently named field with very similar information. I see no reason for having both. I am not sure if we have any users of "issue_hash" right now, who will suffer from the change. Maybe we could have "issue_hash", "issue_hash_1"(offset based), and "issue_hash_2"(content of line) and add another field "issue_hash_version" that describes the version "issue_hash" is using?

This needs tests!!!

http://reviews.llvm.org/D10305

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/