[PATCH] [Clang Static Analyzer] Bug identification
Anna Zaks
zaks.anna at gmail.com
Mon Jun 15 14:07:30 PDT 2015
Thank you for working on this; this is a long needed improvement! Here are some high level comments.
> 1. File name
Should we include redundancy in the hash? The file name is already part of the report. The current approach avoids redundancy but has a post processing step, where the true identifier is computed. See "getIssueIdentifier" of clang/utils/analyzer/CmpRuns.py.
> 2. Content of the line where the bug is
>
> • So if anything changes in the close environment of the bug, it changes the ID. We think that it is likely that the changes in the same line will semantically >affect the bug.
Close environment of the bug could potentially include other nodes on the diagnostic path. Have you considered including those as well? (This is food for thought, not a blocker for this patch.)
Have you seen PGOHash (./lib/CodeGen/CodeGenPGO.cpp), which creates a hash of the AST? Having a more semantic approach would be more robust. For example, it could protect from variable rename and other refactoring.
> 4. Unique name of the checker
> 5. Optional Extra field
I believe these should be replaced with a bug identifier. The need for #6 highlights that the checker name is not sufficient here. A bug identifier is the most logical match for this info.
> Due to overloaded functions and copy pasted implementations, it is likely that the same fault is found in two different overloaded functions.
What happens for bugs reported in template code?
> In the current code there exists a similar identifier generator to the one suggested above. That implementation takes into consideration only
> • the name of the enclosing scope
> • and the relative line number within the enclosing scope.
> This source of information is insufficient as the base of the hash for the reasons described above.
> We included a version of the hash in the name of the key (<key>bug_id_1</key>) in the PLIST output to identify the hash generator algorithm. This way it will >be possible to introduce a new hash calculation algorithm if needed.
I would be interested in either replacing "issue_hash" or adding "issue_hash_bug_line_content" (or something like it) instead of adding another completely differently named field with very similar information. I see no reason for having both. I am not sure if we have any users of "issue_hash" right now, who will suffer from the change. Maybe we could have "issue_hash", "issue_hash_1"(offset based), and "issue_hash_2"(content of line) and add another field "issue_hash_version" that describes the version "issue_hash" is using?
This needs tests!!!
http://reviews.llvm.org/D10305
EMAIL PREFERENCES
http://reviews.llvm.org/settings/panel/emailpreferences/
More information about the cfe-commits
mailing list