[PATCH] [Clang Static Analyzer] Bug identification

Wed Jun 17 09:13:36 PDT 2015

> Should we include redundancy in the hash? The file name is already part of the report. The current approach avoids redundancy but has a post processing step, where the true identifier is computed. See "getIssueIdentifier" of clang/utils/analyzer/CmpRuns.py.

This is not redundant, and I think we should include the file name in the hash, otherwise bugs in copy pasted code-parts, but in different files will have the same ID.

> Have you seen PGOHash (./lib/CodeGen/CodeGenPGO.cpp), which creates a hash of the AST? Having a more semantic approach would be more robust. For example, it could protect from variable rename and other refactoring.

Good idea. We should try this out.

> I believe these should be replaced with a bug identifier. The need for #6 highlights that the checker name is not sufficient here. A bug identifier is the most logical match for this info.

This field is meant to differentiate between multiple reports by the exact same checker at the same position (line, column). So in this case only the checker writer can add additional information that can be a differentiator.

> What happens for bugs reported in template code?

The template code affects only the signature of enclosing scope part in the bugid calculation.
The concrete type of template parameters will be included in the signature of the enclosing scope for each different template instantiation.
For example:

  template <typename T>
  void foo(T x) {
  	int a = 2 / 0; // division by zero
  }

  int main() {
  	foo<int>(2);  // bugid part #5: void foo(int)
  	foo<char>('a'); // bugid part #5: void foo(char)
  }

The current version reports two different bugs.
Should I change to the current implementation to use only the name of the template parameters? For example, T in the previous example.

> I would be interested in either replacing "issue_hash" or adding "issue_hash_bug_line_content" (or something like it) instead of adding another completely differently named field with very similar information. I see no reason for having both. I am not sure if we have any users of "issue_hash" right now, who will suffer from the change. Maybe we could have "issue_hash", "issue_hash_1"(offset based), and "issue_hash_2"(content of line) and add another field "issue_hash_version" that describes the version "issue_hash" is using?

We could call it issue_hash_2, no problem.

http://reviews.llvm.org/D10305

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/