[cfe-dev] [RFC] cHash: Integration of AST hashing

Christian Dietrich via cfe-dev cfe-dev at lists.llvm.org
Thu Dec 14 04:04:46 PST 2017

Richard Trieu <rtrieu at google.com> writes:

>  People will add themselves as reviewers, so you can CC people that you
> think may have an interest in it.  If neither happens, adding a code owner
> and they will help find reviewers.
> Personally, I've only looked through small portions of your patch partially
> because I have other projects to work on and partially because I was
> waiting on your answers to Richard Smith's queries.  Basically, your plan
> for integrating with the existing hash functions and how the other hash
> functions, especially ODRHash, doesn't fill your need.  From what I gather,
> you think the ODRHash is too strict on the AST structure.  I would like to
> understand what differences in the AST that you are fine ignoring.

As, we have discussed in this thread, there are different requirements
for hashing an AST.

(Syntactic <-> Semantic) Hash: One would be to to a syntactic hash,
where only the syntactical children of a node are included into the
hash. Our cHash approach uses a semantical hash, which means that the
hash changes, when the meaning of a element (i.e. function) changes.
Such a change can, for example, happen if a type that is used in the
current element changes. So the cHash Visitor follows the cross-tree
references. As far as I understand it, the hash used by Johannes' clone
detection does not follow these links.

(Compiler Abort on cHash) When given a whole TranslationUnitDecl, the
CHashVisitor-hash-value remains the same, if the resulting binary code
will also be the same. We ignore changes to declarations and type, if
they are not used in any definition. But not all users of a AST Hashing
will want this, therefore, a AST Hashing mechanism should be adaptable.

(DataCollector) Furthermore, the AST hash implementation should be
adaptable to use different data collectors (our Clang Plugin uses a
MurMur3 Hash for performance reasons). 

With ODRHash/Stmt::Profile I see the the problem that it is very
monolitic. It implements its own version of an AST traversal and I
really do not understand why. Using the RecursiveASTVisitor is so much
more convenient and maintainable. So I think that basing
ODRHash/Stmt::Profile on this TableGen-based mechanism would increase
its maintainability. As I already mentioned, we have also implemented
our own version of an AST Visitor in the past. And switching to the
RecursiveASTVisitor saved around 1200 lines of boilerplate code.

I would suggest, that the differences between the current state of the
.td files is improved by looking through ODRHash/Stmt::Profile and we
base the later one on the RecursiveASTVisitor and the
*DataCollectors.td. Furthermore, I would like to integrate large parts
of our test-suite into the AST unit tests to give it a rigerous testing.

Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  dietrich at sra.uni-hannover.de
WWW:    https://www.sra.uni-hannover.de

More information about the cfe-dev mailing list