[cfe-dev] [RFC] cHash: Integration of AST hashing

Christian Dietrich via cfe-dev cfe-dev at lists.llvm.org
Mon Dec 11 05:14:08 PST 2017

Richard Trieu <rtrieu at google.com> writes:

> I don't understand your need for pseudrandom identifiers.  If a specific
> type of AST node (for instance, a Decl) is expected, then adding to the
> stream the node kind (like from Decl::getKind()) is enough to avoid
> collisions.  If any node can be expected, then first pass in an enum to
> differentiate the different types (Decl, Stmt, Type, Attr, etc), then add
> the node kind, and then continue on.  Stmt::Profile and ODRHash use this
> method to avoid collisions.  I don't see any gain from using new
> identifiers here.

Ok, so let's explain a little bit my idea here: Of course, including the
Decl::getKind() also distinguishes node types from each other. However,
these values come from enums and are, therefore, small integers. Small
integers (SI) occur very often in an AST. So, if a sequence of addData()
operations produces the same sequence of SIs because node types are not
distinguishable from SI that occur in the AST, we get a collision[1].

We could see the same sequence for different ASTs, if an AST node has a
variable number of children/references. Especially when using the
RecursiveASTVisitor, which reduces the implementation burden of such a
hashing method enormously[2], I'm not 100% sure, that I find all node
types with a variable number of children/references.

Introducing 32 Bit of random as an node-type identifier makes the
propability for such a constructive collision much smaller. So it is a
better-safe-than-sorry approach at this point. I also think the enum in
CHashVisitor.h is very very ugly.


[1] Sure, we use hashing, so there is always the possibility of hash
    collisions, but we should at least avoid to feed the hash algorithm
    the same sequence for different AST trees.

[2] By porting[3] CHash to the RecursiveASTVisitor saved about 1200
    lines of code.

[3] More like throwing away, and building from scratch by using
    fragments of the original code.
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  dietrich at sra.uni-hannover.de
WWW:    https://www.sra.uni-hannover.de

More information about the cfe-dev mailing list