[cfe-dev] [RFC] cHash: Integration of AST hashing

Christian Dietrich via cfe-dev cfe-dev at lists.llvm.org
Mon Dec 11 05:14:08 PST 2017


Richard Trieu <rtrieu at google.com> writes:

> I don't understand your need for pseudrandom identifiers.  If a specific
> type of AST node (for instance, a Decl) is expected, then adding to the
> stream the node kind (like from Decl::getKind()) is enough to avoid
> collisions.  If any node can be expected, then first pass in an enum to
> differentiate the different types (Decl, Stmt, Type, Attr, etc), then add
> the node kind, and then continue on.  Stmt::Profile and ODRHash use this
> method to avoid collisions.  I don't see any gain from using new
> identifiers here.

Ok, so let's explain a little bit my idea here: Of course, including the
Decl::getKind() also distinguishes node types from each other. However,
these values come from enums and are, therefore, small integers. Small
integers (SI) occur very often in an AST. So, if a sequence of addData()
operations produces the same sequence of SIs because node types are not
distinguishable from SI that occur in the AST, we get a collision[1].

We could see the same sequence for different ASTs, if an AST node has a
variable number of children/references. Especially when using the
RecursiveASTVisitor, which reduces the implementation burden of such a
hashing method enormously[2], I'm not 100% sure, that I find all node
types with a variable number of children/references.

Introducing 32 Bit of random as an node-type identifier makes the
propability for such a constructive collision much smaller. So it is a
better-safe-than-sorry approach at this point. I also think the enum in
CHashVisitor.h is very very ugly.

chris

[1] Sure, we use hashing, so there is always the possibility of hash
    collisions, but we should at least avoid to feed the hash algorithm
    the same sequence for different AST trees.

[2] By porting[3] CHash to the RecursiveASTVisitor saved about 1200
    lines of code.

[3] More like throwing away, and building from scratch by using
    fragments of the original code.
-- 
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  dietrich at sra.uni-hannover.de
WWW:    https://www.sra.uni-hannover.de



More information about the cfe-dev mailing list