[cfe-dev] [RFC] cHash: Integration of AST hashing
Richard Trieu via cfe-dev
cfe-dev at lists.llvm.org
Mon Dec 11 11:52:26 PST 2017
On Mon, Dec 11, 2017 at 5:14 AM, Christian Dietrich <
dietrich at sra.uni-hannover.de> wrote:
> Richard Trieu <rtrieu at google.com> writes:
>
> > I don't understand your need for pseudrandom identifiers. If a specific
> > type of AST node (for instance, a Decl) is expected, then adding to the
> > stream the node kind (like from Decl::getKind()) is enough to avoid
> > collisions. If any node can be expected, then first pass in an enum to
> > differentiate the different types (Decl, Stmt, Type, Attr, etc), then add
> > the node kind, and then continue on. Stmt::Profile and ODRHash use this
> > method to avoid collisions. I don't see any gain from using new
> > identifiers here.
>
> Ok, so let's explain a little bit my idea here: Of course, including the
> Decl::getKind() also distinguishes node types from each other. However,
> these values come from enums and are, therefore, small integers. Small
> integers (SI) occur very often in an AST. So, if a sequence of addData()
> operations produces the same sequence of SIs because node types are not
> distinguishable from SI that occur in the AST, we get a collision[1].
>
> How does the result of Decl::getKind() being a small integer allows any
chance of collision? I'll admit that arbitrary sub-sequences of different
hash streams can be equal, but we shouldn't be hashing sub-sequences.
Decl
<Decl::getKind()>
Decl with two fields
<Decl::getKind()> <Field 1> < Field 2>
Decl with one sub-Decl
<Decl::getKind()> <Stream of sub-decl>
Decl with arbitrary number of sub-Decl's
<Decl::getKind()> <Number of sub-Decl's> <Stream of sub-decl 1> ... <Stream
of sub-decl N>
The result of Decl::getKind() will basically specify the format of the hash
stream. I think better controlling of the input stream is a better way of
avoiding collisions then using magic numbers as identifiers.
We could see the same sequence for different ASTs, if an AST node has a
> variable number of children/references. Especially when using the
> RecursiveASTVisitor, which reduces the implementation burden of such a
> hashing method enormously[2], I'm not 100% sure, that I find all node
> types with a variable number of children/references.
>
Adding the number of children before visiting each child would avoid a
collision.
>
> Introducing 32 Bit of random as an node-type identifier makes the
> propability for such a constructive collision much smaller. So it is a
> better-safe-than-sorry approach at this point. I also think the enum in
> CHashVisitor.h is very very ugly.
>
> chris
>
> [1] Sure, we use hashing, so there is always the possibility of hash
> collisions, but we should at least avoid to feed the hash algorithm
> the same sequence for different AST trees.
>
> [2] By porting[3] CHash to the RecursiveASTVisitor saved about 1200
> lines of code.
>
> [3] More like throwing away, and building from scratch by using
> fragments of the original code.
> --
> Christian Dietrich, M.Sc. (Scientific Staff)
> Institute for Systems Engineering (Systems and Computerarchitecture)
> Leibniz Universität Hannover
> Appelstraße 4
> 30167 Hannover, Germany
>
> Tel: +49 511 762-19737
> Fax: +49 511 762-19733
> eMail: dietrich at sra.uni-hannover.de
> WWW: https://www.sra.uni-hannover.de
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20171211/dc5b6723/attachment.html>
More information about the cfe-dev
mailing list