[cfe-dev] [RFC] cHash: Integration of AST hashing

Mon Dec 11 11:52:26 PST 2017

On Mon, Dec 11, 2017 at 5:14 AM, Christian Dietrich <
dietrich at sra.uni-hannover.de> wrote:

> Richard Trieu <rtrieu at google.com> writes:
>
> > I don't understand your need for pseudrandom identifiers.  If a specific
> > type of AST node (for instance, a Decl) is expected, then adding to the
> > stream the node kind (like from Decl::getKind()) is enough to avoid
> > collisions.  If any node can be expected, then first pass in an enum to
> > differentiate the different types (Decl, Stmt, Type, Attr, etc), then add
> > the node kind, and then continue on.  Stmt::Profile and ODRHash use this
> > method to avoid collisions.  I don't see any gain from using new
> > identifiers here.
>
> Ok, so let's explain a little bit my idea here: Of course, including the
> Decl::getKind() also distinguishes node types from each other. However,
> these values come from enums and are, therefore, small integers. Small
> integers (SI) occur very often in an AST. So, if a sequence of addData()
> operations produces the same sequence of SIs because node types are not
> distinguishable from SI that occur in the AST, we get a collision[1].
>
> How does the result of Decl::getKind() being a small integer allows any
chance of collision?  I'll admit that arbitrary sub-sequences of different
hash streams can be equal, but we shouldn't be hashing sub-sequences.

Decl
<Decl::getKind()>

Decl with two fields
<Decl::getKind()> <Field 1> < Field 2>

Decl with one sub-Decl
<Decl::getKind()> <Stream of sub-decl>

Decl with arbitrary number of sub-Decl's
<Decl::getKind()> <Number of sub-Decl's> <Stream of sub-decl 1> ... <Stream
of sub-decl N>

The result of Decl::getKind() will basically specify the format of the hash
stream.  I think better controlling of the input stream is a better way of
avoiding collisions then using magic numbers as identifiers.

We could see the same sequence for different ASTs, if an AST node has a
> variable number of children/references. Especially when using the
> RecursiveASTVisitor, which reduces the implementation burden of such a
> hashing method enormously[2], I'm not 100% sure, that I find all node
> types with a variable number of children/references.
>

Adding the number of children before visiting each child would avoid a
collision.

>
> Introducing 32 Bit of random as an node-type identifier makes the
> propability for such a constructive collision much smaller. So it is a
> better-safe-than-sorry approach at this point. I also think the enum in
> CHashVisitor.h is very very ugly.
>
> chris
>
> [1] Sure, we use hashing, so there is always the possibility of hash
>     collisions, but we should at least avoid to feed the hash algorithm
>     the same sequence for different AST trees.
>
> [2] By porting[3] CHash to the RecursiveASTVisitor saved about 1200
>     lines of code.
>
> [3] More like throwing away, and building from scratch by using
>     fragments of the original code.
> --
> Christian Dietrich, M.Sc. (Scientific Staff)
> Institute for Systems Engineering (Systems and Computerarchitecture)
> Leibniz Universität Hannover
> Appelstraße 4
> 30167 Hannover, Germany
>
> Tel:    +49 511 762-19737
> Fax:    +49 511 762-19733
> eMail:  dietrich at sra.uni-hannover.de
> WWW:    https://www.sra.uni-hannover.de
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20171211/dc5b6723/attachment.html>