[cfe-dev] [RFC] cHash: Integration of AST hashing

Tue Dec 12 00:24:27 PST 2017

Richard Trieu <rtrieu at google.com> writes:

> On Mon, Dec 11, 2017 at 5:14 AM, Christian Dietrich <
> dietrich at sra.uni-hannover.de> wrote:
>
>> Richard Trieu <rtrieu at google.com> writes:
>>
>> > I don't understand your need for pseudrandom identifiers.  If a specific
>> > type of AST node (for instance, a Decl) is expected, then adding to the
>> > stream the node kind (like from Decl::getKind()) is enough to avoid
>> > collisions.  If any node can be expected, then first pass in an enum to
>> > differentiate the different types (Decl, Stmt, Type, Attr, etc), then add
>> > the node kind, and then continue on.  Stmt::Profile and ODRHash use this
>> > method to avoid collisions.  I don't see any gain from using new
>> > identifiers here.
>>
>> Ok, so let's explain a little bit my idea here: Of course, including the
>> Decl::getKind() also distinguishes node types from each other. However,
>> these values come from enums and are, therefore, small integers. Small
>> integers (SI) occur very often in an AST. So, if a sequence of addData()
>> operations produces the same sequence of SIs because node types are not
>> distinguishable from SI that occur in the AST, we get a collision[1].
>>
> How does the result of Decl::getKind() being a small integer allows any
> chance of collision?  I'll admit that arbitrary sub-sequences of different
> hash streams can be equal, but we shouldn't be hashing sub-sequences.

Sure, we shouldn't take an arbitrary subsequence of the stream and hash
it. That would be a bad idea.

> Decl
> <Decl::getKind()>
>
> Decl with two fields
> <Decl::getKind()> <Field 1> < Field 2>
>
> Decl with one sub-Decl
> <Decl::getKind()> <Stream of sub-decl>
>
> Decl with arbitrary number of sub-Decl's
> <Decl::getKind()> <Number of sub-Decl's> <Stream of sub-decl 1> ... <Stream
> of sub-decl N>
>
> The result of Decl::getKind() will basically specify the format of the hash
> stream.  I think better controlling of the input stream is a better way of
> avoiding collisions then using magic numbers as identifiers.
>
> We could see the same sequence for different ASTs, if an AST node has a

Sure, this would definitly be a better option. A quick look into the
RecursiveASTVisitor tells me that there are at least 76 for (...) loops.
So we have to add them.

Perhaps we can combine both, length-indicators and and
non-small-integers, by using llvm::hash on the <Decl::getKind()> values.
As the documentation says, we have to spent less than 20 cycles per
32-bit value on this. By this, we avoid the magic numbers. I will start
on doing the changes to incorporate the length fields.

Furthermore, I'm not sure how the clang community handles the "how to
find a reviewer for my patch". Are people adding themselves to be a
reviewer? Or do I just add people how I think could be interessted in
reviewing the change?

chris
-- 
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  dietrich at sra.uni-hannover.de
WWW:    https://www.sra.uni-hannover.de