[cfe-dev] [RFC] cHash: Integration of AST hashing
Christian Dietrich via cfe-dev
cfe-dev at lists.llvm.org
Tue Dec 12 00:24:27 PST 2017
Richard Trieu <rtrieu at google.com> writes:
> On Mon, Dec 11, 2017 at 5:14 AM, Christian Dietrich <
> dietrich at sra.uni-hannover.de> wrote:
>> Richard Trieu <rtrieu at google.com> writes:
>> > I don't understand your need for pseudrandom identifiers. If a specific
>> > type of AST node (for instance, a Decl) is expected, then adding to the
>> > stream the node kind (like from Decl::getKind()) is enough to avoid
>> > collisions. If any node can be expected, then first pass in an enum to
>> > differentiate the different types (Decl, Stmt, Type, Attr, etc), then add
>> > the node kind, and then continue on. Stmt::Profile and ODRHash use this
>> > method to avoid collisions. I don't see any gain from using new
>> > identifiers here.
>> Ok, so let's explain a little bit my idea here: Of course, including the
>> Decl::getKind() also distinguishes node types from each other. However,
>> these values come from enums and are, therefore, small integers. Small
>> integers (SI) occur very often in an AST. So, if a sequence of addData()
>> operations produces the same sequence of SIs because node types are not
>> distinguishable from SI that occur in the AST, we get a collision.
> How does the result of Decl::getKind() being a small integer allows any
> chance of collision? I'll admit that arbitrary sub-sequences of different
> hash streams can be equal, but we shouldn't be hashing sub-sequences.
Sure, we shouldn't take an arbitrary subsequence of the stream and hash
it. That would be a bad idea.
> Decl with two fields
> <Decl::getKind()> <Field 1> < Field 2>
> Decl with one sub-Decl
> <Decl::getKind()> <Stream of sub-decl>
> Decl with arbitrary number of sub-Decl's
> <Decl::getKind()> <Number of sub-Decl's> <Stream of sub-decl 1> ... <Stream
> of sub-decl N>
> The result of Decl::getKind() will basically specify the format of the hash
> stream. I think better controlling of the input stream is a better way of
> avoiding collisions then using magic numbers as identifiers.
> We could see the same sequence for different ASTs, if an AST node has a
Sure, this would definitly be a better option. A quick look into the
RecursiveASTVisitor tells me that there are at least 76 for (...) loops.
So we have to add them.
Perhaps we can combine both, length-indicators and and
non-small-integers, by using llvm::hash on the <Decl::getKind()> values.
As the documentation says, we have to spent less than 20 cycles per
32-bit value on this. By this, we avoid the magic numbers. I will start
on doing the changes to incorporate the length fields.
Furthermore, I'm not sure how the clang community handles the "how to
find a reviewer for my patch". Are people adding themselves to be a
reviewer? Or do I just add people how I think could be interessted in
reviewing the change?
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
30167 Hannover, Germany
Tel: +49 511 762-19737
Fax: +49 511 762-19733
eMail: dietrich at sra.uni-hannover.de
More information about the cfe-dev