[cfe-dev] [RFC] cHash: Integration of AST hashing

Mon Aug 14 03:09:23 PDT 2017

Christian Dietrich via cfe-dev <cfe-dev at lists.llvm.org> writes:

> Raphael Isemann <teemperor at gmail.com> writes:
>
>> That sounds like a really nice tool!
>
> I also really like the general principle of avoiding work that is
> redundant in the end. Let's save some polar bears! And some developers
> from serious injuries stemming from sword fights.
>
>> One problem I see is that we get yet another AST hashing
>> implementation in clang. I'm aware of at least four of them that I
>> personally work with:
>>
>> * The normal Stmt::Profile for Stmts
>
> This does indeed look structurally similar to our implementation for
> statements. However, types are only included as means of their pointers
> and, therefore, the hash is not stable over compilations. I wonder for
> which applications this is actually sufficient. And, does it really
> speed up things to badly?
>
>> * The ODR hashing for AST nodes in modules.
>
> This is also based on StmtProfilerWithoutPointers as far as I can
> tell[1]. And it is stable over invocations as it actually handles types.
> However, changing the order of declarations[1] will alter the hash. Also
> one thing we did in cHash is to use pseudrandom identifiers for
> different ast node types[3] to avoid some of the possible collisions. If
> you only add small integer values (getKind(), etc.) It is much easier to
> have a collision. Therefore, we add something like
>
> <UNIQUE-NODE-CLASS-ID> <FIELD1> <FIELD2> <FIELD3>

We should probably do the same, maybe also for other enum members.

In addition, it may be necessary to add some padding or separator
between when hashing multiple nodes as they have different numbers of
fields, I am not an expert though.

>
> to the hashing stream most of the time. Ideally, the text that goes into
> the hash should be usable to reconstruct an AST tree with the same
> structure; not neccessarily with the same information.
>
>> * The ASTDiff also has one
>
> Ok, then I was unable to find it. Can you give me a pointer
>
>> * The CloneDetector with its StmtDataCollector which can be
>> specialized for hashing
>>
>> They all contain similar code and have similar goals. And they all
>> contain different kind of bugs. And that's just the implementations
>> that I'm aware of.
>
> So, when they all contain some kinds of quirks, we really should build
> a good test harness around a unified implementation. Since nobody wants
> to break the "give me a unique identifier" silently.
>
>> I would very much prefer if we don't add more redundant code here but
>> instead try to figure out if we can get all this code in a more
>> central place.
>
> I like that idea and I do not necessarily stick to our implementation.
>
>> Most of the use cases are just about
>>
>>> I have this node, please forward all interesting data into this
>>> stream so I can [store|hash|...] it and I [do|don't] care about
>>> having cross-TU stable data and maybe I want to treat special nodes
>>> in a different way
>>
>> so it's not necessarily a difficult problem to solve. If we find a
>> common baseline for this, we could make one visitor that people can
>> specialize their specific hashing needs on top.
>
> Your are right, it is not a difficult problem in the end. But when doing
> It for the whole AST there are some weird corner cases that should be
> handled correctly (recursive type definitions come to my mind)
>
>> I already started doing this for the Stmt hashing when I refactored
>> the StmtDataCollector. I set a common baseline that I could turn my
>> two hashing implementations back then into a single one and I'm aiming
>> at also replacing the Stmt::Profile implementations with it.
>>
>> Not sure if anyone started working on something similar for generic
>> AST nodes (which is probably a bit more complicated than the Stmts
>> which are relatively simple), but maybe we should see if we can get
>> some people together to work on that. Otherwise we will have a dozen
>> hashing implementations in a few years that no one can/will maintain.
>
> Bringing cHash to mainline is based on the wish that I don't want to
> maintain an AST hashing mechanism in parallel to the clang development.
> Therefore, I would love to help with that. We should collect some
> requirements that the different users of unique identifiers have. And
> perhaps we can make them all happy by introducing a few switches to a
> unified visitor.

Yes, it would be good to identify the different kinds of uniqueness that
people need.

>
> Some switches and requirements come directly to my mind:
>
> - Syntactical vs. Semantic: There should be a switch to include only
>   nodes that are actually syntacical children of the given node. Not
>   everybody wants to include the transitive hull over all referenced AST
>   nodes.

Sounds good, so the syntactical one discards any references to nodes
save for its children. I can use this in ASTDiff for checking whether
two matched nodes differ.

>
> - Multiple Hashes in one pass: With one pass over the AST I want to have
>   not only one top-level hash, but also hashes of some interesting nodes
>   (top-level definitions, type declarations, global variables). cHash
>   already caches hashes for some nodes in a map to speed things up.
>
> - Debug information: I want to turn on/off the inclusion of debugging
>   information (line numbers, file names, local identifier names).
>

These two should be easy to do, I think we definitely want a system that can
have user-provided functions for each node class.

> - Stable?: Should the hash be stable over compiler invocations? Or can
>   we take some shortcuts for types/declarations/things that are not
>   syntactially enclosed into the node.

Hopefully it is not too difficult to make it stable.

If there is a node that is referred to by its name, then it suffices to hash the
qualified name, provided that the referenced node is also included in
the hash.

So there are references to other nodes by name.
All other references between nodes I can think of are based on the tree
structure (correct me if I am wrong), so we should be fine if we hash
subtrees.

Johannes

>
> chris
> -- 
> Christian Dietrich, M.Sc. (Scientific Staff)
> Institute for Systems Engineering (Systems and Computerarchitecture)
> Leibniz Universität Hannover
> Appelstraße 4
> 30167 Hannover, Germany
>
> Tel:    +49 511 762-19737
> Fax:    +49 511 762-19733
> eMail:  dietrich at sra.uni-hannover.de
> WWW:    https://www.sra.uni-hannover.de
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev