[cfe-dev] [RFC] cHash: Integration of AST hashing

Mon Aug 7 01:05:45 PDT 2017

Raphael Isemann <teemperor at gmail.com> writes:

> That sounds like a really nice tool!

I also really like the general principle of avoiding work that is
redundant in the end. Let's save some polar bears! And some developers
from serious injuries stemming from sword fights.

> One problem I see is that we get yet another AST hashing
> implementation in clang. I'm aware of at least four of them that I
> personally work with:
>
> * The normal Stmt::Profile for Stmts

This does indeed look structurally similar to our implementation for
statements. However, types are only included as means of their pointers
and, therefore, the hash is not stable over compilations. I wonder for
which applications this is actually sufficient. And, does it really
speed up things to badly?

> * The ODR hashing for AST nodes in modules.

This is also based on StmtProfilerWithoutPointers as far as I can
tell[1]. And it is stable over invocations as it actually handles types.
However, changing the order of declarations[1] will alter the hash. Also
one thing we did in cHash is to use pseudrandom identifiers for
different ast node types[3] to avoid some of the possible collisions. If
you only add small integer values (getKind(), etc.) It is much easier to
have a collision. Therefore, we add something like

<UNIQUE-NODE-CLASS-ID> <FIELD1> <FIELD2> <FIELD3>

to the hashing stream most of the time. Ideally, the text that goes into
the hash should be usable to reconstruct an AST tree with the same
structure; not neccessarily with the same information.

> * The ASTDiff also has one

Ok, then I was unable to find it. Can you give me a pointer

> * The CloneDetector with its StmtDataCollector which can be
> specialized for hashing
>
> They all contain similar code and have similar goals. And they all
> contain different kind of bugs. And that's just the implementations
> that I'm aware of.

So, when they all contain some kinds of quirks, we really should build
a good test harness around a unified implementation. Since nobody wants
to break the "give me a unique identifier" silently.

> I would very much prefer if we don't add more redundant code here but
> instead try to figure out if we can get all this code in a more
> central place.

I like that idea and I do not necessarily stick to our implementation.

> Most of the use cases are just about
>
>> I have this node, please forward all interesting data into this
>> stream so I can [store|hash|...] it and I [do|don't] care about
>> having cross-TU stable data and maybe I want to treat special nodes
>> in a different way
>
> so it's not necessarily a difficult problem to solve. If we find a
> common baseline for this, we could make one visitor that people can
> specialize their specific hashing needs on top.

Your are right, it is not a difficult problem in the end. But when doing
It for the whole AST there are some weird corner cases that should be
handled correctly (recursive type definitions come to my mind)

> I already started doing this for the Stmt hashing when I refactored
> the StmtDataCollector. I set a common baseline that I could turn my
> two hashing implementations back then into a single one and I'm aiming
> at also replacing the Stmt::Profile implementations with it.
>
> Not sure if anyone started working on something similar for generic
> AST nodes (which is probably a bit more complicated than the Stmts
> which are relatively simple), but maybe we should see if we can get
> some people together to work on that. Otherwise we will have a dozen
> hashing implementations in a few years that no one can/will maintain.

Bringing cHash to mainline is based on the wish that I don't want to
maintain an AST hashing mechanism in parallel to the clang development.
Therefore, I would love to help with that. We should collect some
requirements that the different users of unique identifiers have. And
perhaps we can make them all happy by introducing a few switches to a
unified visitor.

Some switches and requirements come directly to my mind:

- Syntactical vs. Semantic: There should be a switch to include only
  nodes that are actually syntacical children of the given node. Not
  everybody wants to include the transitive hull over all referenced AST
  nodes.

- Multiple Hashes in one pass: With one pass over the AST I want to have
  not only one top-level hash, but also hashes of some interesting nodes
  (top-level definitions, type declarations, global variables). cHash
  already caches hashes for some nodes in a map to speed things up.

- Debug information: I want to turn on/off the inclusion of debugging
  information (line numbers, file names, local identifier names).

- Stable?: Should the hash be stable over compiler invocations? Or can
  we take some shortcuts for types/declarations/things that are not
  syntactially enclosed into the node.

chris
-- 
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  dietrich at sra.uni-hannover.de
WWW:    https://www.sra.uni-hannover.de