[cfe-dev] [RFC] cHash: Integration of AST hashing

Mon Oct 23 07:00:26 PDT 2017

First of all, I'm sorry for the long delay in my answer. There was
vacation and an awful lot of traveling in August and September.

Johannes Altmanninger <aclopte at gmail.com> writes:

>> <UNIQUE-NODE-CLASS-ID> <FIELD1> <FIELD2> <FIELD3>
>
> We should probably do the same, maybe also for other enum members.
>
> In addition, it may be necessary to add some padding or separator
> between when hashing multiple nodes as they have different numbers of
> fields, I am not an expert though.

I would see two basic cases:

1. The number of fields is constant -> The knowledge is derivable from
   the UNIQUE-NODE-CLASS-ID.
2. Variable number of fields (or children): Just hash in a 4 byte
   integer.

>> - Stable?: Should the hash be stable over compiler invocations? Or can
>>   we take some shortcuts for types/declarations/things that are not
>>   syntactially enclosed into the node.
>
> Hopefully it is not too difficult to make it stable.
>
> If there is a node that is referred to by its name, then it suffices to hash the
> qualified name, provided that the referenced node is also included in
> the hash.

I see one problem here, at least if we want to make a syntactial hash,
and I believe that we have to include not only the qualified name, but
the hash behind that name.

I think a problem arises, if we make one pass over the whole tree
(starting from a TranslationUnit) and record not only the top-level
hash, but also every function level hash. Let's assume the following two
scenarios, where we expect both function hashes of foo to be equal:

: struct bar { int x; };
:
: void foo(struct bar y) {};

hash(translation-unit) = hash(foo) + hash(struct bar)
  where
      hash(foo) = hash(void) + hash(foo)
                  + hash("struct bar") + hash(y) + hash({})
      hash(struct bar) = ....

: struct bar { long x; };
: 
: void foo(struct bar y) {};

hash(translation-unit) = hash(foo) + hash(baz) + hash(struct bar)
  where
      hash(foo) = hash(void) + hash(foo)
                  + hash("struct bar") + hash(y) + hash({})
      hash(struct bar) = ....

In this case, the top-level hash changes, as struct bar changes, but the
hash for 'foo' does not change. This is OK for a syntactic hash, but not
for a semantic hash. 

> So there are references to other nodes by name.
> All other references between nodes I can think of are based on the tree
> structure (correct me if I am wrong), so we should be fine if we hash
> subtrees.

It depends. You can have an expression that references a type without
explicitly naming it:

: struct bar { int x }
: struct foo { struct bar * BAR; }
:
: int foo() {
:   struct foo FOO;
:   FOO.BAR.x = 23;
:   return FOO.BAR.x;
: }
:
: int barfoo() {
:    struct *foo FOO = mallow(23);
: }

Here, we see that that:

  a) foo depends on a type it does not name explicitly, but only
     implicitly, by using foo.
  b) The binary of barfoo does NOT change, if 'struct bar' changes.

Therefore, I believe it is necessary to include all type references with
their hash for the semantic hash.

As I see, you have done a lot of work on the StmtDataCollectors and used
TableGen to generate the DataCollector. So my idea would be the following:

- Extend the StmtDataCollectors.td to handle all possible nodes/types in
  the clang AST. Invoking this operator once, we get the hash value for
  one node.

- Derive an recursive AST hashing visitor from RecursiveASTVisitor. This
  recursive AST visitor then has our desired flags:

  - semantic_hash vs syntactic_hash
  - Hash mechanism as a template parameter
  - Include debug flags

One problem I have with RecursiveASTVisitor is either preorder, or post
order. However, to calculate an Hash for let's say a type, I have to:

    PreVisitType: H = new Hash
   VisitFuncType: H.addData(FUNC_TYPE)
                  hash(T.getReturnType())
                     -> Recursion
   PostVisitType: Store Hash for function

And I have not seen, how I could do this with RecursiveASTVisitor

chris
-- 
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  dietrich at sra.uni-hannover.de
WWW:    https://www.sra.uni-hannover.de