[cfe-dev] [RFC] cHash: Integration of AST hashing

Mon Mar 12 08:39:21 PDT 2018

So first, I want to make some points clear:

- The patch extends the already existing DataCollector interface to
  support more more AST node types. This extension tries to accomplish
  the principle to include only data that is local to that node[1]. So
  here, I do not introduce a new hashing infrastructure, but merely
  extend an exisiting infrastructure that I found in LLVM.

- The CHashVisitor only is a user of the provided infrastructure. It
  provides a semantic[2] hash for C.  At the moment the
  CHashVisitor does not cover all C++ AST nodes. C++ is the next step,
  after merging this change.

@Regarding Tests: There is an extensive test suite for CHash[3] in our external
 repository. I only included a very rudimentary unittests into this
 patch. After we resolved the discussion about the best principle how to
 handle AST Hashing in clang, I will integrate and consolidate our tests
 into CHashTest.cpp.

@New Visit Functions in CHashVisitor: The CHashVisitor overrides some
  Visit* functions from the DataCollector infrastructure to include
  information form cross-tree references to AST nodes (like types) to
  accomplish its goal of an semantic hash. By overriding these in the
  user of the DataCollector, the central DataCollector infrastructure
  can remain "node local" while CHashVisitor includes links to other
  nodes that can influence the semantic of the AST node.

  By this separation of concers, a family of AST hashers should be
  impementable without reinventing the wheel over and over again.

Richard Trieu via cfe-dev <cfe-dev at lists.llvm.org> writes:

> I posted some questions to the phab site about the design of the hasher.  I
> suggest you try your hasher on clang/test/Modules/odr_hash.h with the flag
> -DFIRST.  I put many of the issues I encountered into there.

Thank you for that pointer. The file actually revealed some problem
(read stackoverflows) that the current CHashVisitor included for CXX
code. This included that CXXThisExpr references to the enclosing
CXXRecord and that NamedDecl::getName() cannot be called on all CXX
nodes. However, now CHashVisitor can process the odr_hash.h.

> I'm still iffy on how basing all the hashers on *Collectors.td would work.
> Is there enough flexibility for changes each hasher needs that it wouldn't
> result in a lot of custom code anyways?

But wouldn't that be an argument to include more hashers with totally
different implementations? Howver, even if there are custom changes for
different hashers, basing them on TableGen and RecursiveASTVisitor is
probably the best route to handle the issue. Especially, if we consider
including something like D40781.

Regarding ODR Hash: Can you describe the semantic of the ODR hashing in
a few sentences? Under what circumstance does the hash change? What is
the high-level principle? Under what circumstances is a hash collision
possible.

Vassil Vassilev on Phabricator wrote:

> I am not sure if this is the right time but maybe testing and comparing
> the behavior of this patch against something such as the ODRHash will
> help you, me (and probably others) to understand the current patch
> better.

I think that ODRHash.cpp and StmtProfile.cpp include very valuable
knowledge what should should be included into an AST hash and how to
handle edge cases. Especially when it comes to fully handle C++ nodes.

However, CHashVisitor does not try to replace ODRHash/StmtProfiler, it
has its own purpose. The goal of this patch is to start consolidating
these mechanisms. For this consolidation, I took the time to rewrite
CHash to be based on RecursiveASTVisitor and DataCollector. As a result,
the CHashVisitor dropped down to 320 lines of custom code to establish
cross-tree links.

> In https://reviews.llvm.org/D41416, [...], essentially I am comparing
> the behavior of the ODRHash and the Stmt::Profile logic for some
> cases.
>
> Perhaps you can plugin your algorithm to it, create a pch and start
> comparing your logic with the when it comes to template arguments
> hashing and alike.
>
> Of course this is not something that will give you perfect coverage
> and so on but it can stress test the implementation and compare it to
> the 2 major existing ones in clang.

For C code, we already stress tested CHash with 2300 commits from musl
and the CHash AST hash always changed when the object file changed (no
false negative). Furthermore, we compiled 500 commits from lua, mbedTLS,
musl, bash, CPython and PostgreSQL. So at least for C code I'm very
confident that CHashVisitor handles a lot of code in the wild.

chris

[1] This principle breaks at the "// CrossRef" Sections. Since there I
   call addData() on range<> and ArrayRef<> types to give the user of
   the DataCollector the chance to include the number of children into
   its hash.

[2] I refer you to the previous messages from this discussion and to the
    Usenix paper for the term "semantic hash".

[3] https://github.com/luhsra/chash/tree/master/test
-- 
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  dietrich at sra.uni-hannover.de
WWW:    https://www.sra.uni-hannover.de