[cfe-dev] [RFC] cHash: Integration of AST hashing

Fri Mar 16 19:28:39 PDT 2018

The original StmtDataCollector was made for StmtVisitor, not
RecursiveASTVisitor.  There's a difference in how the function work that
needs to be noted.

I don't like using the RecursiveASTVisitor in this form.  It was designed
so that users can define custom Visit* functions to occur when an AST node
is traversed.  The most evident part of this is that cHash needed to
include sizes of various containers far from the point where they are
iterated over.  The size being added is decouped from where it's used, and
can be inaccurate if nodes are skipped.  I believe you picked this way
because of how much code you saved, but it also hides away how each part of
the AST is treated.

You are also using printing on some nodes.  Although it works, printing
should be avoided.

Another point I'd like to bring up is about the hashing itself.  The method
you are using is to push individual values to the hash.  Other places use a
vector to store values and only pushing it to hash as a large chunk when
finished.  I'm not an expert in hashing, but I've thought that hashing
would work faster on a large chunk than spread out over smaller values.

On Mon, Mar 12, 2018 at 8:39 AM, Christian Dietrich <
dietrich at sra.uni-hannover.de> wrote:

> So first, I want to make some points clear:
>
> - The patch extends the already existing DataCollector interface to
>   support more more AST node types. This extension tries to accomplish
>   the principle to include only data that is local to that node[1]. So
>   here, I do not introduce a new hashing infrastructure, but merely
>   extend an exisiting infrastructure that I found in LLVM.
>

The DataCollector implementation raises a lot of concerns.  I know you are
just working off an existing interface, but if it gains more users, fixing
it now is better than fixing it later.

addData functions
I've used FoldingNodeSetID as the collection point for data.  Doc:
http://llvm.org/doxygen/classllvm_1_1FoldingSetNodeID.html
It has an overloaded AddInteger function, plus AddBoolean and AddString
function.  Having one addData for integers, strings, and QualTypes is
confusing, plus the proposed addData for range sizes.

llvm::hash_code/llvm::hash_value
llvm::hash_value functions return a llvm::hash_code.  The comment in the
definition of hash_code say "this value should not be trusted to be stable
or predictable across processes or executions."  That means it is not
suitable anywhere a stable hash is required.

Printing to strings
Don't use any method for printing to strings then pushing the strings to
addData.  Printing to string takes extra work and is based on information
can be gathered from the node anyway.  Just add that information and skip
the printing.

> - The CHashVisitor only is a user of the provided infrastructure. It
>   provides a semantic[2] hash for C.  At the moment the
>   CHashVisitor does not cover all C++ AST nodes. C++ is the next step,
>   after merging this change.
>
> @Regarding Tests: There is an extensive test suite for CHash[3] in our
> external
>  repository. I only included a very rudimentary unittests into this
>  patch. After we resolved the discussion about the best principle how to
>  handle AST Hashing in clang, I will integrate and consolidate our tests
>  into CHashTest.cpp.
>
> @New Visit Functions in CHashVisitor: The CHashVisitor overrides some
>   Visit* functions from the DataCollector infrastructure to include
>   information form cross-tree references to AST nodes (like types) to
>   accomplish its goal of an semantic hash. By overriding these in the
>   user of the DataCollector, the central DataCollector infrastructure
>   can remain "node local" while CHashVisitor includes links to other
>   nodes that can influence the semantic of the AST node.
>
  By this separation of concers, a family of AST hashers should be
>   impementable without reinventing the wheel over and over again.
>
> I still have questions about local versus cross-ref.  It looks like you
allow all the functions to call Type's, but don't allow them to call Decl's
and Expr's.  Is that the distinction you are going for?  Also, a bunch of
stuff label cross-ref is just sizes of a collection without processing the
collection itself.

Coming from the ODRHash, Type's, Decl's, and Expr's were free to visit each
other.  I did create two levels of Decl visiting to help avoid the
recursion this caused.  And the only time a collection sized was used was
when the elements of the collection were processed next.

> Richard Trieu via cfe-dev <cfe-dev at lists.llvm.org> writes:
>
> > I posted some questions to the phab site about the design of the
> hasher.  I
> > suggest you try your hasher on clang/test/Modules/odr_hash.h with the
> flag
> > -DFIRST.  I put many of the issues I encountered into there.
>
> Thank you for that pointer. The file actually revealed some problem
> (read stackoverflows) that the current CHashVisitor included for CXX
> code. This included that CXXThisExpr references to the enclosing
> CXXRecord and that NamedDecl::getName() cannot be called on all CXX
> nodes. However, now CHashVisitor can process the odr_hash.h.
>
>
> > I'm still iffy on how basing all the hashers on *Collectors.td would
> work.
> > Is there enough flexibility for changes each hasher needs that it
> wouldn't
> > result in a lot of custom code anyways?
>
> But wouldn't that be an argument to include more hashers with totally
> different implementations? Howver, even if there are custom changes for
> different hashers, basing them on TableGen and RecursiveASTVisitor is
> probably the best route to handle the issue. Especially, if we consider
> including something like D40781.
>
> Regarding ODR Hash: Can you describe the semantic of the ODR hashing in
> a few sentences? Under what circumstance does the hash change? What is
> the high-level principle? Under what circumstances is a hash collision
> possible.
>

Ideally, hash collisions will only occur on token-wise identical code.  If
two pieces of code have identical behavior, but was created with different
tokens, their hashes would differ.  This is so that every hash mismatch
would be the result of an ODR violation.  The ODRHash is a work in
progress, so any unsupported AST nodes show up the same as an empty nodes,
since having false hash collision is better than having a false hash match.

>
> Vassil Vassilev on Phabricator wrote:
>
> > I am not sure if this is the right time but maybe testing and comparing
> > the behavior of this patch against something such as the ODRHash will
> > help you, me (and probably others) to understand the current patch
> > better.
>
> I think that ODRHash.cpp and StmtProfile.cpp include very valuable
> knowledge what should should be included into an AST hash and how to
> handle edge cases. Especially when it comes to fully handle C++ nodes.
>
> However, CHashVisitor does not try to replace ODRHash/StmtProfiler, it
> has its own purpose. The goal of this patch is to start consolidating
> these mechanisms. For this consolidation, I took the time to rewrite
> CHash to be based on RecursiveASTVisitor and DataCollector. As a result,
> the CHashVisitor dropped down to 320 lines of custom code to establish
> cross-tree links.
>
> > In https://reviews.llvm.org/D41416, [...], essentially I am comparing
> > the behavior of the ODRHash and the Stmt::Profile logic for some
> > cases.
> >
> > Perhaps you can plugin your algorithm to it, create a pch and start
> > comparing your logic with the when it comes to template arguments
> > hashing and alike.
> >
> > Of course this is not something that will give you perfect coverage
> > and so on but it can stress test the implementation and compare it to
> > the 2 major existing ones in clang.
>
> For C code, we already stress tested CHash with 2300 commits from musl
> and the CHash AST hash always changed when the object file changed (no
> false negative). Furthermore, we compiled 500 commits from lua, mbedTLS,
> musl, bash, CPython and PostgreSQL. So at least for C code I'm very
> confident that CHashVisitor handles a lot of code in the wild.
>
> chris
>
> [1] This principle breaks at the "// CrossRef" Sections. Since there I
>    call addData() on range<> and ArrayRef<> types to give the user of
>    the DataCollector the chance to include the number of children into
>    its hash.
>
> [2] I refer you to the previous messages from this discussion and to the
>     Usenix paper for the term "semantic hash".
>
> [3] https://github.com/luhsra/chash/tree/master/test
> --
> Christian Dietrich, M.Sc. (Scientific Staff)
> Institute for Systems Engineering (Systems and Computerarchitecture)
> Leibniz Universität Hannover
> Appelstraße 4
> 30167 Hannover, Germany
>
> Tel:    +49 511 762-19737
> Fax:    +49 511 762-19733
> eMail:  dietrich at sra.uni-hannover.de
> WWW:    https://www.sra.uni-hannover.de
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180316/ad7748ee/attachment.html>