<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">The original StmtDataCollector was made for StmtVisitor, not RecursiveASTVisitor.  There's a difference in how the function work that needs to be noted.</div><div class="gmail_quote"><br></div><div class="gmail_quote">I don't like using the RecursiveASTVisitor in this form.  It was designed so that users can define custom Visit* functions to occur when an AST node is traversed.  The most evident part of this is that cHash needed to include sizes of various containers far from the point where they are iterated over.  The size being added is decouped from where it's used, and can be inaccurate if nodes are skipped.  I believe you picked this way because of how much code you saved, but it also hides away how each part of the AST is treated.</div><div class="gmail_quote"><br></div><div class="gmail_quote">You are also using printing on some nodes.  Although it works, printing should be avoided.  </div><div class="gmail_quote"><br></div><div class="gmail_quote">Another point I'd like to bring up is about the hashing itself.  The method you are using is to push individual values to the hash.  Other places use a vector to store values and only pushing it to hash as a large chunk when finished.  I'm not an expert in hashing, but I've thought that hashing would work faster on a large chunk than spread out over smaller values.</div><div class="gmail_quote"><br></div><div class="gmail_quote">On Mon, Mar 12, 2018 at 8:39 AM, Christian Dietrich <span dir="ltr"><<a href="mailto:dietrich@sra.uni-hannover.de" target="_blank">dietrich@sra.uni-hannover.de</a>></span> wrote:<br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">So first, I want to make some points clear:<br>

<br>

- The patch extends the already existing DataCollector interface to<br>

  support more more AST node types. This extension tries to accomplish<br>

  the principle to include only data that is local to that node[1]. So<br>

  here, I do not introduce a new hashing infrastructure, but merely<br>

  extend an exisiting infrastructure that I found in LLVM.<br></blockquote><div><br></div><div>The DataCollector implementation raises a lot of concerns.  I know you are just working off an existing interface, but if it gains more users, fixing it now is better than fixing it later.</div><div><br></div><div>addData functions</div><div>I've used FoldingNodeSetID as the collection point for data.  Doc: <a href="http://llvm.org/doxygen/classllvm_1_1FoldingSetNodeID.html" target="_blank">http://llvm.org/doxygen/c<wbr>lassllvm_1_1FoldingSetNodeID.h<wbr>tml</a></div><div>It has an overloaded AddInteger function, plus AddBoolean and AddString function.  Having one addData for integers, strings, and QualTypes is confusing, plus the proposed addData for range sizes.</div><div><br></div><div>llvm::hash_code/llvm::hash_val<wbr>ue</div><div>llvm::hash_value functions return a llvm::hash_code.  The comment in the definition of hash_code say "this value should not be trusted to be stable or predictable across processes or executions."  That means it is not suitable anywhere a stable hash is required.</div><div><br></div><div>Printing to strings</div><div>Don't use any method for printing to strings then pushing the strings to addData.  Printing to string takes extra work and is based on information can be gathered from the node anyway.  Just add that information and skip the printing.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

- The CHashVisitor only is a user of the provided infrastructure. It<br>

  provides a semantic[2] hash for C.  At the moment the<br>

  CHashVisitor does not cover all C++ AST nodes. C++ is the next step,<br>

  after merging this change.<br>

<br>

@Regarding Tests: There is an extensive test suite for CHash[3] in our external<br>

 repository. I only included a very rudimentary unittests into this<br>

 patch. After we resolved the discussion about the best principle how to<br>

 handle AST Hashing in clang, I will integrate and consolidate our tests<br>

 into CHashTest.cpp.<br>

<br>

@New Visit Functions in CHashVisitor: The CHashVisitor overrides some<br>

  Visit* functions from the DataCollector infrastructure to include<br>

  information form cross-tree references to AST nodes (like types) to<br>

  accomplish its goal of an semantic hash. By overriding these in the<br>

  user of the DataCollector, the central DataCollector infrastructure<br>

  can remain "node local" while CHashVisitor includes links to other<br>

  nodes that can influence the semantic of the AST node.<br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">  By this separation of concers, a family of AST hashers should be<br>

  impementable without reinventing the wheel over and over again.<br>

<span class="m_352823422811810302m_-6288224909336001442m_3750292184076228268m_1578679895360628980m_3028579525732598875gmail-"><br></span></blockquote><div>I still have questions about local versus cross-ref.  It looks like you allow all the functions to call Type's, but don't allow them to call Decl's and Expr's.  Is that the distinction you are going for?  Also, a bunch of stuff label cross-ref is just sizes of a collection without processing the collection itself.</div><div><br></div><div>Coming from the ODRHash, Type's, Decl's, and Expr's were free to visit each other.  I did create two levels of Decl visiting to help avoid the recursion this caused.  And the only time a collection sized was used was when the elements of the collection were processed next.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="m_352823422811810302m_-6288224909336001442m_3750292184076228268m_1578679895360628980m_3028579525732598875gmail-">

Richard Trieu via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> writes:<br>

<br>

> I posted some questions to the phab site about the design of the hasher.  I<br>

> suggest you try your hasher on clang/test/Modules/odr_hash.h with the flag<br>

> -DFIRST.  I put many of the issues I encountered into there.<br>

<br>

</span>Thank you for that pointer. The file actually revealed some problem<br>

(read stackoverflows) that the current CHashVisitor included for CXX<br>

code. This included that CXXThisExpr references to the enclosing<br>

CXXRecord and that NamedDecl::getName() cannot be called on all CXX<br>

nodes. However, now CHashVisitor can process the odr_hash.h.<br>

<span class="m_352823422811810302m_-6288224909336001442m_3750292184076228268m_1578679895360628980m_3028579525732598875gmail-"><br>

<br>

> I'm still iffy on how basing all the hashers on *Collectors.td would work.<br>

> Is there enough flexibility for changes each hasher needs that it wouldn't<br>

> result in a lot of custom code anyways?<br>

<br>

</span>But wouldn't that be an argument to include more hashers with totally<br>

different implementations? Howver, even if there are custom changes for<br>

different hashers, basing them on TableGen and RecursiveASTVisitor is<br>

probably the best route to handle the issue. Especially, if we consider<br>

including something like D40781.<br>

<br>

Regarding ODR Hash: Can you describe the semantic of the ODR hashing in<br>

a few sentences? Under what circumstance does the hash change? What is<br>

the high-level principle? Under what circumstances is a hash collision<br>

possible.<br></blockquote><div><br></div><div>Ideally, hash collisions will only occur on token-wise identical code.  If two pieces of code have identical behavior, but was created with different tokens, their hashes would differ.  This is so that every hash mismatch would be the result of an ODR violation.  The ODRHash is a work in progress, so any unsupported AST nodes show up the same as an empty nodes, since having false hash collision is better than having a false hash match.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Vassil Vassilev on Phabricator wrote:<br>

<br>

> I am not sure if this is the right time but maybe testing and comparing<br>

> the behavior of this patch against something such as the ODRHash will<br>

> help you, me (and probably others) to understand the current patch<br>

> better.<br>

<br>

I think that ODRHash.cpp and StmtProfile.cpp include very valuable<br>

knowledge what should should be included into an AST hash and how to<br>

handle edge cases. Especially when it comes to fully handle C++ nodes.<br>

<br>

However, CHashVisitor does not try to replace ODRHash/StmtProfiler, it<br>

has its own purpose. The goal of this patch is to start consolidating<br>

these mechanisms. For this consolidation, I took the time to rewrite<br>

CHash to be based on RecursiveASTVisitor and DataCollector. As a result,<br>

the CHashVisitor dropped down to 320 lines of custom code to establish<br>

cross-tree links.<br>

<br>

> In <a href="https://reviews.llvm.org/D41416" rel="noreferrer" target="_blank">https://reviews.llvm.org/D4141<wbr>6</a>, [...], essentially I am comparing<br>

> the behavior of the ODRHash and the Stmt::Profile logic for some<br>

> cases.<br>

><br>

> Perhaps you can plugin your algorithm to it, create a pch and start<br>

> comparing your logic with the when it comes to template arguments<br>

> hashing and alike.<br>

><br>

> Of course this is not something that will give you perfect coverage<br>

> and so on but it can stress test the implementation and compare it to<br>

> the 2 major existing ones in clang.<br>

<br>

For C code, we already stress tested CHash with 2300 commits from musl<br>

and the CHash AST hash always changed when the object file changed (no<br>

false negative). Furthermore, we compiled 500 commits from lua, mbedTLS,<br>

musl, bash, CPython and PostgreSQL. So at least for C code I'm very<br>

confident that CHashVisitor handles a lot of code in the wild.<br>

<br>

chris<br>

<br>

[1] This principle breaks at the "// CrossRef" Sections. Since there I<br>

   call addData() on range<> and ArrayRef<> types to give the user of<br>

   the DataCollector the chance to include the number of children into<br>

   its hash.<br>

<br>

[2] I refer you to the previous messages from this discussion and to the<br>

    Usenix paper for the term "semantic hash".<br>

<br>

[3] <a href="https://github.com/luhsra/chash/tree/master/test" rel="noreferrer" target="_blank">https://github.com/luhsra/chas<wbr>h/tree/master/test</a><br>

<div class="m_352823422811810302m_-6288224909336001442m_3750292184076228268m_1578679895360628980m_3028579525732598875gmail-HOEnZb"><div class="m_352823422811810302m_-6288224909336001442m_3750292184076228268m_1578679895360628980m_3028579525732598875gmail-h5">--<br>

Christian Dietrich, M.Sc. (Scientific Staff)<br>

Institute for Systems Engineering (Systems and Computerarchitecture)<br>

Leibniz Universität Hannover<br>

Appelstraße 4<br>

30167 Hannover, Germany<br>

<br>

Tel:    <a href="tel:%2B49%20511%20762-19737" value="+4951176219737" target="_blank">+49 511 762-19737</a><br>

Fax:    <a href="tel:%2B49%20511%20762-19733" value="+4951176219733" target="_blank">+49 511 762-19733</a><br>

eMail:  <a href="mailto:dietrich@sra.uni-hannover.de" target="_blank">dietrich@sra.uni-hanno<wbr>ver.de</a><br>

WWW:    <a href="https://www.sra.uni-hannover.de" rel="noreferrer" target="_blank">https://www.sra.uni-ha<wbr>nnover.de</a><br>

</div></div></blockquote></div><br></div></div>