<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Dec 12, 2017 at 12:24 AM, Christian Dietrich <span dir="ltr"><<a href="mailto:dietrich@sra.uni-hannover.de" target="_blank">dietrich@sra.uni-hannover.de</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span>Richard Trieu <<a href="mailto:rtrieu@google.com" target="_blank">rtrieu@google.com</a>> writes:<br>

<br>

> On Mon, Dec 11, 2017 at 5:14 AM, Christian Dietrich <<br>

> <a href="mailto:dietrich@sra.uni-hannover.de" target="_blank">dietrich@sra.uni-hannover.de</a>> wrote:<br>

><br>

>> Richard Trieu <<a href="mailto:rtrieu@google.com" target="_blank">rtrieu@google.com</a>> writes:<br>

>><br>

>> > I don't understand your need for pseudrandom identifiers.  If a specific<br>

>> > type of AST node (for instance, a Decl) is expected, then adding to the<br>

>> > stream the node kind (like from Decl::getKind()) is enough to avoid<br>

>> > collisions.  If any node can be expected, then first pass in an enum to<br>

>> > differentiate the different types (Decl, Stmt, Type, Attr, etc), then add<br>

>> > the node kind, and then continue on.  Stmt::Profile and ODRHash use this<br>

>> > method to avoid collisions.  I don't see any gain from using new<br>

>> > identifiers here.<br>

>><br>

>> Ok, so let's explain a little bit my idea here: Of course, including the<br>

>> Decl::getKind() also distinguishes node types from each other. However,<br>

>> these values come from enums and are, therefore, small integers. Small<br>

>> integers (SI) occur very often in an AST. So, if a sequence of addData()<br>

>> operations produces the same sequence of SIs because node types are not<br>

>> distinguishable from SI that occur in the AST, we get a collision[1].<br>

>><br>

> How does the result of Decl::getKind() being a small integer allows any<br>

> chance of collision?  I'll admit that arbitrary sub-sequences of different<br>

> hash streams can be equal, but we shouldn't be hashing sub-sequences.<br>

<br>

</span>Sure, we shouldn't take an arbitrary subsequence of the stream and hash<br>

it. That would be a bad idea.<br>

<span><br>

> Decl<br>

> <Decl::getKind()><br>

><br>

> Decl with two fields<br>

> <Decl::getKind()> <Field 1> < Field 2><br>

><br>

> Decl with one sub-Decl<br>

> <Decl::getKind()> <Stream of sub-decl><br>

><br>

> Decl with arbitrary number of sub-Decl's<br>

> <Decl::getKind()> <Number of sub-Decl's> <Stream of sub-decl 1> ... <Stream<br>

> of sub-decl N><br>

><br>

> The result of Decl::getKind() will basically specify the format of the hash<br>

> stream.  I think better controlling of the input stream is a better way of<br>

> avoiding collisions then using magic numbers as identifiers.<br>

><br>

> We could see the same sequence for different ASTs, if an AST node has a<br>

<br>

</span>Sure, this would definitly be a better option. A quick look into the<br>

RecursiveASTVisitor tells me that there are at least 76 for (...) loops.<br>

So we have to add them.<br>

<br>

Perhaps we can combine both, length-indicators and and<br>

non-small-integers, by using llvm::hash on the <Decl::getKind()> values.<br>

As the documentation says, we have to spent less than 20 cycles per<br>

32-bit value on this. By this, we avoid the magic numbers. I will start<br>

on doing the changes to incorporate the length fields.<br>

<br>

Furthermore, I'm not sure how the clang community handles the "how to<br>

find a reviewer for my patch". Are people adding themselves to be a<br>

reviewer? Or do I just add people how I think could be interessted in<br>

reviewing the change?<br>

<br></blockquote><div> People will add themselves as reviewers, so you can CC people that you think may have an interest in it.  If neither happens, adding a code owner and they will help find reviewers.</div><div><br></div><div>Personally, I've only looked through small portions of your patch partially because I have other projects to work on and partially because I was waiting on your answers to Richard Smith's queries.  Basically, your plan for integrating with the existing hash functions and how the other hash functions, especially ODRHash, doesn't fill your need.  From what I gather, you think the ODRHash is too strict on the AST structure.  I would like to understand what differences in the AST that you are fine ignoring.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

chris<br>

<div class="m_9165417226980234417m_-1276671230148914722HOEnZb"><div class="m_9165417226980234417m_-1276671230148914722h5">--<br>

Christian Dietrich, M.Sc. (Scientific Staff)<br>

Institute for Systems Engineering (Systems and Computerarchitecture)<br>

Leibniz Universität Hannover<br>

Appelstraße 4<br>

30167 Hannover, Germany<br>

<br>

Tel:    <a href="tel:%2B49%20511%20762-19737" value="+4951176219737" target="_blank">+49 511 762-19737</a><br>

Fax:    <a href="tel:%2B49%20511%20762-19733" value="+4951176219733" target="_blank">+49 511 762-19733</a><br>

eMail:  <a href="mailto:dietrich@sra.uni-hannover.de" target="_blank">dietrich@sra.uni-hanno<wbr>ver.de</a><br>

WWW:    <a href="https://www.sra.uni-hannover.de" rel="noreferrer" target="_blank">https://www.sra.uni-ha<wbr>nnover.de</a><br>

</div></div></blockquote></div><br></div></div>