<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Mon, Aug 7, 2017 at 1:05 AM, Christian Dietrich via cfe-dev <span dir="ltr"><<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span>Raphael Isemann <<a href="mailto:teemperor@gmail.com" target="_blank">teemperor@gmail.com</a>> writes:<br>
<br>
> That sounds like a really nice tool!<br>
<br>
</span>I also really like the general principle of avoiding work that is<br>
redundant in the end. Let's save some polar bears! And some developers<br>
from serious injuries stemming from sword fights.<br>
<span><br>
> One problem I see is that we get yet another AST hashing<br>
> implementation in clang. I'm aware of at least four of them that I<br>
> personally work with:<br>
><br>
> * The normal Stmt::Profile for Stmts<br>
<br>
</span>This does indeed look structurally similar to our implementation for<br>
statements. However, types are only included as means of their pointers<br>
and, therefore, the hash is not stable over compilations. I wonder for<br>
which applications this is actually sufficient. And, does it really<br>
speed up things to badly?<br>
<span><br>
> * The ODR hashing for AST nodes in modules.<br>
<br>
</span>This is also based on StmtProfilerWithoutPointers as far as I can<br>
tell[1]. And it is stable over invocations as it actually handles types.<br>
However, changing the order of declarations[1] will alter the hash. Also<br>
one thing we did in cHash is to use pseudrandom identifiers for<br>
different ast node types[3] to avoid some of the possible collisions. If<br>
you only add small integer values (getKind(), etc.) It is much easier to<br>
have a collision. Therefore, we add something like<br>
<br>
<UNIQUE-NODE-CLASS-ID> <FIELD1> <FIELD2> <FIELD3><br>
<br>
to the hashing stream most of the time. Ideally, the text that goes into<br>
the hash should be usable to reconstruct an AST tree with the same<br>
structure; not neccessarily with the same information.<br>
<span><br></span></blockquote><div>I don't understand your need for pseudrandom identifiers. If a specific type of AST node (for instance, a Decl) is expected, then adding to the stream the node kind (like from Decl::getKind()) is enough to avoid collisions. If any node can be expected, then first pass in an enum to differentiate the different types (Decl, Stmt, Type, Attr, etc), then add the node kind, and then continue on. Stmt::Profile and ODRHash use this method to avoid collisions. I don't see any gain from using new identifiers here.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span>
> * The ASTDiff also has one<br>
<br>
</span>Ok, then I was unable to find it. Can you give me a pointer<br>
<span><br>
> * The CloneDetector with its StmtDataCollector which can be<br>
> specialized for hashing<br>
><br>
> They all contain similar code and have similar goals. And they all<br>
> contain different kind of bugs. And that's just the implementations<br>
> that I'm aware of.<br>
<br>
</span>So, when they all contain some kinds of quirks, we really should build<br>
a good test harness around a unified implementation. Since nobody wants<br>
to break the "give me a unique identifier" silently.<br>
<span><br>
> I would very much prefer if we don't add more redundant code here but<br>
> instead try to figure out if we can get all this code in a more<br>
> central place.<br>
<br>
</span>I like that idea and I do not necessarily stick to our implementation.<br>
<span><br>
> Most of the use cases are just about<br>
><br>
>> I have this node, please forward all interesting data into this<br>
>> stream so I can [store|hash|...] it and I [do|don't] care about<br>
>> having cross-TU stable data and maybe I want to treat special nodes<br>
>> in a different way<br>
><br>
> so it's not necessarily a difficult problem to solve. If we find a<br>
> common baseline for this, we could make one visitor that people can<br>
> specialize their specific hashing needs on top.<br>
<br>
</span>Your are right, it is not a difficult problem in the end. But when doing<br>
It for the whole AST there are some weird corner cases that should be<br>
handled correctly (recursive type definitions come to my mind)<br>
<span><br>
> I already started doing this for the Stmt hashing when I refactored<br>
> the StmtDataCollector. I set a common baseline that I could turn my<br>
> two hashing implementations back then into a single one and I'm aiming<br>
> at also replacing the Stmt::Profile implementations with it.<br>
><br>
> Not sure if anyone started working on something similar for generic<br>
> AST nodes (which is probably a bit more complicated than the Stmts<br>
> which are relatively simple), but maybe we should see if we can get<br>
> some people together to work on that. Otherwise we will have a dozen<br>
> hashing implementations in a few years that no one can/will maintain.<br>
<br>
</span>Bringing cHash to mainline is based on the wish that I don't want to<br>
maintain an AST hashing mechanism in parallel to the clang development.<br>
Therefore, I would love to help with that. We should collect some<br>
requirements that the different users of unique identifiers have. And<br>
perhaps we can make them all happy by introducing a few switches to a<br>
unified visitor.<br>
<br>
Some switches and requirements come directly to my mind:<br>
<br>
- Syntactical vs. Semantic: There should be a switch to include only<br>
nodes that are actually syntacical children of the given node. Not<br>
everybody wants to include the transitive hull over all referenced AST<br>
nodes.<br>
<br>
- Multiple Hashes in one pass: With one pass over the AST I want to have<br>
not only one top-level hash, but also hashes of some interesting nodes<br>
(top-level definitions, type declarations, global variables). cHash<br>
already caches hashes for some nodes in a map to speed things up.<br>
<br>
- Debug information: I want to turn on/off the inclusion of debugging<br>
information (line numbers, file names, local identifier names).<br>
<br>
- Stable?: Should the hash be stable over compiler invocations? Or can<br>
we take some shortcuts for types/declarations/things that are not<br>
syntactially enclosed into the node.<br>
<br>
chris<br>
<div class="m_-2130632900882588549m_5587199531825099154HOEnZb"><div class="m_-2130632900882588549m_5587199531825099154h5">--<br>
Christian Dietrich, M.Sc. (Scientific Staff)<br>
Institute for Systems Engineering (Systems and Computerarchitecture)<br>
Leibniz Universität Hannover<br>
Appelstraße 4<br>
30167 Hannover, Germany<br>
<br>
Tel: <a href="tel:%2B49%20511%20762-19737" value="+4951176219737" target="_blank">+49 511 762-19737</a><br>
Fax: <a href="tel:%2B49%20511%20762-19733" value="+4951176219733" target="_blank">+49 511 762-19733</a><br>
eMail: <a href="mailto:dietrich@sra.uni-hannover.de" target="_blank">dietrich@sra.uni-hanno<wbr>ver.de</a><br>
WWW: <a href="https://www.sra.uni-hannover.de" rel="noreferrer" target="_blank">https://www.sra.uni-ha<wbr>nnover.de</a><br>
______________________________<wbr>_________________<br>
cfe-dev mailing list<br>
<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/cfe-dev</a><br>
</div></div></blockquote></div><br></div></div>