[cfe-dev] [RFC] cHash: Integration of AST hashing

Thu Aug 3 06:55:36 PDT 2017

That sounds like a really nice tool!

One problem I see is that we get yet another AST hashing
implementation in clang. I'm aware of at least four of them that I
personally work with:

* The normal Stmt::Profile for Stmts
* The ODR hashing for AST nodes in modules.
* The ASTDiff also has one
* The CloneDetector with its StmtDataCollector which can be
specialized for hashing

They all contain similar code and have similar goals. And they all
contain different kind of bugs. And that's just the implementations
that I'm aware of.

I would very much prefer if we don't add more redundant code here but
instead try to figure out if we can get all this code in a more
central place.

Most of the use cases are just about

> I have this node, please forward all interesting data into this stream so I can [store|hash|...] it and I [do|don't] care about having cross-TU stable data and maybe I want to treat special nodes in a different way

so it's not necessarily a difficult problem to solve. If we find a
common baseline for this, we could make one visitor that people can
specialize their specific hashing needs on top.

I already started doing this for the Stmt hashing when I refactored
the StmtDataCollector. I set a common baseline that I could turn my
two hashing implementations back then into a single one and I'm aiming
at also replacing the Stmt::Profile implementations with it.

Not sure if anyone started working on something similar for generic
AST nodes (which is probably a bit more complicated than the Stmts
which are relatively simple), but maybe we should see if we can get
some people together to work on that. Otherwise we will have a dozen
hashing implementations in a few years that no one can/will maintain.

- Raphael

2017-08-03 12:41 GMT+02:00 Christian Dietrich via cfe-dev
<cfe-dev at lists.llvm.org>:
> Hello!
>
> First, I'd like to give you a little bit of context, before I get into
> the details what all of this has to do with the cfe-dev list.
>
> TL;DR: cHash calculates an hash over the AST and, therefore, allows
>        the detection of redundant builds.
>
> In the last months, we have been working on a mechanism to avoid
> redundant recompilations. In C/C++ projects, we observe that a change
> to a central header file can cause a lot of compiler invocations.
> However, not all compiler runs will lead to an object file that
> differs from the last invocation. For example, if I add a typedef
> that is never used in any .c file, none of the object files will
> change.
>
> Our approach to this problem is that we hook into the compiler front
> end, after the parsing and the semantic analysis and recursively
> calculate an AST hash. We start with all definitions (variables and
> functions) and include all relevant AST-node properties and the hashes
> of all children and referenced types. Thereby, we get a hash value
> that changes if a changed translation unit would result in a different
> object file. As a byproduct, we get a hash for every function, every
> referenced type, and all global variables.
>
> So far, we have only implemented cHash for C, not yet for C++. We have
> already published our current version of the clang plugin[1] and are
> also working on a GCC version.
>
> Furthermore, we presented our approach at USENIX ATC (and got a best
> paper award, yay!)[2]. Now, we consider proposing some parts of cHash
> for mainline.
>
> Our impression is that the AST hashing mechanism[3], which calculates
> a semantic fingerprint for a given AST node is not only useful to
> detect redundant builds. AST hashes are (computationally) fast to
> calculate, easy to store, easy to calculate, and they capture all of
> the semantic of an AST node. These properties should ease other techniques:
>
> - partial recompilation: we now know at an early state that a function
>   body has changed directly or indirectly.
> - change impact analysis: we can now quantify which parts of the
>   program is actually touched, even if we only modfiy a type.
> - project-wide analysis: have all types with the same name also
>   the name hash, or is there an inconsistency?
>
> Besides the AST hashing mechanism, cHash also includes the caching and
> compiler abortion logic. However, in my opinion this is better located
> in an external tool, or a plugin. For example, a more close
> integration of CLang with CCache, could give the compilation speedups
> that we have observed in our prototype.
>
> Before I start on preparing parts of the chash plugin to go upstream
> (there is definitely some cleanup required), I would love to hear your
> opinions on the following topics:
>
> - Would you like to see the AST hashing in clang mainline?
> - Would you like to have a standalone tool that calculates AST hashes?
> - Should the detection of redundant compilations be part of CLang mainline?
> - Should we integrate the test-suite[4] into the repo? If yes, how?
>
> chris
>
> [1] https://github.com/luhsra/chash
> [2] https://www.usenix.org/conference/atc17/technical-sessions/presentation/dietrich
> [3] https://github.com/luhsra/chash/blob/master/clang-plugin/hash-visitor.cc
> [4] https://github.com/luhsra/chash/tree/master/test
>     https://github.com/luhsra/chash/blob/master/test/control_flow/if/IfStmt_10.c
> --
> Christian Dietrich, M.Sc. (Scientific Staff)
> Institute for Systems Engineering (Systems and Computerarchitecture)
> Leibniz Universität Hannover
> Appelstraße 4
> 30167 Hannover, Germany
>
> Tel:    +49 511 762-19737
> Fax:    +49 511 762-19733
> eMail:  dietrich at sra.uni-hannover.de
> WWW:    https://www.sra.uni-hannover.de
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev