[cfe-dev] [RFC] cHash: Integration of AST hashing
Johannes Altmanninger via cfe-dev
cfe-dev at lists.llvm.org
Mon Aug 14 02:00:47 PDT 2017
Raphael Isemann via cfe-dev <cfe-dev at lists.llvm.org> writes:
> That sounds like a really nice tool!
>
> One problem I see is that we get yet another AST hashing
> implementation in clang. I'm aware of at least four of them that I
> personally work with:
>
> * The normal Stmt::Profile for Stmts
> * The ODR hashing for AST nodes in modules.
> * The ASTDiff also has one
> * The CloneDetector with its StmtDataCollector which can be
> specialized for hashing
>
> They all contain similar code and have similar goals. And they all
> contain different kind of bugs. And that's just the implementations
> that I'm aware of.
I have done some work on making the StmtDataCollector approach
extensible [1] (it was not possible to subclass it due to the recurring
template pattern style inheritance of ConstStmtVisitor).
I think the DataCollector approach is the best one. Note that it only
collects data from a single node, so it makes sense to combine it with a
RecursiveASTVisitor in order to hash entire subtrees.
It should be possible to rewrite ODR hashing and the others with this!
(Currently the hashing for IntegerLiteral is not stable because it uses
hash_value for APInt but I imagine this can be fixed easily)
>
> I would very much prefer if we don't add more redundant code here but
> instead try to figure out if we can get all this code in a more
> central place.
>
> Most of the use cases are just about
>
>> I have this node, please forward all interesting data into this stream so I can [store|hash|...] it and I [do|don't] care about having cross-TU stable data and maybe I want to treat special nodes in a different way
>
> so it's not necessarily a difficult problem to solve. If we find a
> common baseline for this, we could make one visitor that people can
> specialize their specific hashing needs on top.
>
> I already started doing this for the Stmt hashing when I refactored
> the StmtDataCollector. I set a common baseline that I could turn my
> two hashing implementations back then into a single one and I'm aiming
> at also replacing the Stmt::Profile implementations with it.
>
> Not sure if anyone started working on something similar for generic
> AST nodes
I am also working on a DeclDataCollector. I still have to come up with
a good way to test that two nodes indeed have a different hash value.
> (which is probably a bit more complicated than the Stmts
> which are relatively simple), but maybe we should see if we can get
> some people together to work on that. Otherwise we will have a dozen
> hashing implementations in a few years that no one can/will maintain.
>
> - Raphael
>
>
Johannes
[1] https://reviews.llvm.org/D36664
>
>
>
>
> 2017-08-03 12:41 GMT+02:00 Christian Dietrich via cfe-dev
> <cfe-dev at lists.llvm.org>:
>> Hello!
>>
>> First, I'd like to give you a little bit of context, before I get into
>> the details what all of this has to do with the cfe-dev list.
>>
>> TL;DR: cHash calculates an hash over the AST and, therefore, allows
>> the detection of redundant builds.
>>
>> In the last months, we have been working on a mechanism to avoid
>> redundant recompilations. In C/C++ projects, we observe that a change
>> to a central header file can cause a lot of compiler invocations.
>> However, not all compiler runs will lead to an object file that
>> differs from the last invocation. For example, if I add a typedef
>> that is never used in any .c file, none of the object files will
>> change.
>>
>> Our approach to this problem is that we hook into the compiler front
>> end, after the parsing and the semantic analysis and recursively
>> calculate an AST hash. We start with all definitions (variables and
>> functions) and include all relevant AST-node properties and the hashes
>> of all children and referenced types. Thereby, we get a hash value
>> that changes if a changed translation unit would result in a different
>> object file. As a byproduct, we get a hash for every function, every
>> referenced type, and all global variables.
>>
>> So far, we have only implemented cHash for C, not yet for C++. We have
>> already published our current version of the clang plugin[1] and are
>> also working on a GCC version.
>>
>> Furthermore, we presented our approach at USENIX ATC (and got a best
>> paper award, yay!)[2]. Now, we consider proposing some parts of cHash
>> for mainline.
>>
>> Our impression is that the AST hashing mechanism[3], which calculates
>> a semantic fingerprint for a given AST node is not only useful to
>> detect redundant builds. AST hashes are (computationally) fast to
>> calculate, easy to store, easy to calculate, and they capture all of
>> the semantic of an AST node. These properties should ease other techniques:
>>
>> - partial recompilation: we now know at an early state that a function
>> body has changed directly or indirectly.
>> - change impact analysis: we can now quantify which parts of the
>> program is actually touched, even if we only modfiy a type.
>> - project-wide analysis: have all types with the same name also
>> the name hash, or is there an inconsistency?
>>
>> Besides the AST hashing mechanism, cHash also includes the caching and
>> compiler abortion logic. However, in my opinion this is better located
>> in an external tool, or a plugin. For example, a more close
>> integration of CLang with CCache, could give the compilation speedups
>> that we have observed in our prototype.
>>
>> Before I start on preparing parts of the chash plugin to go upstream
>> (there is definitely some cleanup required), I would love to hear your
>> opinions on the following topics:
>>
>> - Would you like to see the AST hashing in clang mainline?
>> - Would you like to have a standalone tool that calculates AST hashes?
>> - Should the detection of redundant compilations be part of CLang mainline?
>> - Should we integrate the test-suite[4] into the repo? If yes, how?
>>
>> chris
>>
>> [1] https://github.com/luhsra/chash
>> [2] https://www.usenix.org/conference/atc17/technical-sessions/presentation/dietrich
>> [3] https://github.com/luhsra/chash/blob/master/clang-plugin/hash-visitor.cc
>> [4] https://github.com/luhsra/chash/tree/master/test
>> https://github.com/luhsra/chash/blob/master/test/control_flow/if/IfStmt_10.c
>> --
>> Christian Dietrich, M.Sc. (Scientific Staff)
>> Institute for Systems Engineering (Systems and Computerarchitecture)
>> Leibniz Universität Hannover
>> Appelstraße 4
>> 30167 Hannover, Germany
>>
>> Tel: +49 511 762-19737
>> Fax: +49 511 762-19733
>> eMail: dietrich at sra.uni-hannover.de
>> WWW: https://www.sra.uni-hannover.de
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
More information about the cfe-dev
mailing list