[cfe-dev] [RFC] cHash: Integration of AST hashing

Thu Aug 3 03:41:53 PDT 2017

Hello!

First, I'd like to give you a little bit of context, before I get into
the details what all of this has to do with the cfe-dev list.

TL;DR: cHash calculates an hash over the AST and, therefore, allows
       the detection of redundant builds.

In the last months, we have been working on a mechanism to avoid
redundant recompilations. In C/C++ projects, we observe that a change
to a central header file can cause a lot of compiler invocations.
However, not all compiler runs will lead to an object file that
differs from the last invocation. For example, if I add a typedef
that is never used in any .c file, none of the object files will
change.

Our approach to this problem is that we hook into the compiler front
end, after the parsing and the semantic analysis and recursively
calculate an AST hash. We start with all definitions (variables and
functions) and include all relevant AST-node properties and the hashes
of all children and referenced types. Thereby, we get a hash value
that changes if a changed translation unit would result in a different
object file. As a byproduct, we get a hash for every function, every
referenced type, and all global variables.

So far, we have only implemented cHash for C, not yet for C++. We have
already published our current version of the clang plugin[1] and are
also working on a GCC version.

Furthermore, we presented our approach at USENIX ATC (and got a best
paper award, yay!)[2]. Now, we consider proposing some parts of cHash
for mainline.

Our impression is that the AST hashing mechanism[3], which calculates
a semantic fingerprint for a given AST node is not only useful to
detect redundant builds. AST hashes are (computationally) fast to
calculate, easy to store, easy to calculate, and they capture all of
the semantic of an AST node. These properties should ease other techniques:

- partial recompilation: we now know at an early state that a function
  body has changed directly or indirectly.
- change impact analysis: we can now quantify which parts of the
  program is actually touched, even if we only modfiy a type.
- project-wide analysis: have all types with the same name also
  the name hash, or is there an inconsistency?

Besides the AST hashing mechanism, cHash also includes the caching and
compiler abortion logic. However, in my opinion this is better located
in an external tool, or a plugin. For example, a more close
integration of CLang with CCache, could give the compilation speedups
that we have observed in our prototype.

Before I start on preparing parts of the chash plugin to go upstream
(there is definitely some cleanup required), I would love to hear your
opinions on the following topics:

- Would you like to see the AST hashing in clang mainline?
- Would you like to have a standalone tool that calculates AST hashes?
- Should the detection of redundant compilations be part of CLang mainline?
- Should we integrate the test-suite[4] into the repo? If yes, how?

chris

[1] https://github.com/luhsra/chash
[2] https://www.usenix.org/conference/atc17/technical-sessions/presentation/dietrich
[3] https://github.com/luhsra/chash/blob/master/clang-plugin/hash-visitor.cc
[4] https://github.com/luhsra/chash/tree/master/test
    https://github.com/luhsra/chash/blob/master/test/control_flow/if/IfStmt_10.c
-- 
Christian Dietrich, M.Sc. (Scientific Staff)
Institute for Systems Engineering (Systems and Computerarchitecture)
Leibniz Universität Hannover
Appelstraße 4
30167 Hannover, Germany

Tel:    +49 511 762-19737
Fax:    +49 511 762-19733
eMail:  dietrich at sra.uni-hannover.de
WWW:    https://www.sra.uni-hannover.de