[PATCH] D102943: Hashing: use a 64-bit storage type on all platforms.

Duncan P. N. Exon Smith via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Jun 3 14:05:32 PDT 2021


dexonsmith added reviewers: jansvoboda11, Bigcheese.
dexonsmith added a subscriber: Bigcheese.
dexonsmith added a comment.

This looks much cleaner; thanks for the update!

I'm not sure this has decided whether it's writing special case code for building a CompilerInvocation's module context hash, or creating (and using) a new generalized hash builder.

If the latter, I think it'd be worth splitting the hash builder out and landing it separately (in LLVM), then adopting it here; I think that's what you were doing with `stable_hash_code` (although that had the downsides of reifying city hash and not reusing the Hasher tools we have). This could take a few rounds of review to get the design right / get consensus on... not sure what you're able to invest. A few design questions I think you'd want to think about:

- What's the interaction between hash_value and (e.g.) updateHash? (It'd be nice if implementing `updateHash` gave you an implementation of `hash_value` for free.)
- Should `updateHash` be templated on the HasherT, or use type erasure? (E.g., a stripped down `HashBuilder` type could have a single member that was `llvm::unique_function<void(ArrayRef<uint8_t>)>`, and all the `updateHash()` functions are just free functions)
- Should the hash builder reference a HasherT, or own it?
- What are the API affordances for hashing structure? (E.g., ensuring that calls to updateHash with "a" and "bc" give a different hash than "ab" and "c".)

On the other hand, if you're not going to have time to generalize this, I'd suggest simplifying the code a bit, merging the "detail" subclass into the builder type. I think ModuleContextHashBuilder might be a better name than ModuleHash; I think in clangBasic, rather than clangSerialization, since it could be used for things other than serialization and doesn't really have any dependencies there.

In either case, there's a decent chunk of new code, and I think you should add some unit test coverage (a new file in llvm/unittests/Support/, in a separate patch, and/or clang/unittests/ (depending on direction)).

A have a bunch of comments inline, some which would only apply depending on which direction you're going with this.

@Bigcheese and @jansvoboda11 , any thoughts?



================
Comment at: clang/include/clang/Serialization/ModuleHash.h:19-22
+/// Wrapper interface around a hash type.
+/// The only methods expected from the hash type are:
+/// * a default constructor
+/// * void update(llvm::ArrayRef<uint8_t>)
----------------
I think this builder is an interesting idea, and could be generally useful.

- It'd fit better in llvm/Support I think than hidden here.
- Taking the hasher by reference might be more flexible. Clients can ask the builder to do some of the hashing, but do some on their own.



================
Comment at: clang/include/clang/Serialization/ModuleHash.h:36
+/// ```
+template <typename HashValueT> class Hash {
+private:
----------------
I don't think the names `HashValueT` and `Hash` are quite right (I tend to think of "hash" as a value as well).

- Maybe HasherT instead of HashValueT?
- Maybe HashBuilder instead of Hash?


Nit: I think (?) we more often use `class` for type template parameters than `typename`.


================
Comment at: clang/include/clang/Serialization/ModuleHash.h:43
+public:
+  HashValueT HashValue;
+
----------------
For a builder interface, it might be more convenient to take the hasher by reference.


================
Comment at: clang/include/clang/Serialization/ModuleHash.h:47-51
+  template <typename InputIteratorT>
+  void updateRange(InputIteratorT first, InputIteratorT last) {
+    for (auto It = first; It != last; ++It)
+      update(*It);
+  }
----------------
Because this doesn't hash the distance of the range, it won't see a difference between `{{0}, {1, 2}}` and `{{0, 1}, {2}}`. Sometimes that's fine, but sometimes it isn't. I worry about generalizing this... unless the onus is on the caller to hash the size when appropriate?


================
Comment at: clang/include/clang/Serialization/ModuleHash.h:54-57
+  template <typename T>
+  struct is_hashable_data
+      : std::integral_constant<bool, llvm::is_integral_or_enum<T>::value ||
+                                         std::is_pointer<T>::value> {};
----------------
I'm not sure we want it to be easy to hash pointers by-value. That's appropriate for an in-memory hash table, but not much else.


================
Comment at: clang/include/clang/Serialization/ModuleHash.h:77-83
+  struct update_impl<T,
+                     typename std::enable_if<is_hashable_data<T>::value>::type>
+      : update_impl<llvm::ArrayRef<uint8_t>> {
+    update_impl(Hash &H, T value)
+        : update_impl<llvm::ArrayRef<uint8_t>>(
+              H, llvm::ArrayRef<uint8_t>(reinterpret_cast<uint8_t *>(&value),
+                                         sizeof(value))) {}
----------------
Is there a way of reducing the boilerplate for these `update_impl` functions, which are currently types? Could they maybe be free functions?

If you hit trouble with using std::enable_if_t in a function template, I seem to remember needing to use something like this in the past:
```
lang=c++
std::enable_if_t<whatever<T>::value, bool> = false
```



================
Comment at: clang/include/clang/Serialization/ModuleHash.h:96-104
+  // Forward `llvm::StringRef` to `llvm::ArrayRef<uint8_t>`.
+  template <>
+  struct update_impl<llvm::StringRef> : update_impl<llvm::ArrayRef<uint8_t>> {
+    update_impl(Hash &H, llvm::StringRef Value)
+        : update_impl<llvm::ArrayRef<uint8_t>>(
+              H, llvm::ArrayRef<uint8_t>(
+                     reinterpret_cast<const uint8_t *>(Value.data()),
----------------
I wonder if this should also hash the size of the StringRef. Maybe it should also be called `updateRange`.

MD5/SHA1/etc. hashers have the perspective that they're seeing a flat stream of bytes, and I think this streams in the bytes from the string directly. I think the right way to model hashing the context hash is as a serialization, and serializing a string ref requires serializing the size before the content, so that the deserializer knows how big the string is.


================
Comment at: clang/lib/Frontend/CompilerInvocation.cpp:4483-4484
   // Start the signature with the compiler version.
-  // FIXME: We'd rather use something more cryptographically sound than
-  // CityHash, but this will do for now.
-  hash_code code = hash_value(getClangFullRepositoryVersion());
----------------
I'm not sure MD5 is cryptographically sound either, so it probably makes sense to leave this FIXME in / update it.


================
Comment at: clang/lib/Frontend/CompilerInvocation.cpp:4580
 
-  return llvm::APInt(64, code).toString(36, /*Signed=*/false);
+  return llvm::APInt(64, Hash.getValue()).toString(36, /*Signed=*/false);
 }
----------------
Do we just want 64 bits, or do we want the full hash?


================
Comment at: clang/lib/Serialization/ModuleFileExtension.cpp:16-18
-llvm::hash_code ModuleFileExtension::hashExtension(llvm::hash_code Code) const {
-  return Code;
-}
----------------
Seems like this could stay in the `.cpp`, not sure if moving it to the header is important (or could be done separately to make the functionality changes more obvious).


================
Comment at: llvm/include/llvm/Support/VersionTuple.h:164-167
+template <typename HashT> void updateHash(HashT &Hash, const VersionTuple &V) {
+  Hash.update({V.getMajor(), V.getMinor().getValueOr(0),
+               V.getSubminor().getValueOr(0), V.getBuild().getValueOr(0)});
+}
----------------
Could this be declared as a friend, similar to `hash_value`?

Note that if the new builder code doesn't live in `llvm/`, probably this `updateHash` free function should be an implementation detail in the CompilerInvocation / ModuleHash code.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D102943/new/

https://reviews.llvm.org/D102943



More information about the cfe-commits mailing list