[PATCH] D106910: [Support]: Introduce the `HashBuilder` interface.

Tue Aug 3 15:08:19 PDT 2021

dexonsmith requested changes to this revision.
dexonsmith added a comment.
This revision now requires changes to proceed.

Updates look great; a few more comments inline.

Since the ultimate goal is to make it easy for higher-level logic to use a stable hash across different architectures, I realize that endianness could be important too for the hashable-data overloads. Have you considered what to do there?

One idea: add an endianness template parameter to HashBuilder:

  template <class HasherT, support::endianness Endianness = support::native>
  class HashBuilder;

then change the `update` overloads for hashable data to hash the result of `support::endian::byte_swap(V, Endianness)`. In the common case (`native`) the copy/etc should get optimized out. WDYT?

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:86-94
+  /// Implement hashing for hashable data types, e.g. integral, enum, or
+  /// floating-point values.
+  ///
+  /// The value is simply treated like a bag of bytes.
+  template <class T>
+  std::enable_if_t<hash_builder::detail::IsHashableData<T>::value>
+  update(T Value) {
----------------
Please use `makeArrayRef()` to avoid splitting the pointer from the size.

Also, I wonder if this should take into account endianness.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:195
+  typename std::enable_if<(sizeof...(Ts) > 1)>::type update(const Ts &...Args) {
+    std::tuple<const Ts &...> Unused{(update(Args), Args)...};
+  }
----------------
I think you can drop the name here too.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:205-207
+  template <typename RangeT> void updateRange(const RangeT &Range) {
+    updateRange(adl_begin(Range), adl_end(Range));
+  }
----------------
I think it's important to have `updateRange` overloads for `ArrayRef` (at least for `uint8_t`) and `StringRef` that directly call `updateBytes` rather than iterating.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:220-224
+  /// Update the hash with a "bag of bytes". This does not rely on `HasherT` to
+  /// take `Size` into account.
+  void updateBytes(const uint8_t *Ptr, size_t Size) {
+    updateBytes(ArrayRef<uint8_t>(Ptr, Size));
+  }
----------------
I don't think we should add this overload. The rare call site can call `makeArrayRef` itself, encouraging it to push ArrayRef further up the stack.

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:26-36
+template <> auto finalHash<MD5>(MD5 &Hasher) {
+  MD5::MD5Result MD5Res;
+  Hasher.final(MD5Res);
+  return MD5Res.words();
+}
+
+template <> auto finalHash<SHA1>(SHA1 &Hasher) { return Hasher.final().str(); }
----------------
Can MD5 be updated in a prep patch to have an overload of `final` like this so this bridge code isn't needed?

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:38-44
+template <typename HasherT, typename... Ts>
+static auto computeHash(const Ts &...Args) {
+  HasherT Hasher;
+  HashBuilder<HasherT> HBuilder(Hasher);
+  HBuilder.update(Args...);
+  return finalHash(HBuilder.Hasher);
+}
----------------
Seeing this boilerplate makes me wonder if the builder API could be improved with something like:
- Add an `Optional<HasherT>` data member and a default constructor for `HashBuilder` that constructs it.
- Make `update*` return `HashBuilder&`.
- Add a `HashBuilder::final()` that returns `HasherT::final()` (after adding it to MD5 as above).

Then the above logic becomes a one-liner:
```
lang=c++
HashBuilder<HasherT>().update(Args...).final()
```
(maybe still worth encapsulating in a function here for testing, but maybe not; not sure)

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:81-82
+  using MD5Words = std::pair<uint64_t, uint64_t>;
+  EXPECT_EQ(finalHash(Hasher),
+            MD5Words(0x7FE040B0B5311CE2ULL, 0xBE6766F347F32726ULL));
+}
----------------
It's not ideal to need to spell this out for each `HasherT`, and hardcoding the magic numbers doesn't seem quite right either. I suggest instead:
- Unify the `HasherT` interfaces as necessary in prep commits to allow using them generically (maybe my other suggestions above are sufficient for that?).
- Test each of these types in a standalone test in a unified `HashBuilderTest` below, dropping BasicTest/etc.
- Add a helper function to the test fixture below to avoid boilerplate, something like:
```
lang=c++
template <class T> void checkHashableData(T Data) {
  HasherT RawHasher;
  auto Bytes = makeArrayRef(reinterpret_cast<uint8_t *>(&Data),...);
  RawHasher.update(Bytes);
  CHECK_EQ(HashBuilder<HasherT>().update(Data).final(),
           RawHasher.final());
}
```

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:102
+template <typename HasherT>
+class TypedHashBuilderTest : public testing::Test {};
+using HasherTypes = ::testing::Types<MD5, SHA1, SHA256>;
----------------
I suggest dropping the `Typed` and calling this simply `HashBuilderTest`.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106910/new/

https://reviews.llvm.org/D106910