[PATCH] D106910: [Support]: Introduce the `HashBuilder` interface.

Mon Aug 16 11:30:27 PDT 2021

arames marked 2 inline comments as done.
arames added inline comments.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:86-94
+  /// Implement hashing for hashable data types, e.g. integral, enum, or
+  /// floating-point values.
+  ///
+  /// The value is simply treated like a bag of bytes.
+  template <class T>
+  std::enable_if_t<hash_builder::detail::IsHashableData<T>::value>
+  update(T Value) {
----------------
dexonsmith wrote:
> Please use `makeArrayRef()` to avoid splitting the pointer from the size.
> 
> Also, I wonder if this should take into account endianness.
Updated to use `makeArrayRef()`.

Great point. I didn't pay attention to it, but taking it into account will significantly improve the value of this class.
Supporting endianness looks pretty straight forward thanks to `Support/Endian.h`. Please take a look.
A couple questions:
1. I kept the default as `native` instead of `little` to avoid penalizing big-endian platforms. On the other hand defaulting to a stable hash may be safer. What do you think ?
2. To add support for user types, we end up with
```
template <typename HasherT, support::endianness Endianness>
void updateHash(HashBuilder<HasherT, Endianness> &HBuilder, const UserType &Value)
```
I think that is ok. Do you prefer "reserving" the `updateHash` name and going with
```
template <typename HashBuilderT>
void updateHash(HashBuilderT &HBuilder, const UserType &Value)
```
?

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:205-207
+  template <typename RangeT> void updateRange(const RangeT &Range) {
+    updateRange(adl_begin(Range), adl_end(Range));
+  }
----------------
dexonsmith wrote:
> I think it's important to have `updateRange` overloads for `ArrayRef` (at least for `uint8_t`) and `StringRef` that directly call `updateBytes` rather than iterating.
Both `ArrayRef` and `StringRef` do end up calling the version of `updateRangeImpl` with `updateBytes`, due to how their `begin` and `end` are declared.
Do you mean to do it explicitly, in case the `ArrayRef` implementation changes ?

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:101-102
+  enum TestEnumeration { TE_One = 1, TE_Two = 2 };
+  volatile int VI = 71;
+  const volatile int CVI = 72;
+  double D = 123.0;
----------------
dexonsmith wrote:
> It's not clear to me why you're including volatile ints here. Is there any reason to expect them to have different behaviour?
I had borrowed the basic tests for `hash_code`, which did include it. I agree they bring no value here. Removed.

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:26-36
+template <> auto finalHash<MD5>(MD5 &Hasher) {
+  MD5::MD5Result MD5Res;
+  Hasher.final(MD5Res);
+  return MD5Res.words();
+}
+
+template <> auto finalHash<SHA1>(SHA1 &Hasher) { return Hasher.final().str(); }
----------------
dexonsmith wrote:
> Can MD5 be updated in a prep patch to have an overload of `final` like this so this bridge code isn't needed?
https://reviews.llvm.org/D107781

This PR will have to be rebased once the prep is reviewed and lands.

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:38-44
+template <typename HasherT, typename... Ts>
+static auto computeHash(const Ts &...Args) {
+  HasherT Hasher;
+  HashBuilder<HasherT> HBuilder(Hasher);
+  HBuilder.update(Args...);
+  return finalHash(HBuilder.Hasher);
+}
----------------
dexonsmith wrote:
> Seeing this boilerplate makes me wonder if the builder API could be improved with something like:
> - Add an `Optional<HasherT>` data member and a default constructor for `HashBuilder` that constructs it.
> - Make `update*` return `HashBuilder&`.
> - Add a `HashBuilder::final()` that returns `HasherT::final()` (after adding it to MD5 as above).
> 
> Then the above logic becomes a one-liner:
> ```
> lang=c++
> HashBuilder<HasherT>().update(Args...).final()
> ```
> (maybe still worth encapsulating in a function here for testing, but maybe not; not sure)
Done. That makes it more usable indeed.
I kept the helper to keep the tests concise.

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:81-82
+  using MD5Words = std::pair<uint64_t, uint64_t>;
+  EXPECT_EQ(finalHash(Hasher),
+            MD5Words(0x7FE040B0B5311CE2ULL, 0xBE6766F347F32726ULL));
+}
----------------
dexonsmith wrote:
> It's not ideal to need to spell this out for each `HasherT`, and hardcoding the magic numbers doesn't seem quite right either. I suggest instead:
> - Unify the `HasherT` interfaces as necessary in prep commits to allow using them generically (maybe my other suggestions above are sufficient for that?).
> - Test each of these types in a standalone test in a unified `HashBuilderTest` below, dropping BasicTest/etc.
> - Add a helper function to the test fixture below to avoid boilerplate, something like:
> ```
> lang=c++
> template <class T> void checkHashableData(T Data) {
>   HasherT RawHasher;
>   auto Bytes = makeArrayRef(reinterpret_cast<uint8_t *>(&Data),...);
>   RawHasher.update(Bytes);
>   CHECK_EQ(HashBuilder<HasherT>().update(Data).final(),
>            RawHasher.final());
> }
> ```
I have cleaned this up significantly as part of the endianness support. It does not look like what you suggest, but I think it is much better than before.
There is now a single `ReferenceHashCheck` handling all hasher types, with the three big, little, and native endiannesses.

This is the only test that verifies against a hard-coded reference hash. In my opinion it is valuable to ensure the stability of hashes across platforms.
Other tests rely on equality or difference of different hashes.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106910/new/

https://reviews.llvm.org/D106910