[PATCH] D106910: [Support]: Introduce the `HashBuilder` interface.

Mon Aug 16 15:46:44 PDT 2021

dexonsmith added a comment.

I think this is getting close... I've found a few new things, and I apologize for not seeing them sooner.

Two high-level points, then more comments inline.

Firstly, I'm a bit concerned about the semantics of `update(ArrayRef)` (and `update(StringRef)` by proxy) due to inconsistency with `HasherT`.

- `HasherT::update(ArrayRef<uint8_t>)` hashes the raw bytes pointed to by the array.
- `HashBuilder<HasherT>::update(ArrayRef)` (including for `uint8_t`) hashes the size first, then the raw bytes.

This inconsistency could be confusing and error-prone.

I wonder if we should (mostly) avoid the word `update` in `HashBuilder` and use `add` instead. This also fixes the `updateRange` problem I point out inline (`addRange` works just fine without a preposition). We could leave `update()` around *just* for `ArrayRef<uint8_t>` (and maybe `StringRef`), forwarding to `addElements`/`addRangeElements` (see inline). The result would be:

- `HashBuilder<HasherT>::update` does exactly the same thing as `HasherT::update` (also working with StringRef, for convenience)
- `HashBuilder<HasherT>::add` (and `add*`) handles higher-level semantics, including primitives, endianness, ranges, etc.

Secondly, a base class that lacks the `Endianness` template parameter should probably be factored out to avoid bloating code once this starts getting used. I think there should be three layers.

- `HashBuilderBase` can take just a `HasherT` parameter, and implement `update` (new meaning, matching `HasherT::update`), `final()`, and all other things that don't care about endianness.
- `HashBuilderImpl` layers the `add*` family on top and takes an `Endianness` parameter. But `native` is illegal here.
- `HashBuilder` canonicalizes `native` to `system_endianness()` to avoid code bloat, something like:

  template <typename HasherT, support::endianness Endianness>
  class HashBuilderImpl {
    static_assert(Endianness != native, "HashBuilder should canonicalize endianness");
    // etc.
  };

  template <typename HasherT, support::Endianness Endianness = native,
                              support::Endianness EffectiveEndianness =
                                  Endianness == native
                                     ? support::endian::system_endianness()
                                     : Endianness>
  class HashBuilder
      : HashBuilderImpl<HasherT, EffectiveEndianness> {
    using HashBuilderImplT = HashBuilderImpl<HasherT, EffectiveEndianness>;

  public:
    HashBuilder() = default;
    HashBuilder(HasherT &Hasher) : HashBuilderImplT(Hasher) {}
  };

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:80
+      : std::integral_constant<bool, is_integral_or_enum<U>::value ||
+                                         std::is_floating_point<U>::value> {};
+
----------------
I wonder if `std::is_floating_point` should be excluded.
- Float values might generally a bit iffy to be included in a stable hash, because of all the floating point modes that can affect their values.
- `long double` has a platform-dependent mantissa.

This would disable `HashBuilder::add`, forcing clients to think about what they actually want to do with it. WDYT?

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:83-84
+public:
+  Optional<HasherT> OptionalHasher;
+  HasherT &Hasher;
+
----------------
I think these should be private, and `Hasher` can have a `getHasher()` accessor. This'll make it easier to make HashBuilder move/copy-friendly in the future if for some reason that's useful.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:87-90
+  template <typename... ArgTypes>
+  explicit HashBuilder(ArgTypes &&...Args)
+      : OptionalHasher(in_place, std::forward<ArgTypes>(Args)...),
+        Hasher(*OptionalHasher) {}
----------------
Can this be simplified?
```
lang=c++
HashBuilder() : OptionalHasher(in_place), Hasher(*OptionalHasher) {}
```

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:96-98
+  auto final() {
+    return Hasher.final();
+  }
----------------
I made some comments on the patch this one depends on (https://reviews.llvm.org/D107781 -- FYI, I linked them using the add parent/child revision button above).

I suggest the following to better align with the hashers themselves:
- Remove the template / enable_if. Require hashers to provide this.
- Return `StringRef`. Expect hashers to do the same.
- Add a `result()` method also returning StringRef, matching the one from hashers.

I also think an analogue for the static `hash()` that the hashers have would be useful, at least in the tests. Maybe:
```
lang=c++
using HashArrayT = decltype(HasherT::hash(ArrayRef<uint8_t>()));
static HashArrayT toArrayImpl(StringRef HashString) {
  assert(HashString.size() == sizeof(HashArrayT));
  HashArrayT HashArray;
  llvm::copy(HashString, HashArray.begin());
  return HashArray;
}

HashArrayT finalHash() { return toArrayImpl(Hasher.final()); }
HashArrayT resultHash() { return toArrayImpl(Hasher.result()); }
```

I *don't* think there should be a `HashBuilder::hash()` function (static or otherwise). Forwarding to `HashBuilder::add()` would be confusing / error-prone, since then `HashBuilder<HasherT>::hash(ArrayRef<uint8_t>)` would do something different from `HasherT::hash()`. Forwarding to `HasherT::hash` would also be confusing, and doesn't really add any value since clients can already do that themselves.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:100-107
+  /// Implement hashing for hashable data types, e.g. integral, enum, or
+  /// floating-point values.
+  template <typename T>
+  std::enable_if_t<IsHashableData<T>::value, HashBuilder &> update(T Value) {
+    Value = support::endian::byte_swap(Value, Endianness);
+    return updateBytes(
+        makeArrayRef(reinterpret_cast<const uint8_t *>(&Value), sizeof(Value)));
----------------
I wonder if the body of this should be exposed as a general helper for clients, using a weaker `std::enable_if` check. I.e.:
- Keep the signature and `std::enable_if` check for this overload of `add`.
- Move the body to a new function, called something like `addRawBytes`, which takes a `const T&` and has a weaker `std::enable_if` check (maybe `std::is_primitive`? or what does `endian::byte_swap` support?).
- Change `add` to call the new function.

WDYT?

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:275-277
+  template <typename RangeT> HashBuilder &updateRange(const RangeT &Range) {
+    return updateRange(adl_begin(Range), adl_end(Range));
+  }
----------------
I think there's something off with the naming. `updateRange` sounds like it is mutating the range itself, rather than updating the hash. Could add a preposition, or renaming to `add`.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:284
+  /// contraints.
+  HashBuilder &updateBytes(ArrayRef<uint8_t> Data) {
+    Hasher.update(Data);
----------------
Similar naming problem here (I'm pretty sure it's my fault), but worse, since I now realize there's nothing inherent about the word "bytes" that implies the size of the range is skipped. Someone could have a container of bytes, and expect that this encoded the container.

Stepping back, I feel like this function -- skipping the size of a range -- could be more generally useful to clients. I wonder if it should be exposed for generic ranges as well, keeping specializations for `ArrayRef<uint8_t>` and `StringRef`. Name could be `addElements` or `addRangeElements` or something?

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:314-315
+    update(std::distance(First, Last));
+    for (auto It = First; It != Last; ++It)
+      update(*It);
+    return *this;
----------------
This can call `addRangeElements()`.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:319-323
+  template <typename T>
+  std::enable_if_t<IsHashableData<T>::value &&
+                       Endianness == support::endian::system_endianness(),
+                   HashBuilder &>
+  updateRangeImpl(T *First, T *Last, std::forward_iterator_tag) {
----------------
This overload can be dropped, but `addRangeElements` would want something similar.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:86-94
+  /// Implement hashing for hashable data types, e.g. integral, enum, or
+  /// floating-point values.
+  ///
+  /// The value is simply treated like a bag of bytes.
+  template <class T>
+  std::enable_if_t<hash_builder::detail::IsHashableData<T>::value>
+  update(T Value) {
----------------
arames wrote:
> dexonsmith wrote:
> > Please use `makeArrayRef()` to avoid splitting the pointer from the size.
> > 
> > Also, I wonder if this should take into account endianness.
> Updated to use `makeArrayRef()`.
> 
> Great point. I didn't pay attention to it, but taking it into account will significantly improve the value of this class.
> Supporting endianness looks pretty straight forward thanks to `Support/Endian.h`. Please take a look.
> A couple questions:
> 1. I kept the default as `native` instead of `little` to avoid penalizing big-endian platforms. On the other hand defaulting to a stable hash may be safer. What do you think ?
> 2. To add support for user types, we end up with
> ```
> template <typename HasherT, support::endianness Endianness>
> void updateHash(HashBuilder<HasherT, Endianness> &HBuilder, const UserType &Value)
> ```
> I think that is ok. Do you prefer "reserving" the `updateHash` name and going with
> ```
> template <typename HashBuilderT>
> void updateHash(HashBuilderT &HBuilder, const UserType &Value)
> ```
> ?
> 
Just seeing this question.
- I don't think it's necessary to default to a stable hash. Clients can do that when it's useful. But if you're uncomfortable with unstable-by-default, another option is not to have a default, and then the client will need to spell out `support::native`, making it more clear that the hash is unstable. (It's easier to add a default later, but changing the default might not be easy.)
- I think the explicit parameter for `Endianness` makes it clear to authors they should be thinking about Endianness.
- There should be a name change to align with `add`. `addToHash`?

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:205-207
+  template <typename RangeT> void updateRange(const RangeT &Range) {
+    updateRange(adl_begin(Range), adl_end(Range));
+  }
----------------
arames wrote:
> dexonsmith wrote:
> > I think it's important to have `updateRange` overloads for `ArrayRef` (at least for `uint8_t`) and `StringRef` that directly call `updateBytes` rather than iterating.
> Both `ArrayRef` and `StringRef` do end up calling the version of `updateRangeImpl` with `updateBytes`, due to how their `begin` and `end` are declared.
> Do you mean to do it explicitly, in case the `ArrayRef` implementation changes ?
> Both `ArrayRef` and `StringRef` do end up calling the version of `updateRangeImpl` with `updateBytes`, due to how their `begin` and `end` are declared.
> Do you mean to do it explicitly, in case the `ArrayRef` implementation changes ?

Just seeing this question. Two reasons to do it explicitly:
- avoid compile-time overhead for this common case
- make it obvious what happens for anyone reading the code (as a common case, people are likely to care that it's doing the right thing)

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:53-56
+static auto computeHashForRange(const Ts &...Args) {
+  return static_cast<std::string>(
+      HashBuilder<HasherTAndEndianness>().updateRange(Args...).final());
+}
----------------
(Same comments as `computeHash`: drop std::string, and simplify or rename)

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:90
+
+template <typename HE, typename T> std::string hashRawData(T Data) {
+  using H = typename HE::HasherT;
----------------
This is also byte-swapping. I suggest `byteSwapAndHashRawData()`. Maybe even add "WithoutBuilder"? Or, since this is only used in one function, you could make it a lambda.

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:95-97
+  Hasher.update(llvm::makeArrayRef(
+      reinterpret_cast<const uint8_t *>(&SwappedData), sizeof(Data)));
+  return static_cast<std::string>(Hasher.final());
----------------
I (re-)discovered the static method `hash` when reviewing the other patch, which returns a `std::array`. I suggest using that instead, and directly returning `H::hash(SwappedData)`. I suggest forwarding the return array directly rather than wrapping in a `std::string`.

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:100
+
+TYPED_TEST(HashBuilderTest, ReferenceHashCheck) {
+  using HE = TypeParam;
----------------
I think the name is wrong since the change. Maybe `addHashableData`?

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:108-109
+  if (std::is_same<H, llvm::MD5>::value) {
+    static char const *const Hashes[2] = {"43a75756bd3479b839815a6c43a3e3f7",
+                                          "e8e522c9723d8e6a100364c8fc58fd1d"};
+    EXPECT_EQ(Hash, Hashes[E]);
----------------
dexonsmith wrote:
> I'm not such a fan of this reference hash / magic value idea. I'd prefer to check each of these data types separately, and rely on comparing against the result of the underlying Hasher (which we can assume is already well-tested elsewhere).
> 
> E.g.:
> ```
> lang=c++
> // helper for hashers
> template <class H, class T> auto hashRawData(T Data) {
>   H Hasher;
>   Hasher.update(makeRawDataArrayRef(makeArrayRef(
>       reinterpret_cast<uint8_t>(&Data), sizeof(Data)));
>   return H.final();
> };
> 
> // then in the test for raw integer data:
> 
> int I = 0x12345678;
> EXPECT_EQ(HashBuilder<H>.update(I).final(), hashRawData<H>(I));
> ```
> Could be each in their own test (one per type), or just in separate scopes, don't think it matters much.
> 
> Or if you think the magic numbers are better, can you walk me through your logic?
Update here looks like what I was asking for, but it's actually not clear at all how these functions are different (I probably suggested a bad name I think, sorry).

I think `computeHash` can be inlined, avoiding any ambiguity about what it does, if you take my suggestion to add `finalHash`. But if you keep it, it should probably be renamed to `hashWithBuilder` to make it clear how the hash is being computed. That'd disambiguate the other one I think?

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:38-44
+template <typename HasherT, typename... Ts>
+static auto computeHash(const Ts &...Args) {
+  HasherT Hasher;
+  HashBuilder<HasherT> HBuilder(Hasher);
+  HBuilder.update(Args...);
+  return finalHash(HBuilder.Hasher);
+}
----------------
arames wrote:
> dexonsmith wrote:
> > Seeing this boilerplate makes me wonder if the builder API could be improved with something like:
> > - Add an `Optional<HasherT>` data member and a default constructor for `HashBuilder` that constructs it.
> > - Make `update*` return `HashBuilder&`.
> > - Add a `HashBuilder::final()` that returns `HasherT::final()` (after adding it to MD5 as above).
> > 
> > Then the above logic becomes a one-liner:
> > ```
> > lang=c++
> > HashBuilder<HasherT>().update(Args...).final()
> > ```
> > (maybe still worth encapsulating in a function here for testing, but maybe not; not sure)
> Done. That makes it more usable indeed.
> I kept the helper to keep the tests concise.
If you take my suggestion to add `finalHash`, `computeHash` can be further simplified to:
```
lang=c++
return HashBuilder<HasherTAndEndianness>().add(Args...).finalHash();
```
removing the static cast.

IMO, inlining it would improve clarity, but renaming it to `hashWithBuilder` would also help.

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:81-82
+  using MD5Words = std::pair<uint64_t, uint64_t>;
+  EXPECT_EQ(finalHash(Hasher),
+            MD5Words(0x7FE040B0B5311CE2ULL, 0xBE6766F347F32726ULL));
+}
----------------
dexonsmith wrote:
> arames wrote:
> > dexonsmith wrote:
> > > It's not ideal to need to spell this out for each `HasherT`, and hardcoding the magic numbers doesn't seem quite right either. I suggest instead:
> > > - Unify the `HasherT` interfaces as necessary in prep commits to allow using them generically (maybe my other suggestions above are sufficient for that?).
> > > - Test each of these types in a standalone test in a unified `HashBuilderTest` below, dropping BasicTest/etc.
> > > - Add a helper function to the test fixture below to avoid boilerplate, something like:
> > > ```
> > > lang=c++
> > > template <class T> void checkHashableData(T Data) {
> > >   HasherT RawHasher;
> > >   auto Bytes = makeArrayRef(reinterpret_cast<uint8_t *>(&Data),...);
> > >   RawHasher.update(Bytes);
> > >   CHECK_EQ(HashBuilder<HasherT>().update(Data).final(),
> > >            RawHasher.final());
> > > }
> > > ```
> > I have cleaned this up significantly as part of the endianness support. It does not look like what you suggest, but I think it is much better than before.
> > There is now a single `ReferenceHashCheck` handling all hasher types, with the three big, little, and native endiannesses.
> > 
> > This is the only test that verifies against a hard-coded reference hash. In my opinion it is valuable to ensure the stability of hashes across platforms.
> > Other tests rely on equality or difference of different hashes.
> I'm still seeing magic numbers, although this comment seems to be in the wrong place now. See further down in ReferenceHashCheck.
Looks like you've adjusted the tests in the meantime, but since I hadn't seen this comment before, just wanted to give some more reasoning:
> In my opinion it is valuable to ensure the stability of hashes across platforms.
I agree that's a useful trait, but the tests for MD5, SHA1, and SHA256 should be doing that, respectively.

Meanwhile, the following important trait was *not* being tested: do `HashBuilder<HashT>` and `HashT` give the same result as each other for each of these raw data types? (Since fixed!)

You could have both (the above, plus the magic number test). It would uncover things such as `double` having a different representation on different platforms, or some platforms being big- or little-endian. Those both seem important, but IMO probably issues that clients of HashBuilder need to be aware of, rather than HashBuilder. But maybe there's some room for debate...

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106910/new/

https://reviews.llvm.org/D106910