[PATCH] D106910: [Support]: Introduce the `HashBuilder` interface.

Mon Aug 23 10:44:29 PDT 2021

arames marked 18 inline comments as done.
arames added a comment.

Thanks - again - for the comments. Please take a look at the update.

The major changes are around:

- Renaming `update*` to `add*`
- Refactoring `HashBuilder` into `HashBuilderBase`, `HashBuilderImpl`, and `HashBuilder`.

However I am pushing back on a number of suggestions that would impose requirement on the hasher type. I think there is a lot of value in keeping the interface trivial to use for user-defined/other hasher types.
That includes the suggested `finalHash`, and supporting ctors with arguments, and using `enable_if` to conditionally forward methods like `final()` and `result()`, so that we can rely on the single `HasherT::update(ArrayRef<uint8_t>)` method. I think that is a fair implementation cost for the value it brings.
I have added a test for a custom minimal hasher.

Let me know what you think. I'm always open to discuss and update.

Cheers!

In D106910#2948072 <https://reviews.llvm.org/D106910#2948072>, @dexonsmith wrote:

> I think this is getting close... I've found a few new things, and I apologize for not seeing them sooner.
>
> Two high-level points, then more comments inline.
>
> Firstly, I'm a bit concerned about the semantics of `update(ArrayRef)` (and `update(StringRef)` by proxy) due to inconsistency with `HasherT`.
>
> - `HasherT::update(ArrayRef<uint8_t>)` hashes the raw bytes pointed to by the array.
> - `HashBuilder<HasherT>::update(ArrayRef)` (including for `uint8_t`) hashes the size first, then the raw bytes.
>
> This inconsistency could be confusing and error-prone.
>
> I wonder if we should (mostly) avoid the word `update` in `HashBuilder` and use `add` instead. This also fixes the `updateRange` problem I point out inline (`addRange` works just fine without a preposition). We could leave `update()` around *just* for `ArrayRef<uint8_t>` (and maybe `StringRef`), forwarding to `addElements`/`addRangeElements` (see inline). The result would be:
>
> - `HashBuilder<HasherT>::update` does exactly the same thing as `HasherT::update` (also working with StringRef, for convenience)
> - `HashBuilder<HasherT>::add` (and `add*`) handles higher-level semantics, including primitives, endianness, ranges, etc.

I did not consider the issue. But I like the proposal, as it removes potential confusions.

> Secondly, a base class that lacks the `Endianness` template parameter should probably be factored out to avoid bloating code once this starts getting used. I think there should be three layers.
>
> - `HashBuilderBase` can take just a `HasherT` parameter, and implement `update` (new meaning, matching `HasherT::update`), `final()`, and all other things that don't care about endianness.
> - `HashBuilderImpl` layers the `add*` family on top and takes an `Endianness` parameter. But `native` is illegal here.
> - `HashBuilder` canonicalizes `native` to `system_endianness()` to avoid code bloat, something like:

Done.
Doing so requires templated CRTP, as we need to return the reference to our derived class and keep track of `HasherT`/endianness. It looks slightly convoluted at the template declaration site, but is neat otherwise otherwise.

>   template <typename HasherT, support::endianness Endianness>
>   class HashBuilderImpl {
>     static_assert(Endianness != native, "HashBuilder should canonicalize endianness");
>     // etc.
>   };
>   
>   template <typename HasherT, support::Endianness Endianness = native,
>                               support::Endianness EffectiveEndianness =
>                                   Endianness == native
>                                      ? support::endian::system_endianness()
>                                      : Endianness>
>   class HashBuilder
>       : HashBuilderImpl<HasherT, EffectiveEndianness> {
>     using HashBuilderImplT = HashBuilderImpl<HasherT, EffectiveEndianness>;
>   
>   public:
>     HashBuilder() = default;
>     HashBuilder(HasherT &Hasher) : HashBuilderImplT(Hasher) {}
>   };

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:80
+      : std::integral_constant<bool, is_integral_or_enum<U>::value ||
+                                         std::is_floating_point<U>::value> {};
+
----------------
dexonsmith wrote:
> I wonder if `std::is_floating_point` should be excluded.
> - Float values might generally a bit iffy to be included in a stable hash, because of all the floating point modes that can affect their values.
> - `long double` has a platform-dependent mantissa.
> 
> This would disable `HashBuilder::add`, forcing clients to think about what they actually want to do with it. WDYT?
That sounds reasonable to me.
Forcing the hash would still be easy via other scalar types.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:87-90
+  template <typename... ArgTypes>
+  explicit HashBuilder(ArgTypes &&...Args)
+      : OptionalHasher(in_place, std::forward<ArgTypes>(Args)...),
+        Hasher(*OptionalHasher) {}
----------------
dexonsmith wrote:
> Can this be simplified?
> ```
> lang=c++
> HashBuilder() : OptionalHasher(in_place), Hasher(*OptionalHasher) {}
> ```
See my top-level comment.
I think we should keep support for ctors with arguments (e.g. for a seed).

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:96-98
+  auto final() {
+    return Hasher.final();
+  }
----------------
dexonsmith wrote:
> I made some comments on the patch this one depends on (https://reviews.llvm.org/D107781 -- FYI, I linked them using the add parent/child revision button above).
> 
> I suggest the following to better align with the hashers themselves:
> - Remove the template / enable_if. Require hashers to provide this.
> - Return `StringRef`. Expect hashers to do the same.
> - Add a `result()` method also returning StringRef, matching the one from hashers.
> 
> I also think an analogue for the static `hash()` that the hashers have would be useful, at least in the tests. Maybe:
> ```
> lang=c++
> using HashArrayT = decltype(HasherT::hash(ArrayRef<uint8_t>()));
> static HashArrayT toArrayImpl(StringRef HashString) {
>   assert(HashString.size() == sizeof(HashArrayT));
>   HashArrayT HashArray;
>   llvm::copy(HashString, HashArray.begin());
>   return HashArray;
> }
> 
> HashArrayT finalHash() { return toArrayImpl(Hasher.final()); }
> HashArrayT resultHash() { return toArrayImpl(Hasher.result()); }
> ```
> 
> I *don't* think there should be a `HashBuilder::hash()` function (static or otherwise). Forwarding to `HashBuilder::add()` would be confusing / error-prone, since then `HashBuilder<HasherT>::hash(ArrayRef<uint8_t>)` would do something different from `HasherT::hash()`. Forwarding to `HasherT::hash` would also be confusing, and doesn't really add any value since clients can already do that themselves.
Added a forward to `result()`.

Per my top-level reply, I think we should keep the `final()` and `result()` forwards conditional on their existence in `HasherT`, to keep the use of this interface with a custom hasher type trivial.

I am not convinced by the `finalHash()` and `resultHash()` helpers should be members of `HashBuilder`. I may be missing context on how it is used, so feel free to push if you think the use-case is really valuable.
I would assume that the return type of `HasherT::final()` is appropriate to represent the hash. It seems bothersome to have to provide these helpers in the interface, making additional assumptions (or more SFINAE code) about the hasher type.

Looking at other related comments, I think a lot of this discussion stems from the fact that `MD5`, `SHA1`, and `SHA256` `final()` methods return a reference (`StringRef`).
It seems to me this discussion would be simplified if there was a method like `finalByValue()`.

My suggestion would be to keep relying on the return type for `final()`, and have "type converting" wrappers implemented outside of `HasherT` or `HashBuilder` for now.
If you like the `finalByValue()` idea, I could do it in a successor patch.

On this topic, I find the design of the `::hash` functions join on the issue. Being `static`, they return a result by value. Though both strings and `array<uint8_t>` are of course pretty close, I find `std::array` is a bit surprising choice when `final()` and `result()` return `StringRef`. So the question will be, what should be the return type for `finalByValue()` for these hashes. `std::array` like `hash()`, or a `SmallString` maybe.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:100-107
+  /// Implement hashing for hashable data types, e.g. integral, enum, or
+  /// floating-point values.
+  template <typename T>
+  std::enable_if_t<IsHashableData<T>::value, HashBuilder &> update(T Value) {
+    Value = support::endian::byte_swap(Value, Endianness);
+    return updateBytes(
+        makeArrayRef(reinterpret_cast<const uint8_t *>(&Value), sizeof(Value)));
----------------
dexonsmith wrote:
> I wonder if the body of this should be exposed as a general helper for clients, using a weaker `std::enable_if` check. I.e.:
> - Keep the signature and `std::enable_if` check for this overload of `add`.
> - Move the body to a new function, called something like `addRawBytes`, which takes a `const T&` and has a weaker `std::enable_if` check (maybe `std::is_primitive`? or what does `endian::byte_swap` support?).
> - Change `add` to call the new function.
> 
> WDYT?
Sounds good.

Done, with enablement automatically whenever the type is handled by `support::endian::byte_swap()`.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:284
+  /// contraints.
+  HashBuilder &updateBytes(ArrayRef<uint8_t> Data) {
+    Hasher.update(Data);
----------------
dexonsmith wrote:
> Similar naming problem here (I'm pretty sure it's my fault), but worse, since I now realize there's nothing inherent about the word "bytes" that implies the size of the range is skipped. Someone could have a container of bytes, and expect that this encoded the container.
> 
> Stepping back, I feel like this function -- skipping the size of a range -- could be more generally useful to clients. I wonder if it should be exposed for generic ranges as well, keeping specializations for `ArrayRef<uint8_t>` and `StringRef`. Name could be `addElements` or `addRangeElements` or something?
I like it. Done.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:314-315
+    update(std::distance(First, Last));
+    for (auto It = First; It != Last; ++It)
+      update(*It);
+    return *this;
----------------
dexonsmith wrote:
> This can call `addRangeElements()`.
This has become `addRangeElementsImpl`. Now, `addRange` simply does `add(size); addRangeElements(...);`.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:319-323
+  template <typename T>
+  std::enable_if_t<IsHashableData<T>::value &&
+                       Endianness == support::endian::system_endianness(),
+                   HashBuilder &>
+  updateRangeImpl(T *First, T *Last, std::forward_iterator_tag) {
----------------
dexonsmith wrote:
> This overload can be dropped, but `addRangeElements` would want something similar.
This has become `addRangeElementsImpl`.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:86-94
+  /// Implement hashing for hashable data types, e.g. integral, enum, or
+  /// floating-point values.
+  ///
+  /// The value is simply treated like a bag of bytes.
+  template <class T>
+  std::enable_if_t<hash_builder::detail::IsHashableData<T>::value>
+  update(T Value) {
----------------
dexonsmith wrote:
> arames wrote:
> > dexonsmith wrote:
> > > Please use `makeArrayRef()` to avoid splitting the pointer from the size.
> > > 
> > > Also, I wonder if this should take into account endianness.
> > Updated to use `makeArrayRef()`.
> > 
> > Great point. I didn't pay attention to it, but taking it into account will significantly improve the value of this class.
> > Supporting endianness looks pretty straight forward thanks to `Support/Endian.h`. Please take a look.
> > A couple questions:
> > 1. I kept the default as `native` instead of `little` to avoid penalizing big-endian platforms. On the other hand defaulting to a stable hash may be safer. What do you think ?
> > 2. To add support for user types, we end up with
> > ```
> > template <typename HasherT, support::endianness Endianness>
> > void updateHash(HashBuilder<HasherT, Endianness> &HBuilder, const UserType &Value)
> > ```
> > I think that is ok. Do you prefer "reserving" the `updateHash` name and going with
> > ```
> > template <typename HashBuilderT>
> > void updateHash(HashBuilderT &HBuilder, const UserType &Value)
> > ```
> > ?
> > 
> Just seeing this question.
> - I don't think it's necessary to default to a stable hash. Clients can do that when it's useful. But if you're uncomfortable with unstable-by-default, another option is not to have a default, and then the client will need to spell out `support::native`, making it more clear that the hash is unstable. (It's easier to add a default later, but changing the default might not be easy.)
> - I think the explicit parameter for `Endianness` makes it clear to authors they should be thinking about Endianness.
> - There should be a name change to align with `add`. `addToHash`?
I'll drop the default value for `Endianness`. The template is simple enough for the rare use-cases, and simple enough to wrap if the lack of default is ever a problem for a user.

================
Comment at: llvm/include/llvm/Support/HashBuilder.h:205-207
+  template <typename RangeT> void updateRange(const RangeT &Range) {
+    updateRange(adl_begin(Range), adl_end(Range));
+  }
----------------
dexonsmith wrote:
> arames wrote:
> > dexonsmith wrote:
> > > I think it's important to have `updateRange` overloads for `ArrayRef` (at least for `uint8_t`) and `StringRef` that directly call `updateBytes` rather than iterating.
> > Both `ArrayRef` and `StringRef` do end up calling the version of `updateRangeImpl` with `updateBytes`, due to how their `begin` and `end` are declared.
> > Do you mean to do it explicitly, in case the `ArrayRef` implementation changes ?
> > Both `ArrayRef` and `StringRef` do end up calling the version of `updateRangeImpl` with `updateBytes`, due to how their `begin` and `end` are declared.
> > Do you mean to do it explicitly, in case the `ArrayRef` implementation changes ?
> 
> Just seeing this question. Two reasons to do it explicitly:
> - avoid compile-time overhead for this common case
> - make it obvious what happens for anyone reading the code (as a common case, people are likely to care that it's doing the right thing)
Done, with comments.

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:53-56
+static auto computeHashForRange(const Ts &...Args) {
+  return static_cast<std::string>(
+      HashBuilder<HasherTAndEndianness>().updateRange(Args...).final());
+}
----------------
dexonsmith wrote:
> (Same comments as `computeHash`: drop std::string, and simplify or rename)
Renamed to `hashRangeWithBuilder`.

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:90
+
+template <typename HE, typename T> std::string hashRawData(T Data) {
+  using H = typename HE::HasherT;
----------------
dexonsmith wrote:
> This is also byte-swapping. I suggest `byteSwapAndHashRawData()`. Maybe even add "WithoutBuilder"? Or, since this is only used in one function, you could make it a lambda.
Done. I was not familiar with using lambdas with `auto` for that type of use-case.
Also renamed it to `ByteSwapAndHashWithHasher` for clarity.

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:95-97
+  Hasher.update(llvm::makeArrayRef(
+      reinterpret_cast<const uint8_t *>(&SwappedData), sizeof(Data)));
+  return static_cast<std::string>(Hasher.final());
----------------
dexonsmith wrote:
> I (re-)discovered the static method `hash` when reviewing the other patch, which returns a `std::array`. I suggest using that instead, and directly returning `H::hash(SwappedData)`. I suggest forwarding the return array directly rather than wrapping in a `std::string`.
I agree here a single call would be nice.
The change would require the update of previous helpers to return `decltype(H::hash)`, that I am reluctant to do.

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:108-109
+  if (std::is_same<H, llvm::MD5>::value) {
+    static char const *const Hashes[2] = {"43a75756bd3479b839815a6c43a3e3f7",
+                                          "e8e522c9723d8e6a100364c8fc58fd1d"};
+    EXPECT_EQ(Hash, Hashes[E]);
----------------
dexonsmith wrote:
> dexonsmith wrote:
> > I'm not such a fan of this reference hash / magic value idea. I'd prefer to check each of these data types separately, and rely on comparing against the result of the underlying Hasher (which we can assume is already well-tested elsewhere).
> > 
> > E.g.:
> > ```
> > lang=c++
> > // helper for hashers
> > template <class H, class T> auto hashRawData(T Data) {
> >   H Hasher;
> >   Hasher.update(makeRawDataArrayRef(makeArrayRef(
> >       reinterpret_cast<uint8_t>(&Data), sizeof(Data)));
> >   return H.final();
> > };
> > 
> > // then in the test for raw integer data:
> > 
> > int I = 0x12345678;
> > EXPECT_EQ(HashBuilder<H>.update(I).final(), hashRawData<H>(I));
> > ```
> > Could be each in their own test (one per type), or just in separate scopes, don't think it matters much.
> > 
> > Or if you think the magic numbers are better, can you walk me through your logic?
> Update here looks like what I was asking for, but it's actually not clear at all how these functions are different (I probably suggested a bad name I think, sorry).
> 
> I think `computeHash` can be inlined, avoiding any ambiguity about what it does, if you take my suggestion to add `finalHash`. But if you keep it, it should probably be renamed to `hashWithBuilder` to make it clear how the hash is being computed. That'd disambiguate the other one I think?
This ended up with explicit names `ByteSwapAndHashWithHasher` and `hashWithBuilder`.

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:38-44
+template <typename HasherT, typename... Ts>
+static auto computeHash(const Ts &...Args) {
+  HasherT Hasher;
+  HashBuilder<HasherT> HBuilder(Hasher);
+  HBuilder.update(Args...);
+  return finalHash(HBuilder.Hasher);
+}
----------------
dexonsmith wrote:
> arames wrote:
> > dexonsmith wrote:
> > > Seeing this boilerplate makes me wonder if the builder API could be improved with something like:
> > > - Add an `Optional<HasherT>` data member and a default constructor for `HashBuilder` that constructs it.
> > > - Make `update*` return `HashBuilder&`.
> > > - Add a `HashBuilder::final()` that returns `HasherT::final()` (after adding it to MD5 as above).
> > > 
> > > Then the above logic becomes a one-liner:
> > > ```
> > > lang=c++
> > > HashBuilder<HasherT>().update(Args...).final()
> > > ```
> > > (maybe still worth encapsulating in a function here for testing, but maybe not; not sure)
> > Done. That makes it more usable indeed.
> > I kept the helper to keep the tests concise.
> If you take my suggestion to add `finalHash`, `computeHash` can be further simplified to:
> ```
> lang=c++
> return HashBuilder<HasherTAndEndianness>().add(Args...).finalHash();
> ```
> removing the static cast.
> 
> IMO, inlining it would improve clarity, but renaming it to `hashWithBuilder` would also help.
Per my other comment, I am not convinced by the `finalHash()` method.
Renamed to `hashWithBuilder`.

================
Comment at: llvm/unittests/Support/HashBuilderTest.cpp:81-82
+  using MD5Words = std::pair<uint64_t, uint64_t>;
+  EXPECT_EQ(finalHash(Hasher),
+            MD5Words(0x7FE040B0B5311CE2ULL, 0xBE6766F347F32726ULL));
+}
----------------
dexonsmith wrote:
> dexonsmith wrote:
> > arames wrote:
> > > dexonsmith wrote:
> > > > It's not ideal to need to spell this out for each `HasherT`, and hardcoding the magic numbers doesn't seem quite right either. I suggest instead:
> > > > - Unify the `HasherT` interfaces as necessary in prep commits to allow using them generically (maybe my other suggestions above are sufficient for that?).
> > > > - Test each of these types in a standalone test in a unified `HashBuilderTest` below, dropping BasicTest/etc.
> > > > - Add a helper function to the test fixture below to avoid boilerplate, something like:
> > > > ```
> > > > lang=c++
> > > > template <class T> void checkHashableData(T Data) {
> > > >   HasherT RawHasher;
> > > >   auto Bytes = makeArrayRef(reinterpret_cast<uint8_t *>(&Data),...);
> > > >   RawHasher.update(Bytes);
> > > >   CHECK_EQ(HashBuilder<HasherT>().update(Data).final(),
> > > >            RawHasher.final());
> > > > }
> > > > ```
> > > I have cleaned this up significantly as part of the endianness support. It does not look like what you suggest, but I think it is much better than before.
> > > There is now a single `ReferenceHashCheck` handling all hasher types, with the three big, little, and native endiannesses.
> > > 
> > > This is the only test that verifies against a hard-coded reference hash. In my opinion it is valuable to ensure the stability of hashes across platforms.
> > > Other tests rely on equality or difference of different hashes.
> > I'm still seeing magic numbers, although this comment seems to be in the wrong place now. See further down in ReferenceHashCheck.
> Looks like you've adjusted the tests in the meantime, but since I hadn't seen this comment before, just wanted to give some more reasoning:
> > In my opinion it is valuable to ensure the stability of hashes across platforms.
> I agree that's a useful trait, but the tests for MD5, SHA1, and SHA256 should be doing that, respectively.
> 
> Meanwhile, the following important trait was *not* being tested: do `HashBuilder<HashT>` and `HashT` give the same result as each other for each of these raw data types? (Since fixed!)
> 
> You could have both (the above, plus the magic number test). It would uncover things such as `double` having a different representation on different platforms, or some platforms being big- or little-endian. Those both seem important, but IMO probably issues that clients of HashBuilder need to be aware of, rather than HashBuilder. But maybe there's some room for debate...
I did adjust the tests following your suggestion.

> Those both seem important, but IMO probably issues that clients of HashBuilder need to be aware of, rather than HashBuilder. But maybe there's some room for debate...

I agree. We would not want to start seeing the `HashBuilder` tests failing across different platforms because of reasons that are known (and maybe already correctly handled). So let's correctly separate the responsibilities.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D106910/new/

https://reviews.llvm.org/D106910