[LLVMdev] We need better hashing

Fri Feb 17 01:32:20 PST 2012

On Feb 17, 2012, at 12:26 AM, Talin wrote:
> OK here's a patch with the latest, including unit tests. I've also tried to make the comments clear that this is intended for the case of "generic" key types, where you are either hashing multiple data types together, or you don't know in advance what the data type will be - for cases such as strings where a tailored algorithm is available, you should use that instead of this.

Makes sense.

+  /// Add a pointer value.
+  /// Note: this adds pointers to the hash using sizes and endianness that
+  /// depend on the host.  It doesn't matter however, because hashing on
+  /// pointer values is inherently unstable.

This makes perfect sense.

+  /// Add an ArrayRef of arbitrary data.
+  template<typename T>
+  GeneralHash& add(ArrayRef<T> ArrayVal) {
+    addBits(ArrayVal.begin(), ArrayVal.end());
+    return *this;
+  }

Doesn't this have the same host-specificity problem, except that it will cause things that *are* stable to vary, such as arrays of char, or is the alignment check enough?

+  /// Add a float
+  GeneralHash& add(float Data) {

It is worth adding a comment here that this does a bitwise hash, so -0.0 and +0.0 will hash to different values even though they compare equal and two identical NaN's will hash to the same value even though they compare unequal.

The mix logic is inherently a series of 32-bit operations.  Would it be possible and make more sense to implement them as 64-bit operations?  64-bit hosts are the vast majority of the machines that run a compiler these days.  OTOH, making it depend on the host brings up host instability issues.  

Actually, how much do we care about host instability here?  If this is used by hashing containers, we can just declare that iteration order is undefined even for non-pointer values.  The real problem I guess is that we'd have to go check out all the existing DenseMap's etc to make sure people aren't doing well-defined iteration right now and fixing the code.  What do you think?

> There's a couple of things I don't like: First, there's too many levels of inlined function calls - my experience is that compilers aren't as good at inlining nested calls as is often advertised, however I couldn't figure out a way to get the same effect without the nested calls.

Is there a specific observed concern here (e.g. when built with Clang) or is this a historical problem?  Compilers have gotten much better about inlining stuff that is actually small, if Clang handles it then I think we're good.  Marking these attribute(always_inline) is massive overkill.

Overall the code is looking great.  I'd recommend checking in the new API separately from switching Constants.cpp to use it though (just so that any problems doesn't cause both to get reverted).

-Chris