[LLVMdev] We need better hashing

Wed Feb 15 12:01:04 PST 2012

On Feb 14, 2012, at 10:47 PM, Talin wrote:
>  /// Add a pointer value
>   template<typename T>
>   void add(const T *PtrVal) {
>     addImpl(
>         reinterpret_cast<const uint32_t *>(&PtrVal),
>         reinterpret_cast<const uint32_t *>(&PtrVal + 1));
>   }
> 
> This violates TBAA rules and looks pretty dangerous to expose as public API.  Is this really needed?  Also, addImpl is dereferencing the pointers as uint32_t's, but there is nothing that guarantees that T is a multiple of 4 bytes.  The ArrayRef version has the same problem.
> 
> So as far as hashing pointer values is concerned, I was just copying the code from FoldingSet. Since a lot of the keys that we're going to be dealing with are uniqued pointers, it makes sense to be able to calculate a hash of the bit-value of the pointer, rather than hashing the thing pointed to. That being said, renaming it to "addPointer" and adding a comment might be in order. Similarly, I can make the ArrayRef version 'addPointers' and have it take an ArrayRef<T*>.

Ah, Jay was right, I misread this code!

> Though it is more verbose, I think it would be better to expose a template specialization approach to getting the hash_value of T.
> 
>   /// Add a float
>   void add(float Data) {
>     addImpl(
>       reinterpret_cast<const uint32_t *>(&Data),
>       reinterpret_cast<const uint32_t *>(&Data + 1));
>   }
> 
>   /// Add a double
>   void add(double Data) {
>     addImpl(
>       reinterpret_cast<const uint32_t *>(&Data),
>       reinterpret_cast<const uint32_t *>(&Data + 1));
>   }
> 
> Similarly, these need to go through a union to avoid TBAA problems.
> 
> I'm not sure how that works. Can you give an example?

Just use:

void add(double Data) {
  union {
     double D; uint64_t I;
  };
  D = Data;  
  add(I);
}

> 
>  void add(StringRef StrVal) {
>     addImpl(StrVal.data(), StrVal.size());
>   }
> 
> I'm contradicting my stance above about not caring about the implementation :), but is MurmurHash a good hash for string data?  The Bernstein hash function works really well and is much cheaper to compute than Murmur.  It is used by HashString (and thus by StringMap).
> 
> So, MurmurHash is intended for blocks of arbitrary binary data, which may contain character data, integers, or whatever - it's designed to do such a thorough job of mixing the bits that it really doesn't matter what data types you feed it. You are correct that for purely string data, you'd want to use a less expensive algorithm (I'm partial to FNV-1, which is as cheap as the Bernstein hash and is AFAIK more mathematically sound.)

Ok, so what's the answer? :)   We can do different things for ArrayRef<char> and StringRef.

-Chris

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120215/d43c6f03/attachment.html>