[LLVMdev] We need better hashing

Chris Lattner clattner at apple.com
Tue Feb 14 02:44:36 PST 2012


On Feb 13, 2012, at 2:00 AM, Talin wrote:
> Just out of curiosity, why not MurmurHash3 ? This page seems to
> suggest that #2 has some flaw, and #3 is better all round:
> 
> https://sites.google.com/site/murmurhash/
> 
> The main reason is because there's no incremental version of 3.

I think that that is a great reason.

> LLVM's needs, on the other hand, are fairly modest. I'm guessing that most DenseMaps contain less than a few thousand entries. Even a bad hash function wouldn't hurt performance that much, and the time taken to calculate the hash is probably more of a factor in overall performance than getting the optimum distribution of hash values.

Indeed.  It can't be hard to be better than our existing adhoc stuff :).  That said, please do not change the hash function used by StringMap without do careful performance analysis of the clang preprocessor.  The identifier uniquing is a very hot path in "clang -E" performance.

> 
> Would it be possible to use CityHash instead for strings?
> 
> http://code.google.com/p/cityhash/
> 
> OK by me. My intention is that "Hashing.h" should contain a variety of different hashing algorithms for various specific needs. Right now, I am mainly focusing on hashing of mixed data types - that is, you have some structure which contains pointers, ints, strings, and you want to calculate a hash of the entire struct. I need this because I'm working on improving the uniquing of constants and similar data structures. My next target is to improve the uniquing of MDNodes, but I want to get the hashing stuff squared away first.
> 
> It is also my intent that some person who is more of an expert in hashing than I am can do detailed performance analysis under real-world loads (such as compiling actual programs with clang) and tweak and optimize the hashing algorithm, without needing to modify the API and/or all of the places that call it.

I think that this is a great way to stage it.  I personally care more about the interface than the implementation.  Someone can tweak and tune it after enough code is using the API to get reasonable performance numbers.


#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/DataTypes.h"
#include "llvm/Support/PointerLikeTypeTraits.h"
#include "llvm/Support/type_traits.h"

Do you actually need all of these includes?  PointerLikeTypeTraits doesn't seem necessary.  Is type_traits?

  enum {
    BufferSize = 32,

BufferSize is dead.


 /// Add a pointer value
  template<typename T>
  void add(const T *PtrVal) {
    addImpl(
        reinterpret_cast<const uint32_t *>(&PtrVal),
        reinterpret_cast<const uint32_t *>(&PtrVal + 1));
  }

This violates TBAA rules and looks pretty dangerous to expose as public API.  Is this really needed?  Also, addImpl is dereferencing the pointers as uint32_t's, but there is nothing that guarantees that T is a multiple of 4 bytes.  The ArrayRef version has the same problem.

Though it is more verbose, I think it would be better to expose a template specialization approach to getting the hash_value of T.

  /// Add a float
  void add(float Data) {
    addImpl(
      reinterpret_cast<const uint32_t *>(&Data),
      reinterpret_cast<const uint32_t *>(&Data + 1));
  }

  /// Add a double
  void add(double Data) {
    addImpl(
      reinterpret_cast<const uint32_t *>(&Data),
      reinterpret_cast<const uint32_t *>(&Data + 1));
  }

Similarly, these need to go through a union to avoid TBAA problems.



 void add(StringRef StrVal) {
    addImpl(StrVal.data(), StrVal.size());
  }

I'm contradicting my stance above about not caring about the implementation :), but is MurmurHash a good hash for string data?  The Bernstein hash function works really well and is much cheaper to compute than Murmur.  It is used by HashString (and thus by StringMap).

  // Add a possibly unaligned sequence of bytes.
  void addImpl(const char *I, size_t Length) {

This should probably be moved out of line to avoid code bloat.

Overall, the design of the class is making sense to me!  Thanks for tackling this!

-Chris


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120214/eca27ace/attachment.html>


More information about the llvm-dev mailing list