[LLVMdev] We need better hashing

Tue Feb 14 22:47:54 PST 2012

On Tue, Feb 14, 2012 at 2:44 AM, Chris Lattner <clattner at apple.com> wrote:

> On Feb 13, 2012, at 2:00 AM, Talin wrote:
>
> Just out of curiosity, why not MurmurHash3 ? This page seems to
>> suggest that #2 has some flaw, and #3 is better all round:
>>
>> https://sites.google.com/site/murmurhash/
>>
>> The main reason is because there's no incremental version of 3.
>
>
> I think that that is a great reason.
>
> LLVM's needs, on the other hand, are fairly modest. I'm guessing that most
> DenseMaps contain less than a few thousand entries. Even a bad hash
> function wouldn't hurt performance that much, and the time taken to
> calculate the hash is probably more of a factor in overall performance than
> getting the optimum distribution of hash values.
>
>
> Indeed.  It can't be hard to be better than our existing adhoc stuff :).
>  That said, please do not change the hash function used by StringMap
> without do careful performance analysis of the clang preprocessor.  The
> identifier uniquing is a very hot path in "clang -E" performance.
>
> I'm not planning on touching StringMap.

>
>> Would it be possible to use CityHash instead for strings?
>>
>> http://code.google.com/p/cityhash/
>>
>> OK by me. My intention is that "Hashing.h" should contain a variety of
> different hashing algorithms for various specific needs. Right now, I am
> mainly focusing on hashing of mixed data types - that is, you have some
> structure which contains pointers, ints, strings, and you want to calculate
> a hash of the entire struct. I need this because I'm working on improving
> the uniquing of constants and similar data structures. My next target is to
> improve the uniquing of MDNodes, but I want to get the hashing stuff
> squared away first.
>
> It is also my intent that some person who is more of an expert in hashing
> than I am can do detailed performance analysis under real-world loads (such
> as compiling actual programs with clang) and tweak and optimize the hashing
> algorithm, without needing to modify the API and/or all of the places that
> call it.
>
>
> I think that this is a great way to stage it.  I personally care more
> about the interface than the implementation.  Someone can tweak and tune it
> after enough code is using the API to get reasonable performance numbers.
>
>
> #include "llvm/ADT/ArrayRef.h"
> #include "llvm/ADT/StringRef.h"
> #include "llvm/Support/Compiler.h"
> #include "llvm/Support/DataTypes.h"
> #include "llvm/Support/PointerLikeTypeTraits.h"
> #include "llvm/Support/type_traits.h"
>
> Do you actually need all of these includes?  PointerLikeTypeTraits doesn't
> seem necessary.  Is type_traits?
>
> Ooops, this was a cut & paste error from FoldingSet.cpp.

>   enum {
>     BufferSize = 32,
>
> BufferSize is dead.
>
>
>  /// Add a pointer value
>   template<typename T>
>   void add(const T *PtrVal) {
>     addImpl(
>         reinterpret_cast<const uint32_t *>(&PtrVal),
>         reinterpret_cast<const uint32_t *>(&PtrVal + 1));
>   }
>
> This violates TBAA rules and looks pretty dangerous to expose as public
> API.  Is this really needed?  Also, addImpl is dereferencing the pointers
> as uint32_t's, but there is nothing that guarantees that T is a multiple of
> 4 bytes.  The ArrayRef version has the same problem.
>
> So as far as hashing pointer values is concerned, I was just copying the
code from FoldingSet. Since a lot of the keys that we're going to be
dealing with are uniqued pointers, it makes sense to be able to calculate a
hash of the bit-value of the pointer, rather than hashing the thing pointed
to. That being said, renaming it to "addPointer" and adding a comment might
be in order. Similarly, I can make the ArrayRef version 'addPointers' and
have it take an ArrayRef<T*>.

Now, as to the 4 bytes issue, I think I can solve that with some clever
template methods.

> Though it is more verbose, I think it would be better to expose a template
> specialization approach to getting the hash_value of T.
>
>   /// Add a float
>   void add(float Data) {
>     addImpl(
>       reinterpret_cast<const uint32_t *>(&Data),
>       reinterpret_cast<const uint32_t *>(&Data + 1));
>   }
>
>   /// Add a double
>   void add(double Data) {
>     addImpl(
>       reinterpret_cast<const uint32_t *>(&Data),
>       reinterpret_cast<const uint32_t *>(&Data + 1));
>   }
>
> Similarly, these need to go through a union to avoid TBAA problems.
>
> I'm not sure how that works. Can you give an example?

>
>  void add(StringRef StrVal) {
>     addImpl(StrVal.data(), StrVal.size());
>   }
>
> I'm contradicting my stance above about not caring about the
> implementation :), but is MurmurHash a good hash for string data?
>  The Bernstein hash function works really well and is much cheaper to
> compute than Murmur.  It is used by HashString (and thus by StringMap).
>
> So, MurmurHash is intended for blocks of arbitrary binary data, which may
contain character data, integers, or whatever - it's designed to do such a
thorough job of mixing the bits that it really doesn't matter what data
types you feed it. You are correct that for purely string data, you'd want
to use a less expensive algorithm (I'm partial to FNV-1, which is as cheap
as the Bernstein hash and is AFAIK more mathematically sound.)

>   // Add a possibly unaligned sequence of bytes.
>   void addImpl(const char *I, size_t Length) {
>
> This should probably be moved out of line to avoid code bloat.
>

OK

>
> Overall, the design of the class is making sense to me!  Thanks for
> tackling this!
>
> -Chris
>
>
>

-- 
-- Talin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120214/4dc54ce6/attachment.html>