[LLVMdev] We need better hashing

Tue Feb 14 23:32:11 PST 2012

On Tue, Feb 14, 2012 at 10:47 PM, Talin <viridia at gmail.com> wrote:

> On Tue, Feb 14, 2012 at 2:44 AM, Chris Lattner <clattner at apple.com> wrote:
>
>> On Feb 13, 2012, at 2:00 AM, Talin wrote:
>>
>> Just out of curiosity, why not MurmurHash3 ? This page seems to
>>> suggest that #2 has some flaw, and #3 is better all round:
>>>
>>> https://sites.google.com/site/murmurhash/
>>>
>>> The main reason is because there's no incremental version of 3.
>>
>>
>> I think that that is a great reason.
>>
>> LLVM's needs, on the other hand, are fairly modest. I'm guessing that
>> most DenseMaps contain less than a few thousand entries. Even a bad hash
>> function wouldn't hurt performance that much, and the time taken to
>> calculate the hash is probably more of a factor in overall performance than
>> getting the optimum distribution of hash values.
>>
>>
>> Indeed.  It can't be hard to be better than our existing adhoc stuff :).
>>  That said, please do not change the hash function used by StringMap
>> without do careful performance analysis of the clang preprocessor.  The
>> identifier uniquing is a very hot path in "clang -E" performance.
>>
>> I'm not planning on touching StringMap.
>
>>
>>> Would it be possible to use CityHash instead for strings?
>>>
>>> http://code.google.com/p/cityhash/
>>>
>>> OK by me. My intention is that "Hashing.h" should contain a variety of
>> different hashing algorithms for various specific needs. Right now, I am
>> mainly focusing on hashing of mixed data types - that is, you have some
>> structure which contains pointers, ints, strings, and you want to calculate
>> a hash of the entire struct. I need this because I'm working on improving
>> the uniquing of constants and similar data structures. My next target is to
>> improve the uniquing of MDNodes, but I want to get the hashing stuff
>> squared away first.
>>
>> It is also my intent that some person who is more of an expert in hashing
>> than I am can do detailed performance analysis under real-world loads (such
>> as compiling actual programs with clang) and tweak and optimize the hashing
>> algorithm, without needing to modify the API and/or all of the places that
>> call it.
>>
>>
>> I think that this is a great way to stage it.  I personally care more
>> about the interface than the implementation.  Someone can tweak and tune it
>> after enough code is using the API to get reasonable performance numbers.
>>
>>
>> #include "llvm/ADT/ArrayRef.h"
>> #include "llvm/ADT/StringRef.h"
>> #include "llvm/Support/Compiler.h"
>> #include "llvm/Support/DataTypes.h"
>> #include "llvm/Support/PointerLikeTypeTraits.h"
>> #include "llvm/Support/type_traits.h"
>>
>> Do you actually need all of these includes?  PointerLikeTypeTraits
>> doesn't seem necessary.  Is type_traits?
>>
>> Ooops, this was a cut & paste error from FoldingSet.cpp.
>
>
>>   enum {
>>     BufferSize = 32,
>>
>> BufferSize is dead.
>>
>>
>>  /// Add a pointer value
>>   template<typename T>
>>   void add(const T *PtrVal) {
>>     addImpl(
>>         reinterpret_cast<const uint32_t *>(&PtrVal),
>>         reinterpret_cast<const uint32_t *>(&PtrVal + 1));
>>   }
>>
>> This violates TBAA rules and looks pretty dangerous to expose as public
>> API.  Is this really needed?  Also, addImpl is dereferencing the pointers
>> as uint32_t's, but there is nothing that guarantees that T is a multiple of
>> 4 bytes.  The ArrayRef version has the same problem.
>>
>> So as far as hashing pointer values is concerned, I was just copying the
> code from FoldingSet. Since a lot of the keys that we're going to be
> dealing with are uniqued pointers, it makes sense to be able to calculate a
> hash of the bit-value of the pointer, rather than hashing the thing pointed
> to. That being said, renaming it to "addPointer" and adding a comment might
> be in order. Similarly, I can make the ArrayRef version 'addPointers' and
> have it take an ArrayRef<T*>.
>
> Also, something I forgot to mention. Since it's possible to convert any
single T into an ArrayRef of size 1, then technically you could just have
one method that accepts an ArrayRef<T> which would work for all cases - all
of the other methods are just optimizations. In which case, my question is
do you still need a union? In other words, would the following work as
expected?

   void add(float Data) {
     addArray(ArrayRef<float>(Data));
   }

> Now, as to the 4 bytes issue, I think I can solve that with some clever
> template methods.
>
>
>> Though it is more verbose, I think it would be better to expose a
>> template specialization approach to getting the hash_value of T.
>>
>>   /// Add a float
>>   void add(float Data) {
>>     addImpl(
>>       reinterpret_cast<const uint32_t *>(&Data),
>>       reinterpret_cast<const uint32_t *>(&Data + 1));
>>   }
>>
>>   /// Add a double
>>   void add(double Data) {
>>     addImpl(
>>       reinterpret_cast<const uint32_t *>(&Data),
>>       reinterpret_cast<const uint32_t *>(&Data + 1));
>>   }
>>
>> Similarly, these need to go through a union to avoid TBAA problems.
>>
>> I'm not sure how that works. Can you give an example?
>
>>
>>  void add(StringRef StrVal) {
>>     addImpl(StrVal.data(), StrVal.size());
>>   }
>>
>> I'm contradicting my stance above about not caring about the
>> implementation :), but is MurmurHash a good hash for string data?
>>  The Bernstein hash function works really well and is much cheaper to
>> compute than Murmur.  It is used by HashString (and thus by StringMap).
>>
>> So, MurmurHash is intended for blocks of arbitrary binary data, which may
> contain character data, integers, or whatever - it's designed to do such a
> thorough job of mixing the bits that it really doesn't matter what data
> types you feed it. You are correct that for purely string data, you'd want
> to use a less expensive algorithm (I'm partial to FNV-1, which is as cheap
> as the Bernstein hash and is AFAIK more mathematically sound.)
>
>
>>   // Add a possibly unaligned sequence of bytes.
>>   void addImpl(const char *I, size_t Length) {
>>
>> This should probably be moved out of line to avoid code bloat.
>>
>
> OK
>
>>
>> Overall, the design of the class is making sense to me!  Thanks for
>> tackling this!
>>
>> -Chris
>>
>>
>>
>
>
> --
> -- Talin
>

-- 
-- Talin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120214/38a645ca/attachment.html>