[LLVMdev] We need better hashing

Sun Feb 12 16:59:07 PST 2012

Here's my latest version of Hashing.h, which I propose to add to llvm/ADT.
Comments welcome and encouraged.

On Thu, Feb 9, 2012 at 11:23 AM, Talin <viridia at gmail.com> wrote:

> By the way, the reason I'm bringing this up is that a number of folks are
> currently working on optimizing the use of hash tables within LLVM's code
> base, and unless we can come up with a common hashing facility, there will
> be an increasing proliferation of cut & paste copies of hash functions. So
> feedback would be nice.
>
>
> On Tue, Feb 7, 2012 at 10:58 PM, Talin <viridia at gmail.com> wrote:
>
>> LLVM currently has a bunch of different hashing algorithms scattered
>> throughout the code base.
>>
>> There's also a number of places in the code where a FoldingSetNodeID is
>> created for the purpose of calculating a hash, and then discarded. From an
>> efficiency standpoint, this isn't all that bad unless the number of
>> individual items being hashed > 32, at which point the SmallVector
>> overflows and memory is allocated.
>>
>> I personally want to see a better approach to hashing because of the
>> cleanup work I've been doing - I've been replacing std::map and FoldingSet
>> with DenseMap in a number of places, and plan to do more of this. The thing
>> is, for complex key types, you really need to have a custom DenseMapInfo,
>> and that's where having a good hash function comes in.
>>
>> There are a bunch of hash functions out there (FNV1, SuperFastHash, and
>> many others). The best overall hash function that I am currently aware of
>> is Austin Appleby's MurmurHash3 (http://code.google.com/p/smhasher/).
>>
>> For LLVM's use, we want a hash function that can handle mixed data - that
>> is, pointers, ints, strings, and so on. Most of the high-performance hash
>> functions will work well on mixed data types, but you have to put
>> everything into a flat buffer - that is, an array of machine words whose
>> starting address is aligned on a machine-word boundary. The inner loops of
>> the hash functions are designed to take advantage of parallelism of the
>> instruction pipeline, and if you try feeding in values one at a time it's
>> possible that you can lose a lot of speed. (Although I am not an expert in
>> this area, so feel free to correct me on this point.) On the other hand, if
>> your input values aren't already arranged into a flat buffer, the cost of
>> writing them to memory has to be taken into account.
>>
>> Also, most of the maps in LLVM are fairly small (<1000 entries), so the
>> speed of the hash function itself is probably more important than getting
>> the best possible mixing of bits.
>>
>> It seems that for LLVM's purposes, something that has an interface
>> similar to FoldingSetNodeID would make for an easy transition. One approach
>> would be to start with something very much like FoldingSetNodeID, except
>> with a fixed-length buffer instead of a SmallVector - the idea is that when
>> you are about to overflow, instead of allocating more space, you would
>> compute an intermediate hash value and then start over at the beginning of
>> the buffer.
>>
>> Another question is whether or not you would want to replace the hash
>> functions in DenseMapInfo, which are designed to be efficient for very
>> small keys - most of the high-performance hash functions have a fairly
>> substantial fixed overhead (usually in the form of a final mixing step) and
>> thus only make sense for larger key sizes.
>>
>> --
>> -- Talin
>>
>
>
>
> --
> -- Talin
>

-- 
-- Talin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120212/664503d3/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Hashing.h
Type: text/x-chdr
Size: 4038 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120212/664503d3/attachment.h>