[LLVMdev] We need better hashing

Talin viridia at gmail.com
Thu Feb 9 11:23:01 PST 2012


By the way, the reason I'm bringing this up is that a number of folks are
currently working on optimizing the use of hash tables within LLVM's code
base, and unless we can come up with a common hashing facility, there will
be an increasing proliferation of cut & paste copies of hash functions. So
feedback would be nice.

On Tue, Feb 7, 2012 at 10:58 PM, Talin <viridia at gmail.com> wrote:

> LLVM currently has a bunch of different hashing algorithms scattered
> throughout the code base.
>
> There are also a number of places in the code where a FoldingSetNodeID is
> created for the purpose of calculating a hash, and then discarded. From an
> efficiency standpoint, this isn't all that bad unless the number of
> individual items being hashed exceeds 32, at which point the SmallVector
> overflows and heap memory is allocated.
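>
> For illustration, the wasteful pattern looks roughly like this
> (FoldingSetNodeID and its Add*/ComputeHash members are the real LLVM
> API; the surrounding function is a hypothetical sketch):
>
>   #include "llvm/ADT/FoldingSet.h"
>
>   unsigned hashMyNode(void *Ptr, unsigned Kind) {
>     llvm::FoldingSetNodeID ID;   // backed by a SmallVector<unsigned, 32>
>     ID.AddPointer(Ptr);          // each Add* appends words to the vector
>     ID.AddInteger(Kind);
>     return ID.ComputeHash();     // hash computed, then the ID is discarded
>   }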
>
> I personally want to see a better approach to hashing because of the
> cleanup work I've been doing - I've been replacing std::map and FoldingSet
> with DenseMap in a number of places, and plan to do more of this. The thing
> is, for complex key types, you really need to have a custom DenseMapInfo,
> and that's where having a good hash function comes in.
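>
> To make that concrete, here is a minimal sketch of a custom DenseMapInfo
> for a hypothetical compound key (the key type and the hash mix are
> invented for illustration; the four members are the ones DenseMap
> actually requires):
>
>   #include "llvm/ADT/DenseMap.h"
>   #include <cstdint>
>
>   struct MyKey {                 // hypothetical compound key
>     void *Ptr;
>     unsigned Kind;
>   };
>
>   namespace llvm {
>   template <> struct DenseMapInfo<MyKey> {
>     static MyKey getEmptyKey()     { MyKey K = { (void *)-1, 0 }; return K; }
>     static MyKey getTombstoneKey() { MyKey K = { (void *)-2, 0 }; return K; }
>     static unsigned getHashValue(const MyKey &K) {
>       // Ad hoc mix; this is exactly the part a common facility should own.
>       return unsigned((uintptr_t)K.Ptr >> 4) ^ (K.Kind * 37);
>     }
>     static bool isEqual(const MyKey &A, const MyKey &B) {
>       return A.Ptr == B.Ptr && A.Kind == B.Kind;
>     }
>   };
>   } // namespace llvm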
>
> There are a bunch of hash functions out there (FNV1, SuperFastHash, and
> many others). The best overall hash function that I am currently aware of
> is Austin Appleby's MurmurHash3 (http://code.google.com/p/smhasher/).
>
> For LLVM's use, we want a hash function that can handle mixed data - that
> is, pointers, ints, strings, and so on. Most of the high-performance hash
> functions will work well on mixed data types, but you have to put
> everything into a flat buffer - that is, an array of machine words whose
> starting address is aligned on a machine-word boundary. The inner loops of
> these hash functions are designed to exploit instruction-level parallelism,
> and feeding in values one at a time can cost a lot of speed. (I am not an
> expert in this area, so feel free to correct me on this point.) On the
> other hand, if your input values aren't already arranged in a flat buffer,
> the cost of writing them to memory has to be taken into account.
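>
> As a hypothetical sketch of the flat-buffer approach: pack the mixed
> values into an aligned array of words, then hash the whole array in one
> call (hashWords is an invented stand-in for the chosen block hash, with
> a trivial FNV-style body just to keep the example self-contained):
>
>   #include <cstdint>
>
>   static unsigned hashWords(const uint64_t *Data, unsigned N) {
>     uint64_t H = 0;
>     for (unsigned I = 0; I < N; ++I)
>       H = (H ^ Data[I]) * 0x100000001B3ULL;  // placeholder mix only
>     return unsigned(H >> 32) ^ unsigned(H);
>   }
>
>   unsigned hashMixed(const void *Ptr, uint32_t A, uint32_t B) {
>     uint64_t Buf[2];                     // word-aligned flat buffer
>     Buf[0] = (uint64_t)(uintptr_t)Ptr;   // pointer widened to a full word
>     Buf[1] = ((uint64_t)A << 32) | B;    // two 32-bit ints packed together
>     return hashWords(Buf, 2);            // one pass over the whole buffer
>   }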
>
> Also, most of the maps in LLVM are fairly small (<1000 entries), so the
> speed of the hash function itself is probably more important than getting
> the best possible mixing of bits.
>
> It seems that for LLVM's purposes, something that has an interface similar
> to FoldingSetNodeID would make for an easy transition. One approach would
> be to start with something very much like FoldingSetNodeID, except with a
> fixed-length buffer instead of a SmallVector - the idea is that when you
> are about to overflow, instead of allocating more space, you would compute
> an intermediate hash value and then start over at the beginning of the
> buffer.
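>
> A minimal sketch of that idea (HashBuilder and hashWords are hypothetical
> names, and the FNV-style mix is a placeholder for a real block hash such
> as MurmurHash3):
>
>   #include <cstdint>
>
>   static unsigned hashWords(const uint64_t *Data, unsigned N,
>                             unsigned Seed) {
>     uint64_t H = Seed ^ 0x9E3779B97F4A7C15ULL;  // placeholder mix only
>     for (unsigned I = 0; I < N; ++I)
>       H = (H ^ Data[I]) * 0x100000001B3ULL;
>     return unsigned(H >> 32) ^ unsigned(H);
>   }
>
>   class HashBuilder {
>     uint64_t Buf[32];   // fixed-length buffer; never touches the heap
>     unsigned Size;
>     unsigned Seed;      // carries the intermediate hash between folds
>
>     void fold() {       // about to overflow: hash what we have so far
>       Seed = hashWords(Buf, Size, Seed);
>       Size = 0;         // ...and start over at the front of the buffer
>     }
>
>   public:
>     HashBuilder() : Size(0), Seed(0) {}
>     void add(uint64_t V) {
>       if (Size == 32)
>         fold();
>       Buf[Size++] = V;
>     }
>     void add(const void *P) { add((uint64_t)(uintptr_t)P); }
>     unsigned finish() { return hashWords(Buf, Size, Seed); }
>   };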
>
> Another question is whether or not you would want to replace the hash
> functions in DenseMapInfo, which are designed to be efficient for very
> small keys - most of the high-performance hash functions have a fairly
> substantial fixed overhead (usually in the form of a final mixing step) and
> thus only make sense for larger key sizes.
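>
> To make that trade-off concrete: DenseMapInfo's pointer hash today is
> essentially two shifts and an xor, while MurmurHash3's 32-bit finalizer
> alone is five multiply/shift/xor steps, a fixed cost that dominates for
> a one-word key:
>
>   #include <cstdint>
>
>   // Roughly DenseMapInfo<T*>::getHashValue: two shifts and an xor.
>   unsigned cheapPtrHash(const void *P) {
>     return unsigned((uintptr_t)P >> 4) ^ unsigned((uintptr_t)P >> 9);
>   }
>
>   // MurmurHash3's 32-bit finalizer: the fixed mixing overhead that is
>   // paid even when the key is a single word.
>   uint32_t fmix32(uint32_t H) {
>     H ^= H >> 16;
>     H *= 0x85EBCA6BU;
>     H ^= H >> 13;
>     H *= 0xC2B2AE35U;
>     H ^= H >> 16;
>     return H;
>   }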
>
> --
> -- Talin
>



-- 
-- Talin