[LLVMdev] We need better hashing

Talin viridia at gmail.com
Tue Feb 7 22:58:20 PST 2012


LLVM currently has a bunch of different hashing algorithms scattered
throughout the code base.

There are also a number of places in the code where a FoldingSetNodeID is
created solely to calculate a hash, and then discarded. From an
efficiency standpoint this isn't all that bad, unless the number of
individual items being hashed exceeds 32, at which point the SmallVector
inside the FoldingSetNodeID overflows and memory is allocated.
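
To make that pattern concrete, here is a rough sketch of the kind of code
I mean (the function and its arguments are made up for illustration; the
FoldingSetNodeID calls are the real API):

    #include "llvm/ADT/FoldingSet.h"

    unsigned hashNodeOperands(const void *Opcode, unsigned NumOps,
                              const void *const *Ops) {
      llvm::FoldingSetNodeID ID; // backed by a SmallVector<unsigned, 32>
      ID.AddPointer(Opcode);
      for (unsigned i = 0; i != NumOps; ++i)
        ID.AddPointer(Ops[i]);   // past 32 words, this heap-allocates
      return ID.ComputeHash();   // the ID is then thrown away
    }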

I personally want to see a better approach to hashing because of the
cleanup work I've been doing - I've been replacing std::map and FoldingSet
with DenseMap in a number of places, and plan to do more of this. The thing
is, for complex key types, you really need to have a custom DenseMapInfo,
and that's where having a good hash function comes in.
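
For example, here's the shape of a custom DenseMapInfo for a composite key
(the key type and the hash mix are made up; only the four required members
are the real interface):

    #include "llvm/ADT/DenseMap.h"
    #include <stdint.h>

    struct SigKey {                      // hypothetical composite key
      const void *Type;
      unsigned Flags;
    };

    namespace llvm {
    template <> struct DenseMapInfo<SigKey> {
      static SigKey getEmptyKey() {
        SigKey K = { reinterpret_cast<const void *>(~uintptr_t(0)), 0 };
        return K;
      }
      static SigKey getTombstoneKey() {
        SigKey K = { reinterpret_cast<const void *>(~uintptr_t(1)), 0 };
        return K;
      }
      static unsigned getHashValue(const SigKey &K) {
        // placeholder mix - this is exactly where a good shared hash
        // function would be plugged in
        return unsigned(uintptr_t(K.Type) * 37) ^ K.Flags;
      }
      static bool isEqual(const SigKey &A, const SigKey &B) {
        return A.Type == B.Type && A.Flags == B.Flags;
      }
    };
    } // namespace llvm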

There are a bunch of hash functions out there (FNV-1, SuperFastHash, and
many others). The best overall hash function that I am currently aware of
is Austin Appleby's MurmurHash3 (http://code.google.com/p/smhasher/).

For LLVM's use, we want a hash function that can handle mixed data - that
is, pointers, ints, strings, and so on. Most of the high-performance hash
functions will work well on mixed data types, but you have to put
everything into a flat buffer first - that is, an array of machine words
whose starting address is aligned on a machine-word boundary. The inner
loops of these hash functions are designed to exploit instruction-level
parallelism, and if you feed in values one at a time you can lose a lot
of speed. (Although I am not an expert in this area, so feel free to
correct me on this point.) On the other hand, if your input values aren't
already arranged in a flat buffer, the cost of writing them out to memory
has to be taken into account.
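
As a strawman, the flat-buffer approach looks something like this
(MurmurHash3_x86_32 is the real smhasher entry point; the packing scheme
and the fixed buffer size are just for illustration, with no overflow
checking):

    #include "MurmurHash3.h" // from the smhasher project linked above
    #include <stdint.h>
    #include <string.h>

    uint32_t hashMixed(const void *Ptr, uint32_t Int,
                       const char *Str, size_t Len) {
      // Pack everything into a flat, word-aligned buffer first, then
      // hash it in a single call so the inner loop stays pipelined.
      uint32_t Buf[64]; // assumes the inputs fit; no overflow check
      size_t N = 0;
      memcpy(&Buf[N], &Ptr, sizeof(Ptr));
      N += sizeof(Ptr) / sizeof(uint32_t);
      Buf[N++] = Int;
      size_t Words = (Len + 3) / 4; // round string up to whole words
      if (Words) {
        Buf[N + Words - 1] = 0;     // zero-pad the tail word
        memcpy(&Buf[N], Str, Len);
        N += Words;
      }
      uint32_t Hash;
      MurmurHash3_x86_32(Buf, int(N * sizeof(uint32_t)), 0, &Hash);
      return Hash;
    }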

Also, most of the maps in LLVM are fairly small (<1000 entries), so the
speed of the hash function itself is probably more important than getting
the best possible mixing of bits.

It seems that for LLVM's purposes, something that has an interface similar
to FoldingSetNodeID would make for an easy transition. One approach would
be to start with something very much like FoldingSetNodeID, except with a
fixed-length buffer instead of a SmallVector - the idea is that when you
are about to overflow, instead of allocating more space, you would compute
an intermediate hash value and then start over at the beginning of the
buffer.
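
In rough code, the idea is something like this (every name here is
invented, and the FNV-style mix is just a stand-in for whatever block
hash we settle on):

    #include <stdint.h>
    #include <stddef.h>

    class HashBuilder { // hypothetical - not an existing LLVM class
      uint32_t Buf[32]; // fixed-length; never grows
      size_t N;
      uint32_t Interim;

      void flush() {
        // Fold the buffer into the running hash and start over; a real
        // version would run the chosen block hash over Buf here.
        for (size_t i = 0; i != N; ++i)
          Interim = (Interim ^ Buf[i]) * 0x01000193u; // FNV-style mix
        N = 0;
      }

    public:
      HashBuilder() : N(0), Interim(0x811C9DC5u) {}

      void AddInteger(uint32_t V) {
        if (N == 32)
          flush(); // about to overflow: fold instead of allocating
        Buf[N++] = V;
      }
      void AddPointer(const void *P) {
        uint64_t V = (uintptr_t)P;
        AddInteger(uint32_t(V));
        if (sizeof(void *) > 4)
          AddInteger(uint32_t(V >> 32));
      }
      unsigned ComputeHash() {
        flush();
        return Interim;
      }
    };

The AddInteger/AddPointer/ComputeHash names deliberately mirror
FoldingSetNodeID, so existing call sites could switch over with little
churn.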

Another question is whether you would want to replace the hash functions
in DenseMapInfo, which are designed to be efficient for very small keys -
most of the high-performance hash functions have a fairly substantial
fixed overhead (usually in the form of a final mixing step) and thus only
pay off for larger key sizes.
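
To illustrate the gap: the existing DenseMapInfo pointer hash is roughly
a couple of shifts and an xor, while MurmurHash3's 32-bit finalizer alone
(reproduced below from the smhasher source) costs two multiplies and
three shift-xors per key:

    #include <stdint.h>

    // Roughly what DenseMapInfo<T*>::getHashValue does today - nearly
    // free, which is hard to beat for one-word keys.
    unsigned cheapPointerHash(const void *P) {
      uintptr_t V = (uintptr_t)P;
      return unsigned(V >> 4) ^ unsigned(V >> 9);
    }

    // MurmurHash3's 32-bit finalizer: excellent bit mixing, but this
    // fixed cost is paid even when the key is a single word.
    uint32_t fmix32(uint32_t H) {
      H ^= H >> 16;
      H *= 0x85EBCA6Bu;
      H ^= H >> 13;
      H *= 0xC2B2AE35u;
      H ^= H >> 16;
      return H;
    }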

-- 
-- Talin