[LLVMdev] We need better hashing

Mon Feb 13 02:00:22 PST 2012

On Mon, Feb 13, 2012 at 1:22 AM, Jay Foad <jay.foad at gmail.com> wrote:

> On 13 February 2012 00:59, Talin <viridia at gmail.com> wrote:
> > Here's my latest version of Hashing.h, which I propose to add to
> llvm/ADT.
> > Comments welcome and encouraged.
>
> > /// Adapted from MurmurHash2 by Austin Appleby
>
> Just out of curiosity, why not MurmurHash3 ? This page seems to
> suggest that #2 has some flaw, and #3 is better all round:
>
> https://sites.google.com/site/murmurhash/
>
> The main reason is because there's no incremental version of 3. If you
look at the source, you'll notice that the very first thing that 3 does is
Hash ^= Length, but for the incremental case we don't know the length until
we're done. Also, 2 has fewer instructions per block hashed than 3; 3 also
requires bit rotations, whereas 2 only uses bit shifts.

Bear in mind that the "flaw" in MurmurHash2 is a fairly esoteric case which
(I suspect) LLVM is unlikely to ever encounter in practice. Austin is
trying to develop the best possible hash for a wide range of key
probability distributions, and his testing methodologies are quite strict.

LLVM's needs, on the other hand, are fairly modest. I'm guessing that most
DenseMaps contain less than a few thousand entries. Even a bad hash
function wouldn't hurt performance that much, and the time taken to
calculate the hash is probably more of a factor in overall performance than
getting the optimum distribution of hash values.

Would it be possible to use CityHash instead for strings?
>
> http://code.google.com/p/cityhash/
>
> OK by me. My intention is that "Hashing.h" should contain a variety of
different hashing algorithms for various specific needs. Right now, I am
mainly focusing on hashing of mixed data types - that is, you have some
structure which contains pointers, ints, strings, and you want to calculate
a hash of the entire struct. I need this because I'm working on improving
the uniquing of constants and similar data structures. My next target is to
improve the uniquing of MDNodes, but I want to get the hashing stuff
squared away first.

It is also my intent that some person who is more of an expert in hashing
than I am can do detailed performance analysis under real-world loads (such
as compiling actual programs with clang) and tweak and optimize the hashing
algorithm, without needing to modify the API and/or all of the places that
call it.

> Thanks,
> Jay.
>

-- 
-- Talin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120213/5531b894/attachment.html>