[LLVMdev] We need better hashing

Sat Feb 18 00:00:12 PST 2012

On Fri, Feb 17, 2012 at 10:58 PM, Talin <viridia at gmail.com> wrote:

> However, I really do need an incremental hash for the various uniquing
> maps which I'm attempting to optimize. Take for example the case of
> uniquing a constant array - the key consists of a type* pointer and an
> array of constant*. Those data fields are not stored contiguously in
> memory, so I need to hash them separately and then combine the hashes.
> Being able to hash the data fields in place (as opposed to copying them to
> a contiguous buffer) turns out to be a fairly significant win for the
> uniquing maps - otherwise you end up having to do a malloc just to look up
> a key, and that's going to be slower than any incremental hash algorithm.

I think you have a different idea of what 'incremental hash' means from
what I do, and that's leading to the confusion, because we're talking about
two very different things.

The term "incremental hash function" usually is a term of art referring to
a hash function with the following property:

Given a series of bytes (or other units if you like) in 's', and using a
python-like syntax:

 H(s) = H(H(s[:-1]), s[-1])

This means that you can re-use a hash for a smaller chunk of data when
computing the hash for a larger chunk. I don't think this is what you're
looking for, and I know that most efficient, high-quality,
non-cryptographic hash functions don't satisfy this property.

What you're talking about is just being able to hash a non-contiguous set
of data. That is clearly important, but there are a lot of ways to achieve
it.

Most of the functions I'm referring to are essentially block based. For
example, CityHash is based on a 64-byte block hashing system. While this
can necessitate copying the data, it should never require a malloc. You
simply fill a block, and flush it out. The block can be on the stack, and
modern hashing algorithms will find it much faster to memcpy mall
(pointer-sized) bits of data into a single contiguous N-byte (64 in the
case of city) block first, so I don't think this "costs" us anything, we
actually want to copy the data in, and do the hashing algorithm once.

Note that the current CityHash algorithm (especially in its current
implementation) isn't really setup for this, but we've talked to the author
of it as well as Austin when we were designing the standard interface, and
they seemed very positive on being able to adapt these high-quality hashing
algorithms to the proposed interface.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120218/72da270c/attachment.html>