[LLVMdev] We need better hashing

Sat Feb 18 00:58:19 PST 2012

On Sat, Feb 18, 2012 at 12:00 AM, Chandler Carruth <chandlerc at google.com>wrote:

> On Fri, Feb 17, 2012 at 10:58 PM, Talin <viridia at gmail.com> wrote:
>
>> However, I really do need an incremental hash for the various uniquing
>> maps which I'm attempting to optimize. Take for example the case of
>> uniquing a constant array - the key consists of a type* pointer and an
>> array of constant*. Those data fields are not stored contiguously in
>> memory, so I need to hash them separately and then combine the hashes.
>> Being able to hash the data fields in place (as opposed to copying them to
>> a contiguous buffer) turns out to be a fairly significant win for the
>> uniquing maps - otherwise you end up having to do a malloc just to look up
>> a key, and that's going to be slower than any incremental hash algorithm.
>
>
> I think you have a different idea of what 'incremental hash' means from
> what I do, and that's leading to the confusion, because we're talking about
> two very different things.
>
> The term "incremental hash function" usually is a term of art referring to
> a hash function with the following property:
>
> Given a series of bytes (or other units if you like) in 's', and using a
> python-like syntax:
>
>  H(s) = H(H(s[:-1]), s[-1])
>
> This means that you can re-use a hash for a smaller chunk of data when
> computing the hash for a larger chunk. I don't think this is what you're
> looking for, and I know that most efficient, high-quality,
> non-cryptographic hash functions don't satisfy this property.
>
>
> What you're talking about is just being able to hash a non-contiguous set
> of data. That is clearly important, but there are a lot of ways to achieve
> it.
>
> Most of the functions I'm referring to are essentially block based. For
> example, CityHash is based on a 64-byte block hashing system. While this
> can necessitate copying the data, it should never require a malloc. You
> simply fill a block, and flush it out. The block can be on the stack, and
> modern hashing algorithms will find it much faster to memcpy mall
> (pointer-sized) bits of data into a single contiguous N-byte (64 in the
> case of city) block first, so I don't think this "costs" us anything, we
> actually want to copy the data in, and do the hashing algorithm once.
>
> Note that the current CityHash algorithm (especially in its current
> implementation) isn't really setup for this, but we've talked to the author
> of it as well as Austin when we were designing the standard interface, and
> they seemed very positive on being able to adapt these high-quality hashing
> algorithms to the proposed interface.
>

OK. I actually coded something like this in an earlier incarnation of the
patch, and at some point it kind of got optimized away. But that's
unimportant - what I am primarily interested in is the interface, not the
hashing function per se. (Note that the interface didn't change much when I
converted from a block-based version to the current one. Which is not
surprising, as the current interface is essentially isomorphic to the one
used by FoldingSet.)

One of the things I'm focusing on right now is taking old container classes
that were written before the advent of ArrayRef and modernizing them so
that they don't do so much allocating and copying of keys. All of this
hashing stuff is merely yak shaving as far as I am concerned - if someone
has a better idea I'm open to it, as long as they understand what my
requirements are, and the fact that my work on the containers is kind of
blocked until this gets resolved.

-- 
-- Talin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120218/85bd28d0/attachment.html>