[LLVMdev] We need better hashing

Fri Feb 17 22:58:38 PST 2012

On Fri, Feb 17, 2012 at 1:53 AM, Chandler Carruth <chandlerc at google.com>wrote:

> Jeffrey and I are working on future standard library functionality for
> hashing user defined types:
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3333.html
>
> I would much rather have an interface that is close to or mirrors this
> one. We already have some field experience with it, and using it in LLVM
> and Clang would provide more. Also, it would be possible to essentially
> share code between such an implementation and libc++.
>
> We looked closely at 'hasher' objects and using add methods on them and
> they tended to have some serious drawbacks:
>
> 1) they require some amount of "incrementality", limiting the quality and
> performance of the hashing algorithm
> 2) they require more boiler plate
> 3) they compose recursively less cleanly
>
> Even given your interface, there is no actual requirement for an
> incremental hash. Simply store intermediate state in the object, and
> provide a 'finalize' step that produces the final hash code.
>

I'm not sure I understand what you are proposing here. I don't know what
you mean by "intermediate state". However, I really do need an incremental
hash for the various uniquing maps which I'm attempting to optimize. Take
for example the case of uniquing a constant array - the key consists of a
type* pointer and an array of constant*. Those data fields are not stored
contiguously in memory, so I need to hash them separately and then combine
the hashes. Being able to hash the data fields in place (as opposed to
copying them to a contiguous buffer) turns out to be a fairly significant
win for the uniquing maps - otherwise you end up having to do a malloc just
to look up a key, and that's going to be slower than any incremental hash
algorithm.

> On Sun, Feb 12, 2012 at 4:59 PM, Talin <viridia at gmail.com> wrote:
>
>> Here's my latest version of Hashing.h, which I propose to add to
>> llvm/ADT. Comments welcome and encouraged.
>>
>>
>> On Thu, Feb 9, 2012 at 11:23 AM, Talin <viridia at gmail.com> wrote:
>>
>>> By the way, the reason I'm bringing this up is that a number of folks
>>> are currently working on optimizing the use of hash tables within LLVM's
>>> code base, and unless we can come up with a common hashing facility, there
>>> will be an increasing proliferation of cut & paste copies of hash
>>> functions. So feedback would be nice.
>>>
>>>
>>> On Tue, Feb 7, 2012 at 10:58 PM, Talin <viridia at gmail.com> wrote:
>>>
>>>> LLVM currently has a bunch of different hashing algorithms scattered
>>>> throughout the code base.
>>>>
>>>> There's also a number of places in the code where a FoldingSetNodeID is
>>>> created for the purpose of calculating a hash, and then discarded. From an
>>>> efficiency standpoint, this isn't all that bad unless the number of
>>>> individual items being hashed > 32, at which point the SmallVector
>>>> overflows and memory is allocated.
>>>>
>>>> I personally want to see a better approach to hashing because of the
>>>> cleanup work I've been doing - I've been replacing std::map and FoldingSet
>>>> with DenseMap in a number of places, and plan to do more of this. The thing
>>>> is, for complex key types, you really need to have a custom DenseMapInfo,
>>>> and that's where having a good hash function comes in.
>>>>
>>>> There are a bunch of hash functions out there (FNV1, SuperFastHash, and
>>>> many others). The best overall hash function that I am currently aware of
>>>> is Austin Appleby's MurmurHash3 (http://code.google.com/p/smhasher/).
>>>>
>>>> For LLVM's use, we want a hash function that can handle mixed data -
>>>> that is, pointers, ints, strings, and so on. Most of the high-performance
>>>> hash functions will work well on mixed data types, but you have to put
>>>> everything into a flat buffer - that is, an array of machine words whose
>>>> starting address is aligned on a machine-word boundary. The inner loops of
>>>> the hash functions are designed to take advantage of parallelism of the
>>>> instruction pipeline, and if you try feeding in values one at a time it's
>>>> possible that you can lose a lot of speed. (Although I am not an expert in
>>>> this area, so feel free to correct me on this point.) On the other hand, if
>>>> your input values aren't already arranged into a flat buffer, the cost of
>>>> writing them to memory has to be taken into account.
>>>>
>>>> Also, most of the maps in LLVM are fairly small (<1000 entries), so the
>>>> speed of the hash function itself is probably more important than getting
>>>> the best possible mixing of bits.
>>>>
>>>> It seems that for LLVM's purposes, something that has an interface
>>>> similar to FoldingSetNodeID would make for an easy transition. One approach
>>>> would be to start with something very much like FoldingSetNodeID, except
>>>> with a fixed-length buffer instead of a SmallVector - the idea is that when
>>>> you are about to overflow, instead of allocating more space, you would
>>>> compute an intermediate hash value and then start over at the beginning of
>>>> the buffer.
>>>>
>>>> Another question is whether or not you would want to replace the hash
>>>> functions in DenseMapInfo, which are designed to be efficient for very
>>>> small keys - most of the high-performance hash functions have a fairly
>>>> substantial fixed overhead (usually in the form of a final mixing step) and
>>>> thus only make sense for larger key sizes.
>>>>
>>>> --
>>>> -- Talin
>>>>
>>>
>>>
>>>
>>> --
>>> -- Talin
>>>
>>
>>
>>
>> --
>> -- Talin
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
>

-- 
-- Talin
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120217/5e4332d5/attachment.html>