[cfe-dev] RFC: change string hash function in libc++

Daniel James dnljms at gmail.com
Sun Dec 4 12:07:19 PST 2011


On 4 December 2011 17:50, Howard Hinnant <hhinnant at apple.com> wrote:
> On Dec 4, 2011, at 6:11 AM, Daniel James wrote:
>
>> IIUC these hash functions are designed to be used with containers
>> whose number of buckets is a power of 2 which can be faster since they
>> use bit operations to select the bucket. So it might be worth testing
>> for that case as well as the current implementation. Doing that in an
>> STL container would require some template thing to determine which
>> strategy to use. I don't know if you would consider that acceptable.
>
> The libc++ std::hash are being designed/tested only with the libc++ unordered containers.  The libc++ containers restrict the number of buckets to a prime (which the client can select).  Perhaps I'm misunderstanding your point?

Probably not; I was mostly thinking out loud. My main point was that these
hash functions seem to be optimised for a different use case. That
doesn't mean they aren't good for this use case, but the benefits
are smaller.

>> Another metric would be to place a set of elements in the container,
>> and then measure the number of lookups required for a separate set of
>> elements. Using real data is also a good idea; randomly generated data
>> tends to hash to random buckets.
>
> So far I've just been testing hash<arithmetic types>, though by the end of the day I hope to have moved on to hash<string>.  I'm open to concrete suggestions on data sets I should be testing.

For one example, I once read a paper (which I can't find now) that used
IP addresses from server logs. I think it hashed them as strings, but
converting IP addresses to integer values would also be an interesting
test. IP addresses tend to have patterns that you don't see in random
data because of the way they're allocated.

> One of the decisions I'm making based on this test is what to do with two (or four) size_t's when I need to combine them into one.  What's checked in is:
>
>        return __u.__a ^ __u.__b;

That does seem pretty bad. Testing with a random distribution won't
catch the potential problems with something like that. For integers
it's quite easy to come up with a realistic example of where it could
go wrong. Say you're using it to hash 64-bit integers down to 32-bit
values. Someone comes along and stores 32-bit co-ordinate pairs in
64-bit values (i.e. x * 2^32 + y). Now (x, y) will always hash to the
same value as (y, x), and whenever x == y, (x, y) will hash to 0.

> But I was wondering if it would be better to:
>
>        return __murmur2<size_t>()(&__u, sizeof(__u));

That's actually a bit buggy for floats. On some platforms sizeof is
larger than the actual size of the floating point type: it includes
padding for alignment, which can contain junk values (very common for
long double). It also isn't always guaranteed that equal floating
point values have the same representation (although IIRC IEEE
guarantees it apart from weird edge cases like 0, -0 and maybe the
infinities). Sometimes you get really weird representations, such as
a long double stored as the sum of two doubles. I think Apple used
to do that for PowerPC Macs.

You can see my attempt at a portable hash function for floats in boost
(boost/functional/hash/detail/hash_float_*.hpp), but it isn't great;
I'm sure an expert could do better. I only used a binary hash for one
particular setup where it was needed. I considered using it elsewhere,
but it would have required too much maintenance. It's such an odd use
case that I'm not sure anyone has ever used it.

> My results so far are indicating that for arithmetic types the use of murmur isn't significantly better than a simple xor.  I'm also not seeing any motivation at this time to run integral types with sizeof <= sizeof(size_t) through something like murmur, instead of just:
>
>    return static_cast<size_t>(__v);

This is worth a read:

http://www.concentric.net/~ttwang/tech/inthash.htm

> I'm looking at this because of http://llvm.org/bugs/show_bug.cgi?id=10971 .

His test seems very contrived. All hash functions are going to have
worst cases; the question is how likely they are to turn up.




More information about the cfe-dev mailing list