[cfe-dev] RFC: change string hash function in libc++

Howard Hinnant hhinnant at apple.com
Sun Dec 4 16:12:07 PST 2011


On Dec 4, 2011, at 3:07 PM, Daniel James wrote:

> On 4 December 2011 17:50, Howard Hinnant <hhinnant at apple.com> wrote:
>> On Dec 4, 2011, at 6:11 AM, Daniel James wrote:
>> 
>>> IIUC these hash functions are designed to be used with containers
>>> whose number of buckets is a power of 2 which can be faster since they
>>> use bit operations to select the bucket. So it might be worth testing
>>> for that case as well as the current implementation. Doing that in an
>>> STL container would require some template thing to determine which
>>> strategy to use. I don't know if you would consider that acceptable.
>> 
>> The libc++ std::hash are being designed/tested only with the libc++ unordered containers.  The libc++ containers restrict the number of buckets to a prime (which the client can select).  Perhaps I'm misunderstanding your point?
> 
> Probably not, I was mostly blathering on. My main point was that these
> hash functions seem to be optimised for a different use case. That
> doesn't mean they aren't good for this use case, but there are fewer
> benefits.

Now that I've studied the problem more, I think I understand you better here.  An identity hash function for a small integral type is a lot less appealing when your hash container has a power of 2 number of buckets.  The identity hash function collides "more often" in that environment.

>> One of the decisions I'm making based on this test is what to do with two (or four) size_t's when I need to combine them into one.  What's checked in is:
>> 
>>        return __u.__a ^ __u.__b;
> 
> That does seem pretty bad. Using a random distribution won't catch the
> potential problems with something like that. For integers it's quite
> easy to come up with a realistic example of where that could go wrong.
> Say you're using it to hash 64-bit integers to 32-bit values. Someone
> comes along and stores 32-bit co-ordinate pairs in 64-bit values (i.e.
> x * 2^32 + y). Now (x,y) will always hash to the same value as (y,x),
> When x=y, (x,y) will always hash to 0.

I've been convinced to use murmur2 when I need to combine 2, 3, or 4 size_t's into a single one, except for the case of long double on x86.  My rationale is that long double patterns are not likely to accidentally occur often in a long double.  So I'm sticking with xor for long double for now.  If I see cases that could easily occur accidentally, this can always be changed in the future.

I settled on murmur2 instead of murmur3 because of the easy availability of the 32 bit and 64 bit versions.  I did not see a 64 bit murmur3, and figured murmur2 is good enough.  murmur2 won't actually ever be used when size_t is 64 bits.  It will only be activated for (unsigned) long long and double on 32 bit platforms.

Oh, except that I activated murmur2 for basic_string as well.  The tests I ran weren't that convincing.  However I also admit that there is a lot I don't know here, and people are asking for murmur.  And what I had was operating a character at a time and murmur operates a word at a time.   So done.

>> But I was wondering if it would be better to:
>> 
>>        return __murmur2<size_t>()(&__u, sizeof(__u));
> 
> That's actually a bit buggy for floats. On some platforms sizeof is
> larger than the actual size of the floating point type and includes
> some padding for alignment which can contain junk values (very common
> for long double). It isn't always guaranteed that equal floating point
> values have different representations (although IIRC it is for the
> IEEE apart from weird edge cases like 0, -0 and maybe the infinity
> values). Sometimes you get really weird representations, such as using
> the sum of two doubles for a long double. I think Apple used to do
> that for PowerPC macs.

I hit this bug twice in the past few days (once for float and once for long double).  I think it is solved now by initializing to 0 in the right places.

> 
> You can see my attempt at a portable hash function for floats in boost
> (boost/functional/hash/detail/hash_float_*.hpp), but it isn't great.
> I'm sure an expert could do better. I only used a binary hash for one
> particular setup where it was needed. I considered doing it elsewhere,
> but it would have taken too much maintenance. It's such an odd use
> case I'm not sure if anyone has ever used it.

Looked at your hash_combine.  Tested it against xor and murmur.  murmur impressed me the most with 1 bit being able to influence all other bits in the size_t, for both 32 and 64 bit platforms.  Your hash_float when sizeof(float) <= sizeof(size_t) is I believe identical to what is in there now.  Except you're using a pointer cast and I'm using a union (I was using a pointer cast too until Dave Zarzycki pointed out a nasty bug just a few days ago -- I don't see the same bug in your implementation).

>> My results so far are indicating that for arithmetic tests the use of murmur isn't significantly better than a simple xor.  I'm also not seeing any motivation at this time to run integral types with sizeof <= sizeof(size_t) through something line murmur, instead of just:
>> 
>>    return static_cast<size_t>(__v);
> 
> This is worth a read:
> 
> http://www.concentric.net/~ttwang/tech/inthash.htm

Thanks, that was a helpful read.

>> I'm looking at this because of http://llvm.org/bugs/show_bug.cgi?id=10971 .
> 
> His test seems very contrived. All hash functions are going to have
> worst cases, the question is how likely it is that they'll turn up.

After much thought and testing, I've commented and closed this bug as "behaves correctly".

Howard




More information about the cfe-dev mailing list