[cfe-dev] RFC: change string hash function in libc++

Jeffrey Yasskin jyasskin at googlers.com
Fri Dec 2 16:08:03 PST 2011


On Fri, Dec 2, 2011 at 4:00 PM, Jeffrey Yasskin <jyasskin at googlers.com> wrote:
> On Fri, Dec 2, 2011 at 3:52 PM, Craig Silverstein <csilvers at google.com> wrote:
>> } What are the numbers behind these "performs better"?
>>
>> Austin Appleby (the murmurhash author) agreed to run some numbers.
>> Here's his report.  The executive summary is that city is *much*
>> faster for strings >512 bytes, and murmur is a bit faster for strings
>> <512 bytes.
>>
>> --- Testing City128
>
> Sorry to nitpick, but http://code.google.com/p/cityhash/ says "we use
> variants of CityHash64() mainly in hash tables such as
> hash_map<string, int>. ... CityHash128() and similar return a 128-bit
> hash and are tuned for strings of at least a few hundred bytes.
> Depending on your compiler and hardware, it may be faster than
> CityHash64() on sufficiently long strings. It is known to be slower
> than necessary on shorter strings, but we expect that case to be
> relatively unimportant. Inside Google we use variants of CityHash128()
> mainly for code that wants to minimize collisions."
>
> So it's probably better to test City64 than City128 for a hash-table proposal.

Here's the cityhash performance claim from
http://code.google.com/p/cityhash/source/browse/trunk/README:

Performance on short strings
============================

For short strings, e.g., most hash table keys, CityHash64 is faster than
CityHash128, and probably faster than all the aforementioned functions,
depending on the mix of string lengths.  Here are a few results from that
same hardware, where we (unrealistically) tested a single string length over
and over again:

Hash              Results
------------------------------------------------------------------------------
CityHash64        7ns for 1 byte, or 6ns for 8 bytes, or 9ns for 64 bytes
Murmur2 (64-bit)  6ns for 1 byte, or 6ns for 8 bytes, or 15ns for 64 bytes
Murmur3F          14ns for 1 byte, or 15ns for 8 bytes, or 23ns for 64 bytes

Hash Quality
============

...

The CityHash functions have the following flaws according to SMHasher:

(1) CityHash64: some cases of poor avalanching, which we suspect don't matter in
practice, with some inputs <= 16 bytes.

(2) CityHash64WithSeed: one or two minor issues with inputs containing at most
two non-zero bytes and length <= 24 bytes.




More information about the cfe-dev mailing list