[cfe-dev] RFC: change string hash function in libc++
Jeffrey Yasskin
jyasskin at googlers.com
Fri Dec 2 16:08:03 PST 2011
On Fri, Dec 2, 2011 at 4:00 PM, Jeffrey Yasskin <jyasskin at googlers.com> wrote:
> On Fri, Dec 2, 2011 at 3:52 PM, Craig Silverstein <csilvers at google.com> wrote:
>> } What are the numbers behind these "performs better"?
>>
>> Austin Appleby (the murmurhash author) agreed to run some numbers.
>> Here's his report. The executive summary is that city is *much*
>> faster for strings >512 bytes, and murmur is a bit faster for strings
>> <512 bytes.
>>
>> --- Testing City128
>
> Sorry to nitpick, but http://code.google.com/p/cityhash/ says "we use
> variants of CityHash64() mainly in hash tables such as
> hash_map<string, int>. ... CityHash128() and similar return a 128-bit
> hash and are tuned for strings of at least a few hundred bytes.
> Depending on your compiler and hardware, it may be faster than
> CityHash64() on sufficiently long strings. It is known to be slower
> than necessary on shorter strings, but we expect that case to be
> relatively unimportant. Inside Google we use variants of CityHash128()
> mainly for code that wants to minimize collisions."
>
> So it's probably better to test City64 than City128 for a hash-table proposal.
Here's the cityhash performance claim from
http://code.google.com/p/cityhash/source/browse/trunk/README:
Performance on short strings
============================
For short strings, e.g., most hash table keys, CityHash64 is faster than
CityHash128, and probably faster than all the aforementioned functions,
depending on the mix of string lengths. Here are a few results from that
same hardware, where we (unrealistically) tested a single string length over
and over again:
Hash Results
------------------------------------------------------------------------------
CityHash64 7ns for 1 byte, or 6ns for 8 bytes, or 9ns for 64 bytes
Murmur2 (64-bit) 6ns for 1 byte, or 6ns for 8 bytes, or 15ns for 64 bytes
Murmur3F 14ns for 1 byte, or 15ns for 8 bytes, or 23ns for 64 bytes
Hash Quality
============
...
The CityHash functions have the following flaws according to SMHasher:
(1) CityHash64: some cases of poor avalanching, which we suspect don't matter in
practice, with some inputs <= 16 bytes.
(2) CityHash64WithSeed: one or two minor issues with inputs containing at most
two non-zero bytes and length <= 24 bytes.
More information about the cfe-dev
mailing list