[cfe-commits] [patch] change libc++'s string hash function to cityhash64
Craig Silverstein
csilvers at google.com
Thu Dec 8 17:16:03 PST 2011
Below is a patch that changes libc++'s hash<string> to use cityhash64
(http://code.google.com/p/cityhash/) for machines where size_t is 64
bits. I did not change the code where size_t is 32 bits; it will
still use murmur2 for that. This is because cityhash64 needs a fast
64-bit x 64-bit multiply, which 32-bit systems are unlikely to have.
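The 32-bit/64-bit split described above can be expressed as a compile-time dispatch on sizeof(size_t). What follows is only a sketch of that idea, not the attached patch: the names string_hash_impl and hash_bytes are hypothetical, and FNV-1a stands in for both murmur2 and cityhash64 so that the example stays self-contained.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical dispatch helper: one specialization per size_t width.
// FNV-1a is only a stand-in; the patch keeps murmur2 for 32-bit size_t
// and uses cityhash64 for 64-bit size_t.
template <std::size_t SizeTBytes>
struct string_hash_impl;

template <>
struct string_hash_impl<4>
{
    std::uint32_t operator()(const std::string& s) const
    {
        std::uint32_t h = 2166136261u;              // FNV-1a 32-bit offset basis
        for (unsigned char c : s)
        {
            h ^= c;
            h *= 16777619u;                         // FNV-1a 32-bit prime
        }
        return h;
    }
};

template <>
struct string_hash_impl<8>
{
    std::uint64_t operator()(const std::string& s) const
    {
        std::uint64_t h = 14695981039346656037ull;  // FNV-1a 64-bit offset basis
        for (unsigned char c : s)
        {
            h ^= c;
            h *= 1099511628211ull;                  // FNV-1a 64-bit prime
        }
        return h;
    }
};

// Selects the specialization that matches this platform's size_t.
// A platform with any other size_t width fails to compile here.
inline std::size_t hash_bytes(const std::string& s)
{
    return string_hash_impl<sizeof(std::size_t)>()(s);
}
```

Because the choice is made through a template argument, only the specialization for the target's size_t is instantiated, so a 32-bit build would not carry the 64-bit code.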
I wrote code to test the change, based on Howard's code posted
earlier. It's below for reference. Here is the speed improvement from
using cityhash, on my Linux desktop machine:
model name : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
All statistics below are averaged over 3 runs.
RUN TIME (compiled with -O2):
old (murmur2): 0:09.20
new (cityhash64): 0:08.60 // 6.5% faster
This is corroborated by lower numbers for the collision rate, as
emitted by the test program (smaller numbers are better):
murmur2:
score = 1.3934
score = 1.42452
score = 1.35706
score = 1.47355
score = 1.59226
score = 1.4267
score = 1.494
score = 1.56537
score = 1.63725
score = 1.71011
cityhash64:
score = 1.38081
score = 1.3898
score = 1.2847
score = 1.38114
score = 1.47815
score = 1.28579
score = 1.33372
score = 1.38128
score = 1.42893
score = 1.47695
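For reference, the score is the average number of key comparisons per successful lookup: a bucket holding bs elements costs 1 + 2 + ... + bs = bs*(bs+1)/2 comparisons to find each of its elements once. A minimal sketch of that arithmetic on a plain list of bucket sizes follows. The helper name score_from_bucket_sizes is hypothetical, and returning 0 for an empty table is an arbitrary choice made for this example.

```cpp
#include <cstddef>
#include <vector>

// Average comparisons per successful lookup, given only the bucket sizes.
// Hypothetical helper for illustration; not part of the test program.
inline double score_from_bucket_sizes(const std::vector<std::size_t>& sizes)
{
    std::size_t elements = 0;
    std::size_t comparisons = 0;
    for (std::size_t bs : sizes)
    {
        elements += bs;
        comparisons += bs * (bs + 1) / 2;  // 1 + 2 + ... + bs
    }
    if (elements == 0)
        return 0.0;  // arbitrary choice for an empty table
    return static_cast<double>(comparisons) / elements;
}
```

Four elements spread one per bucket score 1.0, bucket sizes {2, 0, 1, 1} score 5/4 = 1.25, and all four elements in a single bucket score 10/4 = 2.5.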
The downside: cityhash64 is a non-trivial amount of code (it has
special-case code for different input sizes, for speed purposes).
Here is the comparison of compile time and resulting binary size. The
compile command I used was:
clang++ -std=c++0x -stdlib=libc++ -isystem`pwd`/../include -L`pwd`/../lib -Wl,-rpath `pwd`/../lib /var/tmp/libcxx_test.cc -o /var/tmp/libcxx_test
COMPILE TIME (no flags):
old (murmur2): 0:08.82
new (cityhash64): 0:08.95 // 1.5% slower
SIZE (no flags):
murmur2: 78121
cityhash64: 83082 // 6.4% bigger
SIZE (-g):
murmur2: 377745
cityhash64: 387554 // 2.6% bigger
SIZE (-g -O2):
murmur2: 326673
cityhash64: 342554 // 4.9% bigger
SIZE (-O2):
murmur2: 31841
cityhash64: 36082 // 13.3% bigger
My view, for what it's worth, is that the speed increase is worth
these costs. If the maintainers disagree, there are other points
along the speed/size tradeoff, such as removing the special-case code
for small strings or moving from cityhash64 to murmur64.
Here is the test code:
---
#include <unordered_set>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>

// Computes the average number of comparisons per lookup.
// A perfect hash will return 1.
template <class C>
float
grade(const C& c)
{
    using namespace std;
    if (c.size() <= 1)
        return 100;
    float score = 0;
    size_t bc = c.bucket_count();
    for (size_t i = 0; i != bc; ++i)
    {
        size_t bs = c.bucket_size(i);
        score += bs * (bs+1) / 2;
    }
    return score / c.size();
}

int main()
{
    using namespace std;
    typedef string T;
    vector<T> words;
    filebuf fb;
    fb.open("/usr/share/dict/words",ios::in);
    istream wordstream(&fb);
    // Test the extraction itself so a failed read never appends an empty word.
    for (string word; wordstream >> word; )
        words.push_back(word);
    // The loops below index words[0] through words[98499].
    if (words.size() < 98500)
    {
        cerr << "word list too short: " << words.size() << " words\n";
        return 1;
    }
    for (int repeat = 0; repeat < 200; ++repeat)
    {
        unordered_set<T> table;
        table.max_load_factor(1);
        // /usr/share/dict/words has 98569 words in it
        for (int i = 0; i < 10; ++i)
        {
            for (int j = 0; j < 9850; ++j)
            {
                table.insert(words[j*10+i]);
            }
            if (repeat == 0) cout << "score = " << grade(table) << '\n';
        }
    }
}
---
craig
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cityhash.patch
Type: text/text
Size: 9971 bytes
Desc: /var/tmp/cityhash.patch
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20111208/b8d6da05/attachment.bin>