[cfe-commits] [patch] change libc++'s string hash function to cityhash64

Craig Silverstein csilvers at google.com
Thu Dec 8 17:16:03 PST 2011


Below is a patch that changes libc++'s hash<string> to use cityhash64
(http://code.google.com/p/cityhash/) for machines where size_t is 64
bits.  I did not change the code where size_t is 32 bits; it will
still use murmur2 for that.  This is because cityhash64 needs a fast
64-bit x 64-bit multiply, which 32-bit systems are unlikely to have.
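
For readers who want to picture the split described above, here is a
minimal sketch of selecting a hash implementation at compile time from
the width of size_t.  It is not the code in the patch: the names
string_hash_impl, hash_string, hash32_standin and hash64_standin are
hypothetical, and the two stand-in functions are plain FNV-1a, used
only so the example is self-contained.  They take the place of murmur2
and cityhash64.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the 32-bit hash (murmur2 in the patch).
// This is 32-bit FNV-1a, chosen only to keep the sketch short.
inline std::uint32_t hash32_standin(const char* p, std::size_t n) {
    std::uint32_t h = 2166136261u;           // FNV-1a 32-bit offset basis
    for (std::size_t i = 0; i < n; ++i) {
        h ^= static_cast<unsigned char>(p[i]);
        h *= 16777619u;                      // FNV-1a 32-bit prime
    }
    return h;
}

// Hypothetical stand-in for the 64-bit hash (cityhash64 in the patch).
// This is 64-bit FNV-1a; note the 64-bit multiply in the loop.
inline std::uint64_t hash64_standin(const char* p, std::size_t n) {
    std::uint64_t h = 14695981039346656037ull;  // FNV-1a 64-bit offset basis
    for (std::size_t i = 0; i < n; ++i) {
        h ^= static_cast<unsigned char>(p[i]);
        h *= 1099511628211ull;                  // FNV-1a 64-bit prime
    }
    return h;
}

// Primary template, selected by sizeof(size_t) in bytes.  Only the
// 4-byte and 8-byte specializations are defined, so any other width
// is a compile-time error.
template <std::size_t Width = sizeof(std::size_t)>
struct string_hash_impl;

template <>
struct string_hash_impl<4> {
    static std::size_t run(const char* p, std::size_t n) {
        return hash32_standin(p, n);
    }
};

template <>
struct string_hash_impl<8> {
    static std::size_t run(const char* p, std::size_t n) {
        return static_cast<std::size_t>(hash64_standin(p, n));
    }
};

// Entry point: picks the specialization matching this platform's size_t.
inline std::size_t hash_string(const std::string& s) {
    return string_hash_impl<>::run(s.data(), s.size());
}
```

Calling hash_string("some key") uses the 8-byte specialization where
size_t is 64 bits and the 4-byte one where it is 32 bits, with no
runtime branch.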

I wrote code to test the change, based on Howard's code posted
earlier.  It's below for reference.  Here is the speed improvement for
using cityhash, on my Linux desktop machine:
   model name      : Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz
All statistics below are averaged over 3 runs.

RUN TIME (compiled with -O2):
   old (murmur2):    0:09.20
   new (cityhash64): 0:08.60   // 6.5% faster

This is corroborated by lower collision scores (the average number of
comparisons per lookup), as emitted by the test program (smaller
numbers are better; a perfect hash scores 1):

murmur2:
score = 1.3934
score = 1.42452
score = 1.35706
score = 1.47355
score = 1.59226
score = 1.4267
score = 1.494
score = 1.56537
score = 1.63725
score = 1.71011

cityhash64:
score = 1.38081
score = 1.3898
score = 1.2847
score = 1.38114
score = 1.47815
score = 1.28579
score = 1.33372
score = 1.38128
score = 1.42893
score = 1.47695

The downside: cityhash64 is a non-trivial amount of code (it has
special-case code for different input sizes, for speed purposes).
Here is the comparison of compile time and resulting binary size.  The
compile command I used was:
   clang++ -std=c++0x -stdlib=libc++ -isystem`pwd`/../include -L`pwd`/../lib -Wl,-rpath `pwd`/../lib /var/tmp/libcxx_test.cc -o /var/tmp/libcxx_test

COMPILE TIME (no flags):
old (murmur2): 0:08.82
new (cityhash64): 0:08.95   // 1.5% slower

SIZE (no flags):
murmur2: 78121
cityhash64: 83082   // 6.4% bigger

SIZE (-g):
murmur2: 377745
cityhash64: 387554  // 2.6% bigger

SIZE (-g -O2):
murmur2: 326673
cityhash64: 342554   // 4.9% bigger

SIZE (-O2):
murmur2: 31841
cityhash64: 36082    // 13.3% bigger

My view, for what it's worth, is that the speed increase is worth
these costs.  If the maintainers disagree, there are other points
along the speed/size tradeoff, including removing the special-case
code for small strings, or moving from cityhash64 to murmur64.

Here is the test code:
---
#include <unordered_set>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>

//  Computes the average number of comparisons per lookup.
//  A perfect hash will return 1.
template <class C>
float
grade(const C& c)
{
    using namespace std;
    if (c.size() <= 1)
        return 100;
    float score = 0;
    size_t bc = c.bucket_count();
    for (size_t i = 0; i != bc; ++i)
    {
        size_t bs = c.bucket_size(i);
        score += bs * (bs+1) / 2;
    }
    return score / c.size();
}

int main()
{
    using namespace std;
    typedef string T;
    vector<T> words;
    filebuf fb;
    fb.open("/usr/share/dict/words",ios::in);
    istream wordstream(&fb);
    // Test the extraction itself so that the failed read at end-of-file
    // does not push an extra empty word.
    for (string word; wordstream >> word; )
    {
        words.push_back(word);
    }
    for (int repeat = 0; repeat < 200; ++repeat)
    {
        unordered_set<T> table;
        table.max_load_factor(1);
        // /usr/share/dict/words has 98569 words in it
        for (int i = 0; i < 10; ++i)
        {
            for (int j = 0; j < 9850; ++j)
            {
                table.insert(words[j*10+i]);
            }
            if (repeat == 0) cout << "score = " << grade(table) << '\n';
        }
    }
}
---
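
To make the score concrete, here is a small sketch that computes the
same quantity as grade() from a plain list of bucket sizes instead of
a live unordered_set.  The function name score_from_bucket_sizes is
hypothetical, and returning 0 for an empty table is this sketch's own
convention (grade() returns 100 for tables of size 0 or 1).

```cpp
#include <cstddef>
#include <vector>

// Average number of comparisons per successful lookup, given only the
// bucket sizes.  Finding the k-th element of a bucket costs k
// comparisons, so a bucket holding bs elements contributes
// 1 + 2 + ... + bs = bs*(bs+1)/2 comparisons in total.
inline double score_from_bucket_sizes(const std::vector<std::size_t>& sizes) {
    std::size_t total = 0;        // number of stored elements
    std::size_t comparisons = 0;  // comparisons to look up each element once
    for (std::size_t bs : sizes) {
        total += bs;
        comparisons += bs * (bs + 1) / 2;
    }
    if (total == 0)
        return 0.0;  // sketch-only convention for an empty table
    return static_cast<double>(comparisons) / total;
}
```

With four elements spread one per bucket, sizes {1, 1, 1, 1}, the
score is 4/4 = 1.  With one collision, sizes {2, 0, 1, 1}, it is
(3 + 0 + 1 + 1)/4 = 1.25.  With everything in one bucket, sizes
{4, 0, 0, 0}, it is 10/4 = 2.5.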

craig
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cityhash.patch
Type: text/text
Size: 9971 bytes
Desc: /var/tmp/cityhash.patch
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20111208/b8d6da05/attachment.bin>

