[cfe-dev] RFC: change string hash function in libc++

Howard Hinnant hhinnant at apple.com
Sun Dec 4 09:50:55 PST 2011


On Dec 4, 2011, at 6:11 AM, Daniel James wrote:

> On 4 December 2011 01:17, Howard Hinnant <hhinnant at apple.com> wrote:
>> I'm attempting to quantify the goodness of a hash function installed into a std::unordered_set/map.  I.e. ultimately the number of collisions is determined not only by the hash function, but by the implementation of the unordered container as well.
> 
> And of course benchmarks should also be made in the context of the
> hash container.
> 
> IIUC these hash functions are designed to be used with containers
> whose number of buckets is a power of 2, which can be faster since they
> use bit operations to select the bucket. So it might be worth testing
> for that case as well as the current implementation. Doing that in an
> STL container would require some template thing to determine which
> strategy to use. I don't know if you would consider that acceptable.

The libc++ std::hash specializations are being designed/tested only with the libc++ unordered containers.  The libc++ containers restrict the number of buckets to a prime (which the client can select).  Perhaps I'm misunderstanding your point?
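
For concreteness, the two bucket-selection strategies being contrasted look roughly like this (these helpers are illustrative only, not the libc++ internals):

#include <cstddef>

// Prime bucket count (roughly the libc++ approach): pick the bucket by modulo.
inline std::size_t
constrain_to_prime(std::size_t h, std::size_t bc)   // bc is prime
{
    return h % bc;
}

// Power-of-two bucket count: pick the bucket by masking the low bits.
// Cheaper per lookup, but only the low bits of the hash participate.
inline std::size_t
constrain_to_pow2(std::size_t h, std::size_t bc)    // bc == 2^k
{
    return h & (bc - 1);
}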

>> 
>> I've put together the following test function which scores a container at any given time during its lifetime based on the number of collisions it has.  A score of 0.f indicates that all values are hashed into the same bucket.  A score of 100.f indicates that there are no collisions.
>> 
>> //  A quantitative grade [0, 100] of how many collisions there
>> //    are in an unordered_set/map.
>> //    0 == all elements in the same bucket
>> //  100 == no collisions
>> //  A lower max_load_factor pushes this score towards 100
>> //  A higher max_load_factor pushes this score towards 0
>> template <class C>
>> float
>> grade(const C& c)
>> {
>>    using namespace std;
>>    if (c.size() <= 1)
>>        return 100;
>>    float score = 0;
>>    size_t bc = c.bucket_count();
>>    for (size_t i = 0; i != bc; ++i)
>>    {
>>        size_t bs = c.bucket_size(i);
>>        if (bs > 1)
>>            score += bs - 1;
>>    }
>>    return 100*(1 - score / (c.size() - 1));
>> }
>> 
>> I'm not at this time proposing putting this code into libc++, nor even into the test suite.  I'm simply requesting comments on whether this is a reasonable tool for quantifying the goodness of an unordered container's state.  I've been working with it for only a day now, and I'm happy with it.  But I'm not infallible, so before I make decisions based on this metric, I'm opening up the opportunity to trash the metric.
> 
> This metric underestimates the cost of multiple collisions in a single
> bucket. For example, it will give the same result for the case when
> you have 98 elements in bucket 1, 2 in bucket 2 as it would for 50
> elements in bucket 1, 50 in bucket 2. A better metric might be the
> expected number of element lookups to find an element in the table -
> it's equal to the sum of (bucket_size(i) * (bucket_size(i) + 1) / 2)
> for all buckets divided by the number of elements.

I like this suggestion a lot!  I've switched to it.
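
To make that concrete: with 100 elements split 98/2 across two buckets, the new metric gives (98*99/2 + 2*3/2)/100 = 48.54 expected comparisons, while a 50/50 split gives 2*(50*51/2)/100 = 25.5.  My original score comes out at about 1.01 for both splits, so it indeed can't tell them apart.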

> Another metric would be to place a set of elements in the container,
> and then measure the number of lookups required for a separate set of
> elements. Using real data is also a good idea; randomly generated data
> tends to hash to random buckets.

So far I've just been testing hash<arithmetic types>, though by the end of the day I hope to have moved on to hash<string>.  I'm open to concrete suggestions on data sets I should be testing.  For floating point types I've been using:

    mt19937_64 eng;
    uniform_real_distribution<T> dist(numeric_limits<T>::min(), 
                                      numeric_limits<T>::max());

And for integral types:

    mt19937_64 eng;
    uniform_int_distribution<T> dist(numeric_limits<T>::min(), 
                                     numeric_limits<T>::max());
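
To act on your second suggestion, something along these lines could score lookups of a separate probe set against an already-populated table (just a sketch, using the same headers as above):

template <class C, class Iter>
float
probe_grade(const C& c, Iter first, Iter last)
{
    // Average occupancy of the bucket each probe key hashes to; for keys
    // not in the table this is the number of elements an unsuccessful
    // lookup would have to step over.
    std::size_t lookups = 0;
    std::size_t examined = 0;
    for (; first != last; ++first, ++lookups)
        examined += c.bucket_size(c.bucket(*first));
    return lookups ? static_cast<float>(examined) / lookups : 0;
}

Probing with keys drawn from real data, as you suggest, would exercise the hash harder than the uniform generators above.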

For example here is one test I'm running:

#include <unordered_set>
#include <random>
#include <iostream>

//  Computes the average number of comparisons per lookup.
//  A perfect hash will return 1.
template <class C>
float
grade(const C& c)
{
    using namespace std;
    if (c.size() <= 1)
        return 1;   // trivially perfect: at most one comparison per lookup
    float score = 0;
    size_t bc = c.bucket_count();
    for (size_t i = 0; i != bc; ++i)
    {
        size_t bs = c.bucket_size(i);
        score += bs * (bs+1) / 2;
    }
    return score / c.size();
}

int main()
{
    typedef long long T;
    using namespace std;
    mt19937_64 eng;
//     uniform_real_distribution<T> dist(numeric_limits<T>::min(), 
//                                       numeric_limits<T>::max());
    uniform_int_distribution<T> dist(numeric_limits<T>::min(), 
                                     numeric_limits<T>::max());
    unordered_set<T> table;
    table.max_load_factor(1);
    for (int i = 0; i < 100; ++i)
    {
        for (int j = 0; j < 10000; ++j)
            table.insert(dist(eng));
        cout << "score = " << grade(table) << '\n';
    }
}

One of the decisions I'm making based on this test is what to do with two (or four) size_t's when I need to combine them into one.  What's checked in is:

        return __u.__a ^ __u.__b;

But I was wondering if it would be better to:

        return __murmur2<size_t>()(&__u, sizeof(__u));

My results so far indicate that for these arithmetic tests the use of murmur isn't significantly better than a simple xor.  I'm also not seeing any motivation at this time to run integral types with sizeof <= sizeof(size_t) through something like murmur, instead of just:

    return static_cast<size_t>(__v);
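
Going back to the two-size_t case: for anyone who wants to reproduce the comparison without the libc++ internals handy, a rough standalone harness could look like the program below.  The mixer here is just a boost-style hash_combine stand-in rather than __murmur2 itself, and the grade() metric is the same one as above, repeated so the program compiles on its own.

#include <unordered_set>
#include <random>
#include <iostream>
#include <utility>
#include <cstddef>

// Combine the two halves with a plain xor.
struct xor_hash
{
    std::size_t operator()(const std::pair<std::size_t, std::size_t>& p) const
    {
        return p.first ^ p.second;
    }
};

// Combine the two halves with a boost-style hash_combine mixer
// (a stand-in for __murmur2, not the same function).
struct mix_hash
{
    std::size_t operator()(const std::pair<std::size_t, std::size_t>& p) const
    {
        std::size_t seed = p.first;
        seed ^= p.second + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

//  Same metric as above: average number of comparisons per lookup.
template <class C>
float
grade(const C& c)
{
    if (c.size() <= 1)
        return 1;
    float score = 0;
    for (std::size_t i = 0; i != c.bucket_count(); ++i)
    {
        std::size_t bs = c.bucket_size(i);
        score += bs * (bs + 1) / 2;
    }
    return score / c.size();
}

template <class Hash>
float
run_combiner_test()
{
    using namespace std;
    mt19937_64 eng;                                  // same seed for both runs,
    uniform_int_distribution<size_t> dist;           // full range of size_t
    unordered_set<pair<size_t, size_t>, Hash> table;
    table.max_load_factor(1);
    for (int i = 0; i < 1000000; ++i)
        table.insert(make_pair(dist(eng), dist(eng)));
    return grade(table);
}

int main()
{
    std::cout << "xor score = " << run_combiner_test<xor_hash>() << '\n';
    std::cout << "mix score = " << run_combiner_test<mix_hash>() << '\n';
}

With uniformly random halves the two strategies should come out essentially the same; the difference, if any, would show up with correlated or low-entropy halves.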

I'm looking at this because of http://llvm.org/bugs/show_bug.cgi?id=10971 .

Howard

