[cfe-dev] RFC: change string hash function in libc++
Howard Hinnant
hhinnant at apple.com
Sun Dec 4 09:50:55 PST 2011
On Dec 4, 2011, at 6:11 AM, Daniel James wrote:
> On 4 December 2011 01:17, Howard Hinnant <hhinnant at apple.com> wrote:
>> I'm attempting to quantify the goodness of a hash function installed into a std::unordered_set/map. I.e. ultimately the number of collisions is determined not only by the hash function, but by the implementation of the unordered container as well.
>
> And of course benchmarks should also be made in the context of the
> hash container.
>
> IIUC these hash functions are designed to be used with containers
> whose number of buckets is a power of 2 which can be faster since they
> use bit operations to select the bucket. So it might be worth testing
> for that case as well as the current implementation. Doing that in an
> STL container would require some template thing to determine which
> strategy to use. I don't know if you would consider that acceptable.
The libc++ std::hash specializations are being designed/tested only with the libc++ unordered containers. The libc++ containers restrict the number of buckets to a prime (which the client can select). Perhaps I'm misunderstanding your point?
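For context, the two strategies differ only in how a hash value is reduced to a bucket index. A minimal sketch (not libc++ code; the helper names are mine):

#include <cstddef>

// Power-of-2 bucket count: the index comes from the low bits of the hash.
// A single AND, but it discards the high bits, so the hash function itself
// has to mix well.
inline std::size_t bucket_pow2(std::size_t h, std::size_t bucket_count)
{
    return h & (bucket_count - 1);  // requires bucket_count to be a power of 2
}

// Prime bucket count (what libc++ uses): the index comes from a modulo.
// Slower per operation, but every bit of the hash participates in the result.
inline std::size_t bucket_prime(std::size_t h, std::size_t bucket_count)
{
    return h % bucket_count;        // bucket_count is a prime
}

With a prime modulo, weak low bits in a hash value hurt far less than with masking, which is why the two designs place different demands on the hash function.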
>>
>> I've put together the following test function which scores a container at any given time during its lifetime based on the number of collisions it has. A score of 0.f indicates that all values are hashed into the same bucket. A score of 100.f indicates that there are no collisions.
>>
>> // A quantitative grade [0, 100] of how many collisions there
>> // are in an unordered_set/map.
>> //   0 == all elements in the same bucket
>> //   100 == no collisions
>> // A lower max_load_factor pushes this score towards 100
>> // A higher max_load_factor pushes this score towards 0
>> template <class C>
>> float
>> grade(const C& c)
>> {
>>     using namespace std;
>>     if (c.size() <= 1)
>>         return 100;
>>     float score = 0;
>>     size_t bc = c.bucket_count();
>>     for (size_t i = 0; i != bc; ++i)
>>     {
>>         size_t bs = c.bucket_size(i);
>>         if (bs > 1)
>>             score += bs - 1;
>>     }
>>     return 100*(1 - score / (c.size() - 1));
>> }
>>
>> I'm not at this time proposing putting this code into libc++, nor even into the test suite. I'm simply requesting comments on whether this is a reasonable tool for quantifying the goodness of an unordered container's state. I've been working with it for only a day now, and I'm happy with it so far. But I'm not infallible, so before I make decisions based on this metric, I'm opening up the opportunity to trash the metric.
>
> This metric underestimates the cost of multiple collisions in a single
> bucket. For example, it will give the same result for the case when
> you have 98 elements in bucket 1, 2 in bucket 2 as it would for 50
> elements in bucket 1, 50 in bucket 2. A better metric might be the
> expected number of element lookups to find an element in the table -
> it's equal to the sum of (bucket_size(i) * (bucket_size(i) + 1) / 2)
> for all buckets divided by the number of elements.
I like this suggestion a lot! I've switched to it.
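To make the difference concrete (my arithmetic, using the 98/2 vs. 50/50 example above): the old grade penalizes both splits identically (97 + 1 = 98 excess collisions either way), while the new metric gives (98*99/2 + 2*3/2)/100 ≈ 48.5 expected comparisons per lookup for the 98/2 split versus (2*50*51/2)/100 = 25.5 for the 50/50 split, so the badly skewed table now scores worse.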
> Another metric would be to place a set of elements in the container,
> and then measure the number of lookups required for a separate set of
> elements. Using real data is also a good idea; randomly generated data
> tends to hash to random buckets.
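For reference, the lookup cost for a separate probe set can be estimated with nothing but the standard bucket interface. A rough sketch (my own helper, not something from the thread): an unsuccessful find() has to scan the probe key's entire bucket, so bucket_size(bucket(key)) approximates the comparisons per miss.

#include <cstddef>

// Rough estimate of comparisons per lookup for probe keys that are *not*
// in the table: an unsuccessful find() walks the whole bucket the probe
// key hashes to.
template <class C, class Iter>
float
probe_grade(const C& c, Iter first, Iter last)
{
    float comparisons = 0;
    std::size_t n = 0;
    for (; first != last; ++first, ++n)
        comparisons += c.bucket_size(c.bucket(*first));
    return n ? comparisons / n : 0;
}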
So far I've just been testing hash<arithmetic types>, though by the end of the day I hope to have moved on to hash<string>. I'm open to concrete suggestions on data sets I should be testing. For floating point types I've been using:
mt19937_64 eng;
uniform_real_distribution<T> dist(numeric_limits<T>::min(),
                                  numeric_limits<T>::max());
And for integral types:
mt19937_64 eng;
uniform_int_distribution<T> dist(numeric_limits<T>::min(),
                                 numeric_limits<T>::max());
For example here is one test I'm running:
#include <unordered_set>
#include <random>
#include <limits>
#include <iostream>

// Computes the average number of comparisons per lookup.
// A perfect hash will return 1.
template <class C>
float
grade(const C& c)
{
    using namespace std;
    if (c.size() <= 1)
        return 1;
    float score = 0;
    size_t bc = c.bucket_count();
    for (size_t i = 0; i != bc; ++i)
    {
        size_t bs = c.bucket_size(i);
        score += bs * (bs+1) / 2;
    }
    return score / c.size();
}

int main()
{
    typedef long long T;
    using namespace std;
    mt19937_64 eng;
//  uniform_real_distribution<T> dist(numeric_limits<T>::min(),
//                                    numeric_limits<T>::max());
    uniform_int_distribution<T> dist(numeric_limits<T>::min(),
                                     numeric_limits<T>::max());
    unordered_set<T> table;
    table.max_load_factor(1);
    for (int i = 0; i < 100; ++i)
    {
        for (int j = 0; j < 10000; ++j)
            table.insert(dist(eng));
        cout << "score = " << grade(table) << '\n';
    }
}
One of the decisions I'm making based on this test is what to do with two (or four) size_t's when I need to combine them into one. What's checked in is:
return __u.__a ^ __u.__b;
But I was wondering if it would be better to:
return __murmur2<size_t>()(&__u, sizeof(__u));
My results so far indicate that for the arithmetic tests the use of murmur isn't significantly better than a simple xor. I'm also not seeing any motivation at this time to run integral types with sizeof <= sizeof(size_t) through something like murmur, instead of just:
return static_cast<size_t>(__v);
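To reproduce that comparison outside of libc++, here is a self-contained sketch of the two combining strategies. It is not the libc++ code: the struct is just an illustrative pair of words, and FNV-1a stands in for the __murmur2 byte hash.

#include <cstddef>
#include <cstdint>
#include <cstring>

struct two_words { std::size_t a; std::size_t b; };  // e.g. a 128-bit value split in two

// Strategy 1: combine with a simple xor (what's checked in).
inline std::size_t combine_xor(const two_words& u)
{
    return u.a ^ u.b;
}

// Strategy 2: run the raw bytes through a byte-oriented hash.
// FNV-1a is used here only as a stand-in for __murmur2.
inline std::size_t combine_bytes(const two_words& u)
{
    unsigned char buf[sizeof(u)];
    std::memcpy(buf, &u, sizeof(u));
    std::uint64_t h = 14695981039346656037ull;    // FNV offset basis
    for (std::size_t i = 0; i < sizeof(buf); ++i)
    {
        h ^= buf[i];
        h *= 1099511628211ull;                    // FNV prime
    }
    return static_cast<std::size_t>(h);
}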
I'm looking at this because of http://llvm.org/bugs/show_bug.cgi?id=10971 .
Howard