[cfe-dev] RFC: change string hash function in libc++

Daniel James dnljms at gmail.com
Sun Dec 4 03:11:53 PST 2011


On 4 December 2011 01:17, Howard Hinnant <hhinnant at apple.com> wrote:
> I'm attempting to quantify the goodness of a hash function installed into a std::unordered_set/map.  That is, the number of collisions is ultimately determined not only by the hash function, but also by the implementation of the unordered container.

And of course benchmarks should also be made in the context of the
hash container.

IIUC these hash functions are designed to be used with containers
whose bucket count is a power of 2, which can be faster because the
bucket can be selected with a bitwise mask instead of a modulo. So it
might be worth testing that case as well as the current
implementation. Doing that in an STL container would require some
template mechanism to select the bucket-placement strategy; I don't
know whether you would consider that acceptable.
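
For example, here's a minimal untested sketch of the two
bucket-selection strategies (the function names are mine, just for
illustration):

#include <cstddef>

// Bucket selection when the bucket count is a power of 2: the modulo
// reduces to a bitwise AND, avoiding an integer division.
std::size_t bucket_pow2(std::size_t hash, std::size_t bucket_count)
{
    // Only valid when bucket_count is a power of 2.
    return hash & (bucket_count - 1);
}

// The general case, for an arbitrary bucket count.
std::size_t bucket_mod(std::size_t hash, std::size_t bucket_count)
{
    return hash % bucket_count;
}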

> I've put together the following test function which scores a container at any given time during its lifetime based on the number of collisions it has.  A score of 0.f indicates that all values are hashed into the same bucket.  A score of 100.f indicates that there are no collisions.
>
> //  A quantitative grade [0, 100] of how many collisions there
> //    are in an unordered_set/map.
> //    0 == all elements in the same bucket
> //  100 == no collisions
> //  A lower max_load_factor pushes this score towards 100
> //  A higher max_load_factor pushes this score towards 0
> template <class C>
> float
> grade(const C& c)
> {
>    using namespace std;
>    if (c.size() <= 1)
>        return 100;
>    float score = 0;
>    size_t bc = c.bucket_count();
>    for (size_t i = 0; i != bc; ++i)
>    {
>        size_t bs = c.bucket_size(i);
>        if (bs > 1)
>            score += bs - 1;
>    }
>    return 100*(1 - score / (c.size() - 1));
> }
>
> I'm not at this time proposing putting this code into libc++, nor even into the test suite.  I'm simply requesting comments on whether this is a reasonable tool for quantifying the goodness of an unordered container's state.  I've been working with it for only a day now, and so far I'm happy with it.  But I'm not infallible, so before I make decisions based on this metric, I'm opening up the opportunity to trash the metric.

This metric underestimates the cost of multiple collisions in a
single bucket. For example, it gives the same score when 98 elements
are in bucket 1 and 2 are in bucket 2 as when 50 elements are in each
of two buckets. A better metric might be the expected number of
element comparisons needed to find an element in the table: the sum
of bucket_size(i) * (bucket_size(i) + 1) / 2 over all buckets,
divided by the number of elements. A quick sketch follows.
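
Here's an untested sketch of that metric, in the same style as
grade(). For the example above it gives 48.54 for the 98/2 split but
25.5 for the 50/50 split, so it distinguishes the two cases; 1.0 is
the best possible score.

#include <cstddef>

//  Expected number of element comparisons for a successful lookup,
//  assuming each stored element is equally likely to be searched for.
template <class C>
double
expected_lookups(const C& c)
{
    if (c.empty())
        return 0;
    std::size_t sum = 0;
    std::size_t bc = c.bucket_count();
    for (std::size_t i = 0; i != bc; ++i)
    {
        std::size_t bs = c.bucket_size(i);
        // The j-th element in a bucket costs j comparisons, so a
        // bucket of size bs contributes bs * (bs + 1) / 2 in total.
        sum += bs * (bs + 1) / 2;
    }
    return static_cast<double>(sum) / c.size();
}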

Another metric would be to place one set of elements in the
container, and then measure the number of comparisons required to
look up a separate set of elements (see the sketch below). Using real
data is also a good idea; randomly generated data tends to hash to
random buckets, which flatters even a weak hash function.
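
Something along these lines, as an untested sketch ('probes' being a
hypothetical set of keys disjoint from the container's contents):

#include <cstddef>

//  Average number of element comparisons for unsuccessful lookups:
//  each miss has to scan its entire target bucket.
template <class C, class Keys>
double
probe_cost(const C& c, const Keys& probes)
{
    if (c.bucket_count() == 0 || probes.size() == 0)
        return 0;
    std::size_t comparisons = 0;
    for (const auto& k : probes)
        comparisons += c.bucket_size(c.bucket(k));
    return static_cast<double>(comparisons) / probes.size();
}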



